Suraponn committed
Commit 56e5bed · verified · 1 Parent(s): 352e0bc

update_readme

Files changed (1)
  1. README.md +246 -166
README.md CHANGED

---
library_name: transformers
language:
- en
- th
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- OCR
- vision-language
- document-understanding
- multilingual
license: apache-2.0
---

**Typhoon-OCR-3B**: A bilingual document parsing model built specifically for real-world documents in Thai and English, inspired by models like olmOCR and based on Qwen2.5-VL-Instruct.

**Try our demo on [Demo](https://ocr.opentyphoon.ai/)**

**Code / Examples available on [Github](https://github.com/scb-10x/typhoon-ocr)**

**Release Blog available on [OpenTyphoon Blog](https://opentyphoon.ai/blog/en/typhoon-ocr-release)**

*Remark: This model is intended to be used with a specific prompt only; it will not work with any other prompts.*

## **Real-World Document Support**

**1. Structured Documents**: Financial reports, Academic papers, Books, Government forms

**Output format**:
- Markdown for general text
- HTML for tables (including merged cells and complex layouts)
- Figures, charts, and diagrams are represented using figure tags for structured visual understanding (an illustrative sketch follows below)

**Each figure undergoes multi-layered interpretation**:
- **Observation**: Detects elements like landscapes, buildings, people, logos, and embedded text
- **Context Analysis**: Infers context such as location, event, or document section
- **Text Recognition**: Extracts and interprets embedded text (e.g., chart labels, captions) in Thai or English
- **Artistic & Structural Analysis**: Captures layout style, diagram type, or design choices contributing to document tone
- **Final Summary**: Combines all insights into a structured figure description for tasks like summarization and retrieval
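
To make the output format concrete, here is a minimal, hypothetical sketch. The sample document content below is invented for illustration, but the shape follows the prompts later in this card: the raw model response is a JSON object with a single `natural_text` key whose value is the reconstructed document, with tables in HTML and figures wrapped in `<figure>` tags.

```python
import json

# Hypothetical raw model response for a one-page report (illustrative only).
raw_response = (
    '{"natural_text": "# Quarterly Summary\\n\\n'
    '<table><tr><th>Item</th><th>Q1</th><th>Q2</th></tr>'
    '<tr><td>Revenue</td><td>1,200</td><td>1,350</td></tr></table>\\n\\n'
    '<figure>Bar chart comparing quarterly revenue; labels in Thai and English.</figure>"}'
)

# The reconstructed document sits under the `natural_text` key.
markdown = json.loads(raw_response)["natural_text"]
print(markdown)
```

The `ocr_document` helper shown in the usage examples below returns this markdown directly, so the JSON unwrapping above is only relevant when calling the API or the model manually.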

**2. Layout-Heavy & Informal Documents**: Receipts, Menus, Papers, Tickets, Infographics

**Output format**:
- Markdown with embedded tables and layout-aware structures

## Performance

![finance performance](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/eval_finance.png)
![gov performance](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/eval_gov.png)
![book performance](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/eval_books.png)

## Summary of Findings

Typhoon OCR outperforms both GPT-4o and Gemini 2.5 Flash in Thai document understanding, particularly on documents with complex layouts and mixed-language content.

However, in the Thai books benchmark, performance declined slightly due to the high frequency and diversity of embedded figures. These images vary significantly in type and structure, which poses challenges for our current figure-tag parsing. This highlights a potential area for future improvement, specifically enhancing the model's image understanding capabilities.

For this version, our primary focus has been on achieving high-quality OCR for both English and Thai text. Future releases may extend support to more advanced image analysis and figure interpretation.

## Usage Example

**(Recommended): Full inference code available on [Colab](https://colab.research.google.com/drive/1z4Fm2BZnKcFIoWuyxzzIIIn8oI2GKl3r?usp=sharing)**

**(Recommended): Using Typhoon-OCR Package**

```bash
pip install typhoon-ocr
```

```python
from typhoon_ocr import ocr_document

# Set the TYPHOON_OCR_API_KEY or OPENAI_API_KEY environment variable before calling this function.
markdown = ocr_document("test.png")
print(markdown)
```
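
If you prefer to configure the key from Python rather than in your shell, a minimal sketch looks like this (the environment variable name comes from the comment above; the key value is a placeholder):

```python
import os

# Placeholder value; replace with your actual API key.
os.environ["TYPHOON_OCR_API_KEY"] = "<your-api-key>"

from typhoon_ocr import ocr_document

print(ocr_document("test.png"))
```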
 
**(Recommended): Local Model via vllm (GPU Required)**:

```bash
pip install vllm
vllm serve scb10x/typhoon-ocr-7b --max-model-len 32000 --served-model-name typhoon-ocr-preview # OpenAI compatible at http://localhost:8000 (or another port)
# then you can supply base_url to ocr_document
```

```python
from typhoon_ocr import ocr_document

markdown = ocr_document('image.png', base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)
```

Read more in the [vllm quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html).

**Run Manually**

Below is a partial snippet. You can run inference using either the API or a local model.

*API*:
```python
import base64
from io import BytesIO
from typing import Callable

from openai import OpenAI
from PIL import Image
from typhoon_ocr.ocr_utils import render_pdf_to_base64png, get_anchor_text

PROMPTS_SYS = {
    "default": lambda base_text: (f"Below is an image of a document page along with its dimensions. "
        f"Simply return the markdown representation of this document, presenting tables in markdown format as they naturally appear.\n"
        f"If the document contains images, use a placeholder like dummy.png for each image.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"),
    "structure": lambda base_text: (
        f"Below is an image of a document page, along with its dimensions and possibly some raw textual content previously extracted from it. "
        f"Note that the text extraction may be incomplete or partially missing. Carefully consider both the layout and any available text to reconstruct the document accurately.\n"
        f"Your task is to return the markdown representation of this document, presenting tables in HTML format as they naturally appear.\n"
        f"If the document contains images or figures, analyze them and include the tag <figure>IMAGE_ANALYSIS</figure> in the appropriate location.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    ),
}

def get_prompt(prompt_name: str) -> Callable[[str], str]:
    """
    Fetches the system prompt based on the provided PROMPT_NAME.

    :param prompt_name: The identifier for the desired prompt.
    :return: The prompt template as a callable.
    """
    return PROMPTS_SYS.get(prompt_name, lambda x: "Invalid PROMPT_NAME provided.")


# Example inputs: the PDF to parse, the page number, and the prompt to use.
filename = "document.pdf"
page_num = 1
task_type = "default"

# Render the first page to base64 PNG and then load it into a PIL image.
image_base64 = render_pdf_to_base64png(filename, page_num, target_longest_image_dim=1800)
image_pil = Image.open(BytesIO(base64.b64decode(image_base64)))

# Extract anchor text from the PDF (first page)
anchor_text = get_anchor_text(filename, page_num, pdf_engine="pdfreport", target_length=8000)

# Retrieve and fill in the prompt template with the anchor_text
prompt_template_fn = get_prompt(task_type)
PROMPT = prompt_template_fn(anchor_text)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": PROMPT},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
    ],
}]

# Send messages to an OpenAI-compatible API.
openai = OpenAI(base_url="https://api.opentyphoon.ai/v1", api_key="TYPHOON_API_KEY")
response = openai.chat.completions.create(
    model="typhoon-ocr-preview",
    messages=messages,
    max_tokens=16384,
    temperature=0.1,
    top_p=0.6,
    extra_body={
        "repetition_penalty": 1.2,
    },
)
text_output = response.choices[0].message.content
print(text_output)
```

*(Not Recommended): Local Model - Transformers (GPU Required)*:
```python
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# `messages` and `image_base64` are prepared exactly as in the API example above.

# Initialize the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("scb10x/typhoon-ocr-7b", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("scb10x/typhoon-ocr-7b")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.1,
    max_new_tokens=12000,
    num_return_sequences=1,
    repetition_penalty=1.2,
    do_sample=True,
)

# Decode only the newly generated tokens
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)
print(text_output[0])
```

## Prompting

This model only works with the specific prompts defined below, where `{base_text}` refers to information extracted from the PDF metadata using the `get_anchor_text` function from the `typhoon-ocr` package. It will not function correctly with any other prompts.

```python
PROMPTS_SYS = {
    "default": lambda base_text: (f"Below is an image of a document page along with its dimensions. "
        f"Simply return the markdown representation of this document, presenting tables in markdown format as they naturally appear.\n"
        f"If the document contains images, use a placeholder like dummy.png for each image.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"),
    "structure": lambda base_text: (
        f"Below is an image of a document page, along with its dimensions and possibly some raw textual content previously extracted from it. "
        f"Note that the text extraction may be incomplete or partially missing. Carefully consider both the layout and any available text to reconstruct the document accurately.\n"
        f"Your task is to return the markdown representation of this document, presenting tables in HTML format as they naturally appear.\n"
        f"If the document contains images or figures, analyze them and include the tag <figure>IMAGE_ANALYSIS</figure> in the appropriate location.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    ),
}
```
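
For example, a prompt can be filled with anchor text extracted by the `typhoon-ocr` utilities, using the `PROMPTS_SYS` dictionary above (a small sketch; the file name and page number are placeholders):

```python
from typhoon_ocr.ocr_utils import get_anchor_text

# Extract anchor text from page 1 of a local PDF (placeholder path).
anchor_text = get_anchor_text("document.pdf", 1, pdf_engine="pdfreport", target_length=8000)

# Fill the `structure` template; the anchor text replaces {base_text}.
prompt = PROMPTS_SYS["structure"](anchor_text)
print(prompt[:500])
```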

### Generation Parameters

We suggest the following generation parameters. Since this is an OCR model, we do not recommend a high temperature: keep it at 0 or 0.1, not higher.

```python
temperature=0.1,
top_p=0.6,
repetition_penalty=1.2,
```

## Hosting

We recommend serving typhoon-ocr with [vllm](https://github.com/vllm-project/vllm) rather than Hugging Face transformers, and using the typhoon-ocr library to OCR documents. See the [vllm quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) for details.

```bash
pip install vllm
vllm serve scb10x/typhoon-ocr-7b --max-model-len 32000 --served-model-name typhoon-ocr-preview # OpenAI compatible at http://localhost:8000
# then you can supply base_url to ocr_document
```
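
Once the server is running, you can optionally confirm that the model is being served before pointing `ocr_document` at it. A minimal sketch using only the standard library (the route is the model-listing endpoint exposed by OpenAI-compatible servers such as vllm):

```python
import urllib.request

# List the models served at the local OpenAI-compatible endpoint.
print(urllib.request.urlopen("http://localhost:8000/v1/models").read().decode())
```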

```python
from typhoon_ocr import ocr_document

markdown = ocr_document('image.png', base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)
```

## **Intended Uses & Limitations**

This is a task-specific model intended to be used only with the provided prompts. It does not include any guardrails or VQA capability. Due to the nature of large language models (LLMs), a certain level of hallucination may occur. We recommend that developers carefully assess these risks in the context of their specific use case.

## **Follow us**

**https://twitter.com/opentyphoon**

## **Support**

**https://discord.gg/us5gAYmrxw**

## **Citation**

- If you find Typhoon2 useful for your work, please cite it using:

```
@misc{typhoon2,
    title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
    author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
    year={2024},
    eprint={2412.13702},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.13702},
}
```