---
license: mit
base_model: Qwen/Qwen2.5-VL-7B
tags:
- vision-language
- document-to-markdown
- reinforcement-learning
- grpo
- qwen2.5
- markdown
model_name: NuMarkdown-reasoning
library_name: transformers
pipeline_tag: text-generation
---


---

# NuMarkdown-reasoning 📄

**NuMarkdown-reasoning** is the first reasoning vision-language model trained specifically to convert documents into clean GitHub-flavoured Markdown. It is a fine-tune of **Qwen 2.5-VL-7B** on ~10k synthetic Doc-to-Reasoning-to-Markdown pairs, followed by an RL phase (GRPO) with a layout-centric reward.

*(Note: the number of thinking tokens can range from 20% to 5× the number of tokens in the final answer.)*

## Results

**NuMarkdown-reasoning** is significantly better than similarly sized non-reasoning models trained for Markdown generation on complex documents, and achieves competitive results against top closed-source alternatives.

### Arena ranking (using the TrueSkill 2 ranking system, with around 500 votes):

| Rank | Model                              | μ     | σ    | μ − 3σ |
| ---- | ---------------------------------- | ----- | ---- | ------ |
| 🥇 1 | **gemini-flash-reasoning**         | 26.75 | 0.80 | 24.35  |
| 🥈 2 | **NuMarkdown-reasoning**           | 26.10 | 0.79 | 23.72  |
| 🥉 3 | **NuMarkdown-reasoning-w/o\_grpo** | 25.32 | 0.80 | 22.93  |
| 4    | **OCRFlux-3B**                     | 24.63 | 0.80 | 22.22  |
| 5    | **gpt-4o**                         | 24.48 | 0.80 | 22.08  |
| 6    | **gemini-flash-w/o\_reasoning**    | 24.11 | 0.79 | 21.74  |
| 7    | **RolmoOCR**                       | 23.53 | 0.82 | 21.07  |
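The last column of the table is a conservative skill estimate: the mean rating minus three standard deviations. As a minimal sketch, the snippet below recomputes it from the rounded μ and σ values shown in the table. The function name `conservative_score` and the `ARENA` list are illustrative, not part of any released tooling, and small differences (about 0.01) from the reported column are expected because the published figures were presumably derived from unrounded values.

```python
# Rows copied from the arena table: (model, mu, sigma, reported mu - 3*sigma).
ARENA = [
    ("gemini-flash-reasoning", 26.75, 0.80, 24.35),
    ("NuMarkdown-reasoning", 26.10, 0.79, 23.72),
    ("NuMarkdown-reasoning-w/o_grpo", 25.32, 0.80, 22.93),
    ("OCRFlux-3B", 24.63, 0.80, 22.22),
    ("gpt-4o", 24.48, 0.80, 22.08),
    ("gemini-flash-w/o_reasoning", 24.11, 0.79, 21.74),
    ("RolmoOCR", 23.53, 0.82, 21.07),
]


def conservative_score(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower-bound skill estimate: mean minus k standard deviations."""
    return mu - k * sigma


if __name__ == "__main__":
    for name, mu, sigma, reported in ARENA:
        score = conservative_score(mu, sigma)
        print(f"{name:32s} {score:6.2f} (reported {reported:.2f})")
```

Ranking by this lower bound rather than by μ alone penalises models whose rating is still uncertain, which matters with only around 500 votes.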

*We plan to release a Markdown arena, similar to llmArena, for complex document-to-Markdown tasks, to provide a tool for evaluating different solutions.*

### Win rate against other models (image-only):

## Training

1. **SFT**: One epoch of supervised fine-tuning on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
2. **RL (GRPO)**: Reinforcement learning with a layout-centric reward on 5K difficult image examples.

## Example:

```
1. **Analyze the overall structure:** The document consists of two main sections, each containing a header, a set of bullet points, a title, a table, and a footer. The layout is consistent within each section but different between the two.

2. **Handle the first section header and bullet points:**
    - The top left corner has "Generalitat de Catalunya Departament d'Educació Institut Gal·lecs". This is a consistent header and should be transcribed as a level 3 heading.
    - The top right corner has "Curs 2021-22". This is a distinct header and should be transcribed as a level 3 heading.
    - Below the header, there are four bullet points. These are standard list items. I will use Markdown's unordered list syntax (`- `).

3. **Process the first section's main content - Table 1:**
    - **Identify table structure:** This is a complex table with a clear header row: "MATERIAL", "TÍTOL", "AUTOR", "EDITORIAL", "ISBN".
    - **Determine columns:** There are 5 columns.
    - **Handle merged cells:**
        - "Llengua Catalana" spans two rows.
        - "Llengua Castellana" spans three rows.
        - "Anglès" spans two rows.
    - **Populate table data:** Transcribe all text, including multi-line content within cells (e.g., "Babulinka Books\nEditorialCol·lecció: Pettes\nJoies per a Grans Lectors.\n2013").
    - **Markdown table creation:** Use standard Markdown table syntax, ensuring correct alignment and handling of multi-line content within cells.

4. **Handle the first section's footer:**
    - The footer contains "Codí: 04mp02", "Responsable: Coordinador Qualitat", "Versió: 5", "Full d'Informació a l'alumnat i famílies", "Aquest document pot quedar obsolet una vegada imprès", and "Pàgina 1 de 2". This is a simple text block at the bottom. I will transcribe it as a separate paragraph or list, maintaining the original layout if possible, but a simple paragraph is sufficient for a footer.

5. **Handle the second section header and bullet points:**
    - The top left corner has "Generalitat de Catalunya Departament d'Educació Institut Gal·lecs". This is a consistent header and should be transcribed as a level 3 heading.
    - The top right corner has "Curs 2021-22". This is a distinct header and should be transcribed as a level 3 heading.
    - Below the header, there are no bullet points, but there is a table.

6. **Process the second section's main content - Table 2:**
    - **Identify table structure:** This is a simpler table with a clear header row: "MATERIAL", "TÍTOL", "AUTOR", "EDITORIAL", "ISBN".
    - **Determine columns:** There are 5 columns.
    - **Populate table data:** Transcribe all text, including multi-line content within cells (e.g., "Projecte MOTIMATS-ONMAT (llicència digital) Tekman Books\nES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE").
    - **Markdown table creation:** Use standard Markdown table syntax.

7. **Handle the second section's footer:**
    - The footer contains "Codí: 04mp02", "Responsable: Coordinador Qualitat", "Versió: 5", "Full d'Informació a l'alumnat i famílies", "Aquest document pot quedar obsolet una vegada imprès", and "Pàgina 2 de 2". This is a simple text block at the bottom. I will transcribe it as a separate paragraph or list, maintaining the original layout if possible, but a simple paragraph is sufficient for a footer.

8. **Final review:** Check for any missed text, incorrect formatting, or issues with table alignment. Ensure all original content is present and accurately represented in GitHub-Flavored Markdown. The image placeholders are not needed as there are no images in the document.

### Generalitat de Catalunya
### Departament d'Educació
### Institut Gal·lecs

### Curs 2021-22

- Els llibres de color blau indiquen que es manté respecte al curs anterior.
- Els llibres de color groc indiquen que es tracta d'un canvi per a aquest curs.
- Els llibres de color vermell indiquen que no s'han de comprar perquè van a càrrec del centre.
- Matèries sense llibre: Dossiers, llicències i el Kit de tecnologia s'hauran de pagar i de recollir a través de l'AMPA a partir de setembre.

# 1 ESO

| MATERIAL | TÍTOL | AUTOR | EDITORIAL | ISBN |
|---|---|---|---|---|
| Llengua Catalana | Punt Volat | | Castellnou (Didacta +) | 9788417803124 |
| | Duna, Diari d'un estiu. | Muriel Villanueva | Babulinka Books<br>EditorialCol·lecció: Pettes<br>Joies per a Grans Lectors.<br>2013 | 9788494159077 |
| | El nen que xatejava amb Jack Sparrow. | Francesc Puigpelat | Bromera<br>Col·lecció: L'Elefant. 2015 | 9788490264072 |
| Llengua Castellana | Proyecto Asterisco | | Castellnou (Didacta +) | 9788417803186 |
| | Manzanas rojas | Luis Matilla | Ed. Anaya | 978846673989 |
| | Fàbulas de Esopo | Jerry Pinkney | Vicens Vives | 978843671648 |
| Anglès | Think Ahead ESO 1. Student's book.<br>Think Ahead ESO 1. Workbook (cat). | | Burlington Books<br>Burlington Books | 9788925300662<br>9789925300686 |

Codí: 04mp02 Responsable: Coordinador Qualitat Versió: 5 Full d'Informació a l'alumnat i famílies Aquest document pot quedar obsolet una vegada imprès Pàgina 1 de 2

### Generalitat de Catalunya
### Departament d'Educació
### Institut Gal·lecs

### Curs 2021-22

| MATERIAL | TÍTOL | AUTOR | EDITORIAL | ISBN |
|---|---|---|---|---|
| FRANCÈS | Nouvelle Génération A1-A2 | | Santillana | 9788490494745 |
| CIÈNCIES EXPERIMENTALS | Science Bits<br>ES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE | | | 9788412213485 (llicència digital) |
| MATEMÀTIQUES | Projecte MOTIMATS-ONMAT (llicència digital) Tekman Books<br>ES GESTIONA A TRAVÉS DE L'AMPA AL SETEMBRE | | | |
| TECNOLOGIA | Tecnologia 1 ESO | TEIDE | | 9788430783175 |
| VISUAL I PLÀSTICA | SENSE LLIBRE-KIT DE MATERIAL | | | |
| CIÈNCIES SOCIALS | SENSE LLIBRE-dossier | | | |

Codí: 04mp02 Responsable: Coordinador Qualitat Versió: 5 Full d'Informació a l'alumnat i famílies Aquest document pot quedar obsolet una vegada imprès Pàgina 2 de 2
```

## Quick start: 🤗 Transformers

```python
from __future__ import annotations

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Numind/NuMarkdown-reasoning"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat prompt that contains a single image and no text instruction.
img = Image.open("invoice.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=5000)

# Decode only the newly generated tokens, then keep the final answer.
# The <answer>...</answer> tag names are an assumption; check them against the raw output.
generated = out[0][enc["input_ids"].shape[1]:]
text = processor.decode(generated, skip_special_tokens=True)
print(text.split("<answer>")[-1].split("</answer>")[0].strip())
```

## vLLM:

```python
from PIL import Image
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "Numind/NuMarkdown-reasoning"

llm = LLM(
    model=model_id,
    tokenizer=model_id,
    dtype="bfloat16",
    gpu_memory_utilization=0.85,
    max_num_seqs=256,
    enforce_eager=True,
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=5000,
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build a chat prompt that contains a single image and no text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
    ],
}]
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image = Image.open("invoice.png").convert("RGB")

inputs = []
inputs.append({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

outs = llm.generate(inputs, sampling_params)

# Keep only the final answer of each completion.
# The <answer>...</answer> tag names are an assumption; check them against the raw output.
preds = [
    o.outputs[0].text.strip().split("<answer>")[-1].split("</answer>")[0].strip()
    for o in outs
]
```
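The model's raw output contains its reasoning followed by the final Markdown, so callers usually need to separate the two. The helper below is a minimal sketch of one way to do that. It assumes the reasoning is wrapped in `<think>`…`</think>` and the answer in `<answer>`…`</answer>`; those tag names, and the function name `split_output`, are assumptions to verify against real generations. It also tolerates an answer that was cut off by the token limit before the closing tag.

```python
import re
from typing import Tuple

# Assumed delimiters; adjust if the model's actual output uses different tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Accept a missing closing tag so truncated generations still yield an answer.
ANSWER_RE = re.compile(r"<answer>(.*?)(?:</answer>|\Z)", re.DOTALL)


def split_output(text: str) -> Tuple[str, str]:
    """Return (reasoning, markdown) extracted from a raw generation.

    If no answer tag is found, the whole text is returned as the markdown
    and the reasoning is empty.
    """
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    reasoning = think.group(1).strip() if think else ""
    markdown = answer.group(1).strip() if answer else text.strip()
    return reasoning, markdown


if __name__ == "__main__":
    raw = "<think>One table, five columns.</think><answer>| A | B |\n|---|---|</answer>"
    reasoning, markdown = split_output(raw)
    print(markdown)
```

Because the reasoning can be several times longer than the answer, keeping it separate also makes it easy to log or discard without touching the Markdown.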