pcuenq HF Staff committed on
Commit 493df70 · verified · 1 Parent(s): 24a5060

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,267 @@
1
+ ---
2
+ license: other
3
+ license_name: nvidia-open-model-license
4
+ license_link: >-
5
+ https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
6
+ pipeline_tag: image-text-to-text
7
+ library_name: transformers
8
+ tags:
9
+ - nvidia
10
+ - VLM
11
+ - FP8
12
+ ---
13
+
14
+ # NVIDIA-Nemotron-Nano-VL-12B-V2-FP8
15
+
16
+ ## Model Overview
17
+
18
+ ### Description
19
+
20
+ NVIDIA-Nemotron-Nano-VL-12B-V2-FP8 is the FP8-quantized version of the NVIDIA Nemotron Nano VL V2 model, an auto-regressive vision-language model that uses an optimized transformer architecture. For more information, please see the [BF16 model card](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16). The model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
21
+
22
+ The model was trained on commercial images for all three stages of training and supports single-image inference.
23
+
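+ For reference, FP8 post-training quantization with TensorRT Model Optimizer typically follows the pattern sketched below. This is a hedged illustration of the library's documented workflow, not the exact recipe or calibration data used to produce this checkpoint; `model` and `calib_dataloader` are placeholders.
+
+ ```python
+ # Illustrative sketch of FP8 post-training quantization with TensorRT Model Optimizer.
+ # Not the exact recipe used for this checkpoint; `model` and `calib_dataloader` are placeholders.
+ import modelopt.torch.quantization as mtq
+
+ def forward_loop(m):
+     # Run a small set of representative image-text batches through the model
+     # so ModelOpt can collect activation ranges for calibration.
+     for batch in calib_dataloader:
+         m(**batch)
+
+ # Quantize Linear layers to FP8 (see `quantization_config` in config.json for the
+ # modules excluded from quantization, e.g. the vision tower and lm_head).
+ model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
+ ```
+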
24
+ ### License/Terms of Use
25
+ **Governing Terms:**
26
+
27
+ Your use of the model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
28
+
29
+ **Additional Information:**
30
+
31
+ Backbone LLM: NVIDIA-Nemotron-Nano-12B-v2.
32
+
33
+
34
+ ### Deployment Geography:
35
+
36
+ Global
37
+
38
+ ### Use Case:
39
+
40
+ Customers: AI foundry enterprise customers
41
+
42
+ Use Cases: Image summarization, text-image analysis, Optical Character Recognition (OCR), interactive Q&A on images, and text Chain-of-Thought reasoning
43
+
44
+
45
+ ## Release Date:
46
+
47
+ - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
48
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16)
49
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
50
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)
51
+
52
+ ## Model Architecture:
53
+
54
+ **Network Type:** Transformer
55
+
56
+ **Network Architecture:**
57
+
58
+ Vision Encoder: [C-RADIOv2-H](https://huggingface.co/nvidia/C-RADIOv2-VLM-H)
59
+
60
+ Language Encoder: NVIDIA-Nemotron-Nano-12B-v2
61
+
62
+ ### Input
63
+
64
+ Input Type(s): Image, Text
65
+ - Input Images
66
+ - Languages Supported: German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, English
67
+
68
+ Input Format(s): Image (Red, Green, Blue (RGB)), and Text (String)
69
+
70
+ Input Parameters: Image (2D), Text (1D)
71
+
72
+ Other Properties Related to Input:
73
+
74
+ - Context length up to 128K
75
+ - Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels (a tile-grid selection sketch follows this list). This supports aspect ratios such as:
76
+ - 4 × 3 layout: up to 2048 × 1536 pixels
77
+ - 3 × 4 layout: up to 1536 × 2048 pixels
78
+ - 2 × 6 layout: up to 1024 × 3072 pixels
79
+ - 6 × 2 layout: up to 3072 × 1024 pixels
80
+ - Other configurations allowed, provided total tiles ≤ 12
81
+ - Channel Count: 3 channels (RGB)
82
+ - Alpha Channel: Not supported (no transparency)
83
+
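+ As an illustration of the tile constraint above, the snippet below is a minimal sketch (not the model's actual preprocessing code) that enumerates row × column layouts of 512 × 512 tiles with at most 12 tiles and picks the one whose aspect ratio best matches the input image; names are illustrative only.
+
+ ```python
+ # Hypothetical illustration of the 12-tile, 512 x 512 layout constraint.
+ # This is NOT the model's preprocessing code; names are made up for clarity.
+ TILE = 512
+ MAX_TILES = 12
+
+ def pick_tile_grid(width: int, height: int) -> tuple[int, int]:
+     """Return a (cols, rows) grid with cols * rows <= MAX_TILES whose aspect
+     ratio best matches the input image."""
+     target = width / height
+     candidates = [
+         (c, r)
+         for c in range(1, MAX_TILES + 1)
+         for r in range(1, MAX_TILES + 1)
+         if c * r <= MAX_TILES
+     ]
+     cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))
+     return cols, rows
+
+ # Example: a 2048 x 1536 image maps to a 4 x 3 grid (up to 2048 x 1536 pixels).
+ print(pick_tile_grid(2048, 1536))  # -> (4, 3)
+ ```
+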
84
+ ### Output
85
+ Output Type(s): Text
86
+
87
+ Output Formats: String
88
+
89
+ Output Parameters: One-Dimensional (1D): Sequences up to 128K
90
+
91
+
92
+
93
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
94
+
95
+ ### Software Integration
96
+ Runtime Engine(s): vLLM<br>
97
+ Supported Hardware Microarchitecture Compatibility: H100 SXM 80GB<br>
98
+ Supported Operating System(s): Linux<br>
99
+
100
+ ### Model Versions:
101
+ Nemotron-Nano-VL-12B-V2-FP8
102
+
103
+ ## Quick Start
104
+
105
+ ### Install Dependencies
106
+ ```sh
107
+ pip install causal_conv1d "transformers>4.53,<4.54" torch timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow
108
+ ```
109
+
110
+ ### Usage
111
+
112
+ To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), start the `vllm/vllm-openai:nightly` Docker container and run the sample command below:
113
+
114
+ ```sh
115
+ python3 -m vllm.entrypoints.openai.api_server --model nvidia/Nemotron-Nano-VL-12B-V2-FP8 --trust-remote-code --quantization modelopt
116
+ ```
117
+
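+ Once the server is running, it exposes an OpenAI-compatible API. The snippet below is a minimal client sketch using the `openai` Python package; it assumes the default port 8000, and the image URL is only a placeholder.
+
+ ```python
+ # Minimal sketch of querying the vLLM OpenAI-compatible server started above.
+ # Assumes the default port (8000); the image URL is a placeholder.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="nvidia/Nemotron-Nano-VL-12B-V2-FP8",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
+             {"type": "text", "text": "Describe this image in one sentence."},
+         ],
+     }],
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ ```
+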
118
+
119
+ ## Training, Testing, and Evaluation Datasets:
120
+
121
+ ### Training Datasets:
122
+
123
+ **Data Modalities** <br>
124
+ * Total number of samples: 39,486,703 <br>
125
+ * Total number of datasets: 270 <br>
126
+ * Text-only datasets: 33 <br>
127
+ * Text-and-image datasets: 176 <br>
128
+ * Video-and-text datasets: 61 <br>
129
+ * Total size: 27.7 TB <br>
130
+
131
+ * Data modalities: Text, Image, Video <br>
132
+ * Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic <br>
133
+ * Labeling Method by dataset: Hybrid: Automated, Human, Synthetic <br>
134
+
135
+ * Dataset partition: Training [100%], Testing [0%], Validation [0%] <br>
136
+ * Time period for training data collection: 2023-2025 <br>
137
+ * Time period for testing data collection: N/A <br>
138
+ * Time period for validation data collection: N/A <br>
139
+
140
+ The post-training datasets consist of a mix of internal and public datasets designed for training vision language models across various tasks. They include:
141
+
142
+ * Public datasets sourced from publicly available images and annotations, supporting tasks like classification, captioning, visual question answering, conversation modeling, document analysis and text/image reasoning.
143
+ * Internal text and image datasets built with public commercial images and internal labels, adapted for the same tasks as listed above.
144
+ * Synthetic image datasets generated programmatically for specific tasks like tabular data understanding and optical character recognition (OCR), for English, Chinese as well as other languages.
145
+ * Video datasets supporting video question answering and reasoning tasks from publicly available video sources, with either publicly available or internally generated annotations.
146
+ * Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
147
+ * NVIDIA-Sourced Synthetic Datasets for text reasoning.
148
+ * Private datasets for safety alignment or VQA on invoices.
149
+ * Crawled or scraped captioning, VQA, and video datasets.
150
+ * Some datasets were improved with Qwen2.5-72B-Instruct annotations.
151
+
152
+ For around 30% of our total training corpus, covering several of the domains listed above, we used commercially permissive models to perform:
153
+ * Language translation
154
+ * Re-labeling of annotations for text, image and video datasets
155
+ * Synthetic data generation
156
+ * Generating chain-of-thought (CoT) traces
157
+
158
+ Additional processing for several datasets included rule-based QA generation (e.g., with templates), expanding short answers into longer responses, as well as proper reformatting. More details can be found [here](https://arxiv.org/abs/2501.14818).
159
+
160
+
161
+ * Image-based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>
162
+
163
+ #### Public Datasets <br>
164
+ | Type | Data Type | Total Samples | Total Size (GB) |
165
+ |------|-----------|---------------|------------------|
166
+ | Function call | text | 8,000 | 0.02 |
167
+ | Image Captioning | image, text | 1,422,102 | 1,051.04 |
168
+ | Image Reasoning | image, text | 1,888,217 | 286.95 |
169
+ | OCR | image, text | 9,830,570 | 5,317.60 |
170
+ | Referring Expression Grounding | image, text | 14,694 | 2.39 |
171
+ | Safety | image, text | 34,187 | 9.21 |
172
+ | Safety | text | 57,223 | 0.52 |
173
+ | Safety | video, text | 12,988 | 11.78 |
174
+ | Text Instruction Tuning | text | 245,056 | 1.13 |
175
+ | Text Reasoning | text | 225,408 | 4.55 |
176
+ | VQA | image, text | 8,174,136 | 2,207.52 |
177
+ | VQA | video, text | 40,000 | 46.05 |
178
+ | Video Captioning | video, text | 3,289 | 6.31 |
179
+ | Video Reasoning | video, text | 42,620 | 49.10 |
180
+ | VideoQA | video, text | 1,371,923 | 17,641.79 |
181
+ | Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
182
+ | **TOTAL** | | **24,544,290** | **26,803.75** |
183
+
184
+ #### Private Datasets <br>
185
+ | Type | Modalities | Total Samples | Total Size (GB) |
186
+ |------|------------|---------------|------------------|
187
+ | Image Reasoning | image, text | 17,729 | 15.41 |
188
+ | Text Reasoning | text | 445,958 | 9.01 |
189
+ | **TOTAL** | | **463,687** | **24.42** |
190
+
191
+
192
+ #### Data Crawling and Scraping <br>
193
+ | Type | Modalities | Total Samples | Total Size (GB) |
194
+ |------|------------|---------------|------------------|
195
+ | Image Captioning | image, text | 39,870 | 10.24 |
196
+ | VQA | image, text | 40,348 | 3.94 |
197
+ | VideoQA | video, text | 288,728 | 393.30 |
198
+ | **TOTAL** | | **368,946** | **407.48** |
199
+
200
+ #### User-Sourced Data (Collected by Provider including Prompts) <br>
201
+ <br>
202
+
203
+ #### Self-Sourced Synthetic Data <br>
204
+ | Type | Data Type | Total Samples | Total Size (GB) |
205
+ |------|-----------|---------------|------------------|
206
+ | Code | text | 1,165,591 | 54.15 |
207
+ | OCR | image, text | 216,332 | 83.53 |
208
+ | Text Reasoning | text | 12,727,857 | 295.80 |
209
+ | **TOTAL** | | **14,109,780** | **433.48** |
210
+
211
+
212
+ **Properties**<br>
213
+ * Additionally, the dataset collection (for training and evaluation) consists of a mix of internal and public datasets designed for training and evaluation across various tasks. It includes:
214
+ * Internal datasets built with public commercial images and internal labels, supporting tasks like conversation modeling and document analysis.
215
+ * Public datasets sourced from publicly available images and annotations, adapted for tasks such as image captioning and visual question answering.
216
+ * Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
217
+ * Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
218
+
219
+ ### Evaluation Datasets:
220
+ The following external benchmarks are used for evaluating the model: <br>
221
+
222
+ | Dataset |
223
+ |---------|
224
+ | [AI2D Test](https://prior.allenai.org/projects/diagram-understanding) |
225
+ | [ChartQA Test](https://github.com/vis-nlp/ChartQA) |
226
+ | [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR) |
227
+ | [OCRBenchV2](https://github.com/Yuliang-Liu/MultimodalOCR) English |
228
+ | [DocVQA Val](https://www.docvqa.org/datasets) |
229
+
230
+
231
+
232
+ Data Collection Method by dataset: <br>
233
+ * Hybrid: Human, Automated <br>
234
+
235
+ Labeling Method by dataset: <br>
236
+ * Hybrid: Human, Automated <br>
237
+
238
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):** N/A <br>
239
+
240
+ **Dataset License(s):** N/A <br>
241
+
242
+
243
+
244
+ ## Evaluation Benchmarks:
245
+
246
+ | Benchmark | Score (FP8) | Score (BF16) |
247
+ | --- | --- | --- |
248
+ | AI2D | 87.6% | 87.1% |
249
+ | OCRBenchV2 | 61.8% | 62.0% |
250
+ | OCRBench | 85.4% | 85.6% |
251
+ | ChartQA | 89.4% | 89.7% |
252
+ | DocVQA val | 94.3% | 94.4% |
253
+
254
+
255
+
256
+ ## Inference:
257
+ **Engine:** vLLM <br>
258
+ **Test Hardware:** <br>
259
+ * 1x NVIDIA H100 SXM 80GB
260
+
261
+
262
+ ## Ethical Considerations:
263
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
264
+
265
+ Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
266
+
267
+ Outputs generated by these models may contain political content or other potentially misleading information, issues with content security and safety, or unwanted bias that is independent of our oversight.
bias.md ADDED
@@ -0,0 +1,13 @@
1
+ | Field | Response |
2
+ |:---|:---|
3
+ | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
4
+ | Bias Metric (If Measured): | [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/) |
5
+ | Which characteristic (feature) show(s) the greatest difference in performance?: | The model shows high variance across many characteristics when used at a high temperature, with the greatest measurable difference seen in categories such as Gender Identity and Race x Gender. |
6
+ | Which feature(s) have the worst performance overall? | Age (ambiguous) has both the lowest category accuracy listed (0.75) and a notably negative bias score (–0.56), indicating it is the worst-performing feature overall in this evaluation. |
7
+ | Measures taken to mitigate against unwanted bias: | None |
8
+ | If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | The training datasets contain a large amount of synthetic data generated by LLMs. We manually curated prompts. |
9
+ | Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Bias Benchmark for Question Answering (BBQ) |
10
+ | Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | The datasets, which include video datasets (e.g., YouCook2, VCG Human Dataset) and image captioning datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). For instance, these datasets do not contain explicit mentions of demographic classes such as age, gender, or ethnicity in over 80% of samples. In the subset where analysis was performed, certain datasets contain skews in the representation of participants; for example, perceived gender of "female" participants may be significant compared to "male" participants for certain datasets. Separately, individuals aged "40 to 49 years" and "20 to 29 years" are the most frequent among ethnic identifiers. Toxicity analysis was additionally performed on several datasets to identify potential not-safe-for-work samples and risks. To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, fine-tuning with demographically balanced datasets, and mitigation strategies like counterfactual data augmentation to align with the desired model behavior. This evaluation was conducted on a data subset ranging from 200 to 3,000 samples per dataset; as such, certain limitations may exist in the reliability of the embeddings. A baseline of 200 samples was used across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses. |
chat_template.jinja ADDED
@@ -0,0 +1,179 @@
1
+ {%- set ns = namespace(enable_thinking=false, has_sys_prompt=false, non_tool_system_content='', has_video=false, explicit_think_requested=false) -%}
2
+ {%- set msg = namespace(content='') -%}
3
+ {%- for message in messages -%}
4
+ {%- if message['role'] == 'system' -%}
5
+ {%- set ns.has_sys_prompt = true -%}
6
+ {# Extract system content without tool flags #}
7
+ {%- if message['content'] is string -%}
8
+ {%- set ns.non_tool_system_content = message['content'].replace('</think>', '<_end_think>').replace('/think', '').replace('/no_think', '').replace('<_end_think>', '</think>').strip() -%}
9
+ {%- else -%}
10
+ {%- set ns.non_tool_system_content = '' -%}
11
+ {%- for content in message['content'] -%}
12
+ {%- if content['type'] == 'text' -%}
13
+ {%- set ns.non_tool_system_content = ns.non_tool_system_content + content['text'].replace('</think>', '<_end_think>').replace('/think', '').replace('/no_think', '').replace('<_end_think>', '</think>') -%}
14
+ {%- endif -%}
15
+ {%- endfor -%}
16
+ {%- set ns.non_tool_system_content = ns.non_tool_system_content.strip() -%}
17
+ {%- endif -%}
18
+ {%- endif -%}
19
+ {# Check for video content in all messages #}
20
+ {%- if message['content'] is not string -%}
21
+ {%- for content in message['content'] -%}
22
+ {%- if content['type'] == 'video' or content['type'] == 'video_url' -%}
23
+ {%- set ns.has_video = true -%}
24
+ {%- endif -%}
25
+ {%- endfor -%}
26
+ {%- endif -%}
27
+ {%- if message['content'] is string -%}
28
+ {%- if message['role'] == 'user' or message['role'] == 'system' -%}
29
+ {%- if '/think' in message['content'].replace('</think>', '') -%}
30
+ {%- set ns.enable_thinking = true -%}
31
+ {%- set ns.explicit_think_requested = true -%}
32
+ {%- elif '/no_think' in message['content'] -%}
33
+ {%- set ns.enable_thinking = false -%}
34
+ {%- endif -%}
35
+ {%- endif -%}
36
+ {%- else -%}
37
+ {%- for content in message['content'] -%}
38
+ {%- if content['type'] == 'text' -%}
39
+ {%- if message['role'] == 'user' or message['role'] == 'system' -%}
40
+ {%- if '/think' in content['text'].replace('</think>', '') -%}
41
+ {%- set ns.enable_thinking = true -%}
42
+ {%- set ns.explicit_think_requested = true -%}
43
+ {%- elif '/no_think' in content['text'] -%}
44
+ {%- set ns.enable_thinking = false -%}
45
+ {%- endif -%}
46
+ {%- endif -%}
47
+ {%- endif -%}
48
+ {%- endfor -%}
49
+ {%- endif -%}
50
+ {%- endfor -%}
51
+
52
+ {# Error out if video is present and reasoning is explicitly requested #}
53
+ {%- if ns.has_video and ns.explicit_think_requested -%}
54
+ {{ raise_exception('Video inputs are not supported with reasoning mode. Please remove /think flag or remove video content.') }}
55
+ {%- endif -%}
56
+
57
+ {# Automatically disable reasoning if video is present (without explicit /think request) #}
58
+ {%- if ns.has_video and not ns.explicit_think_requested -%}
59
+ {%- set ns.enable_thinking = false -%}
60
+ {%- endif -%}
61
+
62
+ {%- if messages[0]['role'] != 'system' -%}
63
+ {{- '<SPECIAL_10>System\n' -}}
64
+ {%- else -%}
65
+ {{- '<SPECIAL_10>System\n' + ns.non_tool_system_content }}
66
+ {%- endif -%}
67
+
68
+ {%- if tools -%}
69
+ {%- if ns.non_tool_system_content != '' -%}
70
+ {{- '\n\n' -}}
71
+ {%- endif -%}
72
+ {{- 'You can use the following tools to assist the user if required:\n' -}}
73
+ {{- '<AVAILABLE_TOOLS>[' -}}
74
+ {%- for tool in tools -%}
75
+ {{- (tool.function if tool.function is defined else tool) | tojson -}}
76
+ {{- ', ' if not loop.last else '' -}}
77
+ {%- endfor -%}
78
+ {{- ']</AVAILABLE_TOOLS>\n\n' -}}
79
+
80
+ {{- 'If you decide to call any tool(s), use the following format:\n' -}}
81
+ {{- '<TOOLCALL>[{"name": "tool_name1", "arguments": "tool_args1"}, ' -}}
82
+ {{- '{"name": "tool_name2", "arguments": "tool_args2"}]</TOOLCALL>\n\n' -}}
83
+
84
+ {{- 'The user will execute tool-calls and return responses from tool(s) in this format:\n' -}}
85
+ {{- '<TOOL_RESPONSE>[{"response": "tool_response1"}, ' -}}
86
+ {{- '{"response": "tool_response2"}]</TOOL_RESPONSE>\n\n' -}}
87
+
88
+ {{- 'Based on the tool responses, you can call additional tools if needed, ' -}}
89
+ {{- 'correct tool calls if any errors are found, or just respond to the user.' -}}
90
+ {%- endif -%}
91
+ {{- '\n' -}}
92
+
93
+ {%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}
94
+
95
+ {# Prevent no user or assistant message #}
96
+ {%- if messages|length == 0 -%}
97
+ {%- set messages = [{'role': 'user', 'content': ''}] -%}
98
+ {%- endif -%}
99
+
100
+ {%- for message in messages %}
101
+ {%- if message['content'] is string -%}
102
+ {%- set msg.content = message['content'].replace('</think>', '<_end_think>').replace('/think', '').replace('/no_think', '').replace('<_end_think>', '</think>').strip() -%}
103
+ {%- else -%}
104
+ {%- set msg.content = '' -%}
105
+ {%- set mm_content = '' -%}
106
+ {%- set counters = namespace(images=0, videos=0) -%}
107
+
108
+ {%- for content in message['content'] -%}
109
+ {%- if content['type'] == 'image' -%}
110
+ {%- set counters.images = counters.images + 1 -%}
111
+ {%- elif content['type'] == 'video' -%}
112
+ {%- set counters.videos = counters.videos + 1 -%}
113
+ {%- elif content['type'] == 'text' -%}
114
+ {%- set msg.content = msg.content + content['text'] -%}
115
+ {%- endif -%}
116
+ {%- endfor -%}
117
+ {%- if '<image>' in msg.content -%}
118
+ {%- set counters.images = 0 -%}
119
+ {%- endif -%}
120
+ {%- if '<video>' in msg.content -%}
121
+ {%- set counters.videos = 0 -%}
122
+ {%- endif -%}
123
+ {%- if counters.images > 1 -%}
124
+ {%- set image_tags = namespace(tags=[]) -%}
125
+ {%- for i in range(counters.images) -%}
126
+ {%- set image_tags.tags = image_tags.tags + ['<image ' + (i + 1)|string + '><image>'] -%}
127
+ {%- endfor -%}
128
+ {%- set mm_content = ' '.join(image_tags.tags) + '\n' -%}
129
+ {%- elif counters.images == 1 -%}
130
+ {%- set mm_content = '<image>\n' -%}
131
+ {%- endif -%}
132
+ {%- set mm_content = mm_content + '<video>\n' * counters.videos -%}
133
+ {%- set msg.content = mm_content + msg.content.lstrip('\n') -%}
134
+ {%- endif -%}
135
+
136
+ {%- if message['role'] == 'user' %}
137
+ {{- '<SPECIAL_11>User\n' + msg.content.replace('</think>', '<_end_think>').replace('/think', '').replace('/no_think', '').replace('<_end_think>', '</think>').strip() + '\n' }}
138
+ {%- elif message['role'] == 'tool' %}
139
+ {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
140
+ {{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}
141
+ {%- endif -%}
142
+ {{- msg.content -}}
143
+ {{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
144
+ {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
145
+ {{- ']</TOOL_RESPONSE>\n' -}}
146
+ {%- endif -%}
147
+ {%- elif message['role'] == 'assistant' %}
148
+ {%- if '</think>' in msg.content %}
149
+ {%- set msg.content = msg.content.split('</think>')[1].strip() %}
150
+ {%- endif %}
151
+ {{- '<SPECIAL_11>Assistant\n' + msg.content.strip() }}
152
+ {%- if message.tool_calls -%}
153
+ {%- if msg.content.strip() != '' -%}
154
+ {{- '\n\n' -}}
155
+ {%- endif -%}
156
+ {{- '<TOOLCALL>[' -}}
157
+ {%- for call in message.tool_calls -%}
158
+ {%- set fn = call.function if call.function is defined else call -%}
159
+ {{- '{"name": "' + fn.name + '", "arguments": ' -}}
160
+ {%- if fn.arguments is string -%}
161
+ {{- fn.arguments -}}
162
+ {%- else -%}
163
+ {{- fn.arguments | tojson -}}
164
+ {%- endif -%}
165
+ {{- '}' + (', ' if not loop.last else '') -}}
166
+ {%- endfor -%}
167
+ {{- ']</TOOLCALL>' -}}
168
+ {%- endif -%}
169
+ {{- '\n<SPECIAL_12>\n' -}}
170
+ {%- endif %}
171
+ {%- endfor -%}
172
+ {%- if add_generation_prompt %}
173
+ {{- '<SPECIAL_11>Assistant\n' }}
174
+ {%- if ns.enable_thinking is defined and ns.enable_thinking is false %}
175
+ {{- '<think></think>' }}
176
+ {%- else %}
177
+ {{- '<think>\n' }}
178
+ {%- endif %}
179
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,357 @@
1
+ {
2
+ "architectures": [
3
+ "NemotronH_Nano_VL_V2"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration.NemotronH_Nano_VL_V2_Config",
7
+ "AutoModel": "modeling.NemotronH_Nano_VL_V2",
8
+ "AutoModelForCausalLM": "modeling.NemotronH_Nano_VL_V2"
9
+ },
10
+ "downsample_ratio": 0.5,
11
+ "eos_token_id": 12,
12
+ "force_image_size": 512,
13
+ "image_tag_type": "internvl",
14
+ "img_context_token": "<image>",
15
+ "img_context_token_id": 131072,
16
+ "img_end_token": "</img>",
17
+ "img_start_token": "<img>",
18
+ "llm_config": {
19
+ "architectures": [
20
+ "NemotronHForCausalLM"
21
+ ],
22
+ "attention_bias": false,
23
+ "attention_dropout": 0.0,
24
+ "attention_head_dim": 128,
25
+ "auto_map": {
26
+ "AutoConfig": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base--configuration_nemotron_h.NemotronHConfig",
27
+ "AutoModelForCausalLM": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base--modeling_nemotron_h.NemotronHForCausalLM"
28
+ },
29
+ "chunk_size": 128,
30
+ "conv_kernel": 4,
31
+ "eos_token_id": 12,
32
+ "expand": 2,
33
+ "head_dim": 128,
34
+ "hidden_dropout": 0.0,
35
+ "hidden_size": 5120,
36
+ "hybrid_override_pattern": "M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M-",
37
+ "initializer_range": 0.02,
38
+ "intermediate_size": 20480,
39
+ "layer_norm_epsilon": 1e-05,
40
+ "mamba_head_dim": 80,
41
+ "mamba_hidden_act": "silu",
42
+ "mamba_num_heads": 128,
43
+ "mamba_proj_bias": false,
44
+ "max_position_embeddings": 131072,
45
+ "mlp_bias": false,
46
+ "mlp_hidden_act": "relu2",
47
+ "model_type": "nemotron_h",
48
+ "n_groups": 8,
49
+ "num_attention_heads": 40,
50
+ "num_hidden_layers": 62,
51
+ "num_key_value_heads": 8,
52
+ "num_logits_to_keep": 1,
53
+ "rescale_prenorm_residual": true,
54
+ "residual_in_fp32": false,
55
+ "rms_norm_eps": 1e-05,
56
+ "sliding_window": null,
57
+ "ssm_state_size": 128,
58
+ "time_step_floor": 0.0001,
59
+ "time_step_limit": [
60
+ 0.0,
61
+ Infinity
62
+ ],
63
+ "time_step_max": 0.1,
64
+ "time_step_min": 0.001,
65
+ "time_step_rank": 256,
66
+ "torch_dtype": "bfloat16",
67
+ "use_bias": false,
68
+ "use_cache": true,
69
+ "use_conv_bias": true,
70
+ "use_mamba_kernels": true,
71
+ "vocab_size": 132096
72
+ },
73
+ "max_sequence_length": 131072,
74
+ "model_type": "NemotronH_Nano_VL_V2",
75
+ "norm_mean": [
76
+ 0.48145466,
77
+ 0.4578275,
78
+ 0.40821073
79
+ ],
80
+ "norm_std": [
81
+ 0.26862954,
82
+ 0.26130258,
83
+ 0.27577711
84
+ ],
85
+ "patch_size": 16,
86
+ "projector_hidden_size": 20480,
87
+ "ps_version": "v2",
88
+ "template": "n5h_5p5_nanov2",
89
+ "torch_dtype": "bfloat16",
90
+ "transformers_version": "4.53.3",
91
+ "use_thumbnail": true,
92
+ "video_context_token": "<video>",
93
+ "video_context_token_id": 131081,
94
+ "video_pruning_rate": 0.7,
95
+ "vision_config": {
96
+ "adaptor_configs": {},
97
+ "adaptor_names": null,
98
+ "architectures": [
99
+ "RADIOModel"
100
+ ],
101
+ "args": {
102
+ "aa": null,
103
+ "amp": true,
104
+ "amp_dtype": "bfloat16",
105
+ "amp_impl": "native",
106
+ "aug_repeats": 0,
107
+ "aug_splits": 0,
108
+ "bn_eps": null,
109
+ "bn_momentum": null,
110
+ "cache_dir": null,
111
+ "channels_last": false,
112
+ "checkpoint_hist": 10,
113
+ "chk_keep_forever": 100,
114
+ "class_map": "",
115
+ "clip_grad": null,
116
+ "clip_mode": "norm",
117
+ "cls_token_per_teacher": true,
118
+ "coco_annotations_file": "/datasets/coco2017-adlsa/annotations/captions_val2017.json",
119
+ "coco_image_dir": "/datasets/coco2017-adlsa/val2017",
120
+ "color_jitter": 0.4,
121
+ "cooldown_epochs": 0,
122
+ "cpe_max_size": 2048,
123
+ "crd_loss": false,
124
+ "crd_loss_weight": 0.8,
125
+ "crop_pct": null,
126
+ "cutmix": 0.0,
127
+ "cutmix_minmax": null,
128
+ "dataset_download": false,
129
+ "debug_full_knn": false,
130
+ "decay_epochs": 90,
131
+ "decay_milestones": [
132
+ 90,
133
+ 180,
134
+ 270
135
+ ],
136
+ "decay_rate": 0.1,
137
+ "depchain": true,
138
+ "dist_bn": "reduce",
139
+ "dist_norm_weight": 0.0,
140
+ "distributed": true,
141
+ "drop": 0.0,
142
+ "drop_block": null,
143
+ "drop_connect": null,
144
+ "drop_path": null,
145
+ "dtype": "bfloat16",
146
+ "epoch_repeats": 0.0,
147
+ "eval": false,
148
+ "eval_metric": "knn_top1",
149
+ "eval_teacher": false,
150
+ "eval_teacher_only": false,
151
+ "eval_throughput": false,
152
+ "fast_norm": false,
153
+ "fd_loss_fn": "MSE",
154
+ "feature_normalization": "SHIP_NORM",
155
+ "feature_summarizer": "cls_token",
156
+ "feature_upscale_factor": null,
157
+ "force_new_wandb_id": false,
158
+ "force_spectral_reparam": true,
159
+ "freeze_bn": false,
160
+ "fsdp": false,
161
+ "fuser": "",
162
+ "gp": null,
163
+ "grad_accum_steps": 1,
164
+ "grad_checkpointing": false,
165
+ "head_init_bias": null,
166
+ "head_init_scale": null,
167
+ "head_warmup": 5,
168
+ "head_weight_decay": 0.001,
169
+ "hflip": 0.5,
170
+ "img_size": null,
171
+ "in_chans": null,
172
+ "initial_checkpoint": null,
173
+ "input_size": null,
174
+ "interpolation": "",
175
+ "layer_decay": null,
176
+ "local_rank": 0,
177
+ "log_interval": 50,
178
+ "log_mlflow": false,
179
+ "log_wandb": true,
180
+ "loss_auto_balance": false,
181
+ "lr_base": 0.1,
182
+ "lr_base_scale": "",
183
+ "lr_base_size": 256,
184
+ "lr_cycle_decay": 0.5,
185
+ "lr_cycle_limit": 1,
186
+ "lr_cycle_mul": 1.0,
187
+ "lr_k_decay": 1.0,
188
+ "lr_noise": null,
189
+ "lr_noise_pct": 0.67,
190
+ "lr_noise_std": 1.0,
191
+ "mean": null,
192
+ "mesa": false,
193
+ "min_lr": 0,
194
+ "mixup": 0.0,
195
+ "mixup_mode": "batch",
196
+ "mixup_off_epoch": 0,
197
+ "mixup_prob": 1.0,
198
+ "mixup_switch_prob": 0.5,
199
+ "mlp_hidden_size": 1520,
200
+ "mlp_num_inner": 3,
201
+ "mlp_version": "v2",
202
+ "model": "vit_huge_patch16_224",
203
+ "model_kwargs": {},
204
+ "model_norm": false,
205
+ "momentum": 0.9,
206
+ "no_aug": false,
207
+ "no_ddp_bb": true,
208
+ "no_prefetcher": false,
209
+ "no_resume_opt": false,
210
+ "num_classes": null,
211
+ "opt_betas": null,
212
+ "opt_eps": null,
213
+ "patience_epochs": 10,
214
+ "pin_mem": false,
215
+ "prefetcher": true,
216
+ "pretrained": false,
217
+ "rank": 0,
218
+ "ratio": [
219
+ 0.75,
220
+ 1.3333333333333333
221
+ ],
222
+ "recount": 1,
223
+ "recovery_interval": 0,
224
+ "register_multiple": 16,
225
+ "remode": "pixel",
226
+ "reprob": 0.0,
227
+ "reset_loss_state": false,
228
+ "resplit": false,
229
+ "save_images": false,
230
+ "scale": [
231
+ 0.5,
232
+ 1.0
233
+ ],
234
+ "sched": "cosine",
235
+ "seed": 42,
236
+ "smoothing": 0.1,
237
+ "spectral_heads": false,
238
+ "spectral_reparam": false,
239
+ "split_bn": false,
240
+ "start_epoch": null,
241
+ "std": null,
242
+ "stream_teachers": true,
243
+ "sync_bn": false,
244
+ "synchronize_step": false,
245
+ "teachers": [
246
+ {
247
+ "fd_normalize": false,
248
+ "feature_distillation": true,
249
+ "input_size": 378,
250
+ "model": "ViT-H-14-378-quickgelu",
251
+ "name": "clip",
252
+ "pretrained": "dfn5b",
253
+ "type": "open_clip",
254
+ "use_summary": true
255
+ },
256
+ {
257
+ "fd_normalize": false,
258
+ "feature_distillation": true,
259
+ "input_size": 378,
260
+ "model": "ViT-SO400M-14-SigLIP-384",
261
+ "name": "siglip",
262
+ "pretrained": "webli",
263
+ "type": "open_clip",
264
+ "use_summary": true
265
+ },
266
+ {
267
+ "fd_normalize": false,
268
+ "feature_distillation": true,
269
+ "input_size": 378,
270
+ "model": "dinov2_vitg14_reg",
271
+ "name": "dino_v2",
272
+ "type": "dino_v2",
273
+ "use_summary": true
274
+ },
275
+ {
276
+ "fd_normalize": false,
277
+ "feature_distillation": true,
278
+ "input_size": 1024,
279
+ "model": "vit-h",
280
+ "name": "sam",
281
+ "type": "sam",
282
+ "use_summary": false
283
+ }
284
+ ],
285
+ "torchcompile": null,
286
+ "torchscript": false,
287
+ "train_interpolation": "random",
288
+ "train_split": "train",
289
+ "tta": 0,
290
+ "use_coco": false,
291
+ "use_multi_epochs_loader": false,
292
+ "val_ema_only": false,
293
+ "val_split": "val",
294
+ "vflip": 0.0,
295
+ "vitdet_version": 1,
296
+ "wandb_entity": "",
297
+ "wandb_job_type": "",
298
+ "wandb_name": "",
299
+ "wandb_project": "",
300
+ "warmup_lr": 1e-05,
301
+ "warmup_prefix": false,
302
+ "worker_seeding": "all",
303
+ "workers": 8,
304
+ "world_size": 256
305
+ },
306
+ "auto_map": {
307
+ "AutoConfig": "nvidia/C-RADIOv2-H--hf_model.RADIOConfig",
308
+ "AutoModel": "nvidia/C-RADIOv2-H--hf_model.RADIOModel"
309
+ },
310
+ "feature_normalizer_config": null,
311
+ "inter_feature_normalizer_config": null,
312
+ "max_resolution": 2048,
313
+ "model_type": "",
314
+ "patch_size": 16,
315
+ "preferred_resolution": [
316
+ 768,
317
+ 768
318
+ ],
319
+ "torch_dtype": "bfloat16",
320
+ "use_flash_attn": false,
321
+ "version": "radio_v2.5-h",
322
+ "vitdet_window_size": null
323
+ },
324
+ "vit_hidden_size": 1280,
325
+ "quantization_config": {
326
+ "config_groups": {
327
+ "group_0": {
328
+ "input_activations": {
329
+ "dynamic": false,
330
+ "num_bits": 8,
331
+ "type": "float"
332
+ },
333
+ "weights": {
334
+ "dynamic": false,
335
+ "num_bits": 8,
336
+ "type": "float"
337
+ },
338
+ "targets": [
339
+ "Linear"
340
+ ]
341
+ }
342
+ },
343
+ "ignore": [
344
+ "model.layers.language_model.lm_head",
345
+ "model.layers.mlp1*",
346
+ "model.layers.*.conv1d*",
347
+ "model.layers.vision_model*",
348
+ "lm_head"
349
+ ],
350
+ "quant_algo": "FP8",
351
+ "producer": {
352
+ "name": "modelopt",
353
+ "version": "0.37.0.dev5+g76fb12d47.d20250905"
354
+ },
355
+ "quant_method": "modelopt"
356
+ }
357
+ }
configuration.py ADDED
@@ -0,0 +1,57 @@
1
+ # --------------------------------------------------------
2
+ # Adapted from https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B under MIT License
3
+ # LICENSE is in incl_licenses directory.
4
+ # --------------------------------------------------------
5
+
6
+ from transformers.configuration_utils import PretrainedConfig
7
+ from transformers.utils import logging
8
+ from .configuration_nemotron_h import NemotronHConfig
9
+ from .configuration_radio import RADIOConfig
10
+
11
+ logger = logging.get_logger(__name__)
12
+
13
+ class NemotronH_Nano_VL_V2_Config(PretrainedConfig):
14
+ model_type = 'NemotronH_Nano_VL_V2'
15
+ is_composition = True
16
+
17
+ def __init__(
18
+ self,
19
+ vision_config=None,
20
+ llm_config=None,
21
+ force_image_size=None,
22
+ downsample_ratio=0.5,
23
+ template=None,
24
+ ps_version='v1',
25
+ image_tag_type="internvl",
26
+ projector_hidden_size=4096,
27
+ vit_hidden_size=1280,
28
+ attn_implementation="flash_attention_2",
29
+ video_pruning_rate: float = 0.0,
30
+ **kwargs
31
+ ):
32
+ super().__init__(**kwargs)
33
+
34
+ if vision_config is not None:
35
+ self.vision_config = RADIOConfig(**vision_config)
36
+ else:
37
+ self.vision_config = RADIOConfig()
38
+
39
+ # Handle both cases: when loading from JSON (llm_config is dict) and when called internally by transformers (llm_config is None)
40
+ if llm_config is not None:
41
+ self.llm_config = NemotronHConfig(**llm_config)
42
+ else:
43
+ self.llm_config = NemotronHConfig()
44
+
45
+ # Assign configuration values
46
+ self.force_image_size = force_image_size
47
+ self.downsample_ratio = downsample_ratio
48
+ self.template = template # TODO move out of here and into the tokenizer
49
+ self.ps_version = ps_version # Pixel shuffle version
50
+ self.image_tag_type = image_tag_type # TODO: into the tokenizer too?
51
+ self.projector_hidden_size = projector_hidden_size
52
+ self.vit_hidden_size = vit_hidden_size
53
+ self.video_pruning_rate = video_pruning_rate
54
+
55
+ self._attn_implementation = attn_implementation
56
+ self.vision_config.use_flash_attn = self._attn_implementation is not None and "flash_attention" in self._attn_implementation
57
+ self.llm_config._attn_implementation = self._attn_implementation
configuration_nemotron_h.py ADDED
@@ -0,0 +1,245 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 AI21 Labs Ltd. and the HuggingFace Inc. team. All rights reserved.
3
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """NemotronH model configuration"""
17
+
18
+ import re
19
+
20
+ from transformers.configuration_utils import PretrainedConfig
21
+ from transformers.utils import logging
22
+
23
+
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ class NemotronHConfig(PretrainedConfig):
28
+ r"""
29
+ This is the configuration class to store the configuration of a [`NemotronHModel`]. It is used to instantiate a
30
+ NemotronH model according to the specified arguments, defining the model architecture. Instantiating a configuration
31
+ with the defaults will yield a similar configuration to that of the NemotronH-v0.1 model.
32
+
33
+ [todo](todo)
34
+
35
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
36
+ documentation from [`PretrainedConfig`] for more information.
37
+
38
+
39
+ Args:
40
+ vocab_size (`int`, *optional*, defaults to 131072):
41
+ Vocabulary size of the NemotronH model. Defines the number of different tokens that can be represented by the
42
+ `inputs_ids` passed when calling [`NemotronHModel`]
43
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
44
+ Whether the model's input and output word embeddings should be tied. Note that this is only relevant if the
45
+ model has a output word embedding layer.
46
+ hidden_size (`int`, *optional*, defaults to 4096):
47
+ Dimension of the hidden representations.
48
+ intermediate_size (`int`, *optional*, defaults to 21504):
49
+ Dimension of the MLP representations.
50
+ num_hidden_layers (`int`, *optional*, defaults to 52):
51
+ Number of hidden layers in the Transformer encoder.
52
+ hybrid_override_pattern (`str`, *optional*, defaults to `"M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-"`):
53
+ The pattern of the hybrid model. The pattern is a string of characters where each character represents M: Mamba2, *: Attention, -: MLP
54
+ num_attention_heads (`int`, *optional*, defaults to 32):
55
+ Number of attention heads for each attention layer in the Transformer encoder.
56
+ attention_head_dim (`int`, *optional*, defaults to 128):
57
+ Dimension of each attention head.
58
+ num_key_value_heads (`int`, *optional*, defaults to 8):
59
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
60
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
61
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used.
62
+ mlp_hidden_act (`str`, *optional*, defaults to "relu2"):
63
+ The non-linear activation function in the MLP layers.
64
+ attention_bias (`bool`, *optional*, defaults to `False`):
65
+ Whether to use bias in attention layers.
66
+ mlp_bias (`bool`, *optional*, defaults to `False`):
67
+ Whether to use bias in MLP layers.
68
+ use_bias (`bool`, *optional*, defaults to `False`):
69
+ Whether to use bias in the model.
70
+ initializer_range (`float`, *optional*, defaults to 0.02):
71
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
72
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
73
+ The epsilon used by the layer normalization layers.
74
+ residual_in_fp32 (`bool`, *optional*, defaults to `False`):
75
+ Whether or not residuals should be in `float32`. If set to `False` residuals will keep the same `dtype` as the rest of the model.
76
+ use_cache (`bool`, *optional*, defaults to `True`):
77
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
78
+ relevant if `config.is_decoder=True`.
79
+ num_logits_to_keep (`int` or `None`, *optional*, defaults to 1):
80
+ Number of prompt logits to calculate during generation. If `None`, all logits will be calculated. If an
81
+ integer value, only last `num_logits_to_keep` logits will be calculated.
82
+ pad_token_id (`int`, *optional*, defaults to 0):
83
+ The id of the padding token.
84
+ bos_token_id (`int`, *optional*, defaults to 1):
85
+ The id of the "beginning-of-sequence" token.
86
+ eos_token_id (`int`, *optional*, defaults to 2):
87
+ The id of the "end-of-sequence" token.
88
+ sliding_window (`int`, *optional*, defaults to None):
89
+ Sliding window attention window size.
90
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
91
+ The maximum sequence length that this model might ever be used with.
92
+ attention_dropout (`float`, *optional*, defaults to 0.0):
93
+ The dropout ratio for the attention probabilities.
94
+ hidden_dropout (`float`, *optional*, defaults to 0.0):
95
+ The dropout ratio for the hidden states.
96
+ use_mamba_kernels (`bool`, *optional*, defaults to `True`):
97
+ Flag indicating whether or not to use the fast mamba kernels. These are available only if `mamba-ssm` and
98
+ `causal-conv1d` are installed, and the mamba modules are running on a CUDA device.
99
+ ssm_state_size (`int`, *optional*, defaults to 128):
100
+ The dimension of the mamba state space latents.
101
+ mamba_num_heads (`int`, *optional*, defaults to 128):
102
+ Number of heads in Mamba layers.
103
+ mamba_n_groups (`int`, *optional*, defaults to 8):
104
+ Number of groups in Mamba layers.
105
+ mamba_head_dim (`int`, *optional*, defaults to 64):
106
+ Dimension of each Mamba head.
107
+ mamba_d_conv (`int`, *optional*, defaults to 4):
108
+ The size of the mamba convolution kernel.
109
+ mamba_expand (`int`, *optional*, defaults to 2):
110
+ Expanding factor used to determine the mamba intermediate size.
111
+ mamba_hidden_act (`str`, *optional*, defaults to "silu"):
112
+ The non-linear activation function in the Mamba layers.
113
+ mamba_dt_min (`float`, *optional*, defaults to 0.001):
114
+ Minimum value for the time step in Mamba.
115
+ mamba_dt_max (`float`, *optional*, defaults to 0.1):
116
+ Maximum value for the time step in Mamba.
117
+ mamba_dt_limit (`tuple`, *optional*, defaults to (0.0, float("inf"))):
118
+ Limits for the time step in Mamba.
119
+ mamba_dt_init_floor (`float`, *optional*, defaults to 1e-4):
120
+ Floor value for time step initialization in Mamba.
121
+ mamba_conv_bias (`bool`, *optional*, defaults to `True`):
122
+ Whether to use bias in the convolution layer of the mamba mixer block.
123
+ mamba_proj_bias (`bool`, *optional*, defaults to `False`):
124
+ Whether to use bias in the input and output projections of the mamba mixer block.
125
+ mamba_chunk_size (`int`, *optional*, defaults to 256):
126
+ Size of chunks for Mamba processing.
127
+ rescale_prenorm_residual (`bool`, *optional*, defaults to `True`):
128
+ Whether to rescale the pre-normalization residual connections.
129
+ """
130
+
131
+ model_type = "nemotron_h"
132
+ keys_to_ignore_at_inference = ["past_key_values"]
133
+
134
+ def __init__(
135
+ self,
136
+ vocab_size=131072,
137
+ tie_word_embeddings=False,
138
+ hidden_size=4096,
139
+ intermediate_size=21504,
140
+ num_hidden_layers=52,
141
+ hybrid_override_pattern="M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-",
142
+ num_attention_heads=32,
143
+ #attention_head_dim=128,
144
+ head_dim=128,
145
+ num_key_value_heads=8, # nemo: num_query_groups
146
+ mlp_hidden_act="relu2",
147
+ attention_bias=False,
148
+ mlp_bias=False,
149
+ use_bias=False,
150
+ initializer_range=0.02, # nemo: init_method_std
151
+ layer_norm_epsilon=1e-5, # nemo: layernorm_epsilon
152
+ residual_in_fp32=False, # Megatron Core default value
153
+ use_cache=True,
154
+ num_logits_to_keep=1,
155
+ pad_token_id=0,
156
+ bos_token_id=1,
157
+ eos_token_id=2,
158
+ sliding_window=None,
159
+ max_position_embeddings=4096,
160
+ attention_dropout=0.0,
161
+ hidden_dropout=0.0, # * ADDED
162
+ use_mamba_kernels=True,
163
+ ssm_state_size=128, # mamba_state_size
164
+ mamba_num_heads=128,
165
+ mamba_n_groups=8, # nemo: mamba_ssm_ngroups = num_heads
166
+ mamba_head_dim=64,
167
+ mamba_d_conv=4,
168
+ mamba_expand=2,
169
+ mamba_hidden_act="silu",
170
+ mamba_dt_min=0.001,
171
+ mamba_dt_max=0.1,
172
+ mamba_dt_limit=(0.0, float("inf")),
173
+ mamba_dt_init_floor=1e-4,
174
+ mamba_conv_bias=True,
175
+ mamba_proj_bias=False,
176
+ mamba_chunk_size=256,
177
+ rescale_prenorm_residual=True,
178
+ **kwargs,
179
+ ):
180
+ self.vocab_size = vocab_size
181
+ self.tie_word_embeddings = tie_word_embeddings
182
+ self.hidden_size = hidden_size
183
+ self.intermediate_size = intermediate_size
184
+ self.num_hidden_layers = num_hidden_layers
185
+ self.hybrid_override_pattern = hybrid_override_pattern
186
+ self.num_attention_heads = num_attention_heads
187
+ #self.attention_head_dim = attention_head_dim
188
+ self.head_dim = head_dim
189
+ self.sliding_window = sliding_window
190
+ self.max_position_embeddings = max_position_embeddings
191
+ self.attention_dropout = attention_dropout
192
+ self.hidden_dropout = hidden_dropout
193
+
194
+ # Validate hybrid_override_pattern
195
+ # M: Mamba2, *: Attention, -: MLP
196
+ assert len(self.hybrid_override_pattern) == self.num_hidden_layers, "hybrid_override_pattern must have the same length as num_hidden_layers"
197
+ assert re.match(r"^[*-M]+$", self.hybrid_override_pattern), "hybrid_override_pattern must only contain characters 'M', '*', or '-'"
198
+
199
+ # for backward compatibility
200
+ if num_key_value_heads is None:
201
+ num_key_value_heads = num_attention_heads
202
+
203
+ self.num_key_value_heads = num_key_value_heads
204
+ self.mlp_hidden_act = mlp_hidden_act
205
+ self.attention_bias = attention_bias
206
+ self.mlp_bias = mlp_bias
207
+ self.use_bias = use_bias
208
+ self.initializer_range = initializer_range
209
+ self.layer_norm_epsilon = layer_norm_epsilon
210
+ self.residual_in_fp32 = residual_in_fp32
211
+
212
+ self.use_cache = use_cache
213
+ self.num_logits_to_keep = num_logits_to_keep
214
+
215
+ self.use_mamba_kernels = use_mamba_kernels
216
+ self.n_groups = mamba_n_groups
217
+ self.mamba_head_dim = mamba_head_dim
218
+ self.ssm_state_size = ssm_state_size
219
+ self.mamba_num_heads = mamba_num_heads
220
+ self.conv_kernel = mamba_d_conv
221
+ self.expand = mamba_expand
222
+ self.mamba_hidden_act = mamba_hidden_act
223
+ self.time_step_min = mamba_dt_min
224
+ self.time_step_max = mamba_dt_max
225
+ self.time_step_limit = mamba_dt_limit
226
+ self.time_step_floor = mamba_dt_init_floor
227
+ self.use_conv_bias = mamba_conv_bias
228
+ self.mamba_proj_bias = mamba_proj_bias
229
+ self.chunk_size = mamba_chunk_size
230
+ self.rescale_prenorm_residual = rescale_prenorm_residual
231
+
232
+ super().__init__(
233
+ pad_token_id=pad_token_id,
234
+ bos_token_id=bos_token_id,
235
+ eos_token_id=eos_token_id,
236
+ tie_word_embeddings=tie_word_embeddings,
237
+ **kwargs,
238
+ )
239
+
240
+ @property
241
+ def layers_block_type(self):
242
+ return [
243
+ "mamba" if self.hybrid_override_pattern[i] == "M" else
244
+ "attention" if self.hybrid_override_pattern[i] == "*" else "mlp"
245
+ for i in range(self.num_hidden_layers)]
configuration_radio.py ADDED
@@ -0,0 +1,152 @@
1
+ # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2
+ #
3
+ # NVIDIA CORPORATION and its licensors retain all intellectual property
4
+ # and proprietary rights in and to this software, related documentation
5
+ # and any modifications thereto. Any use, reproduction, disclosure or
6
+ # distribution of this software and related documentation without an express
7
+ # license agreement from NVIDIA CORPORATION is strictly prohibited.
8
+
9
+ from dataclasses import dataclass
10
+ from typing import Optional, NamedTuple, Union, List, Dict
11
+
12
+ from transformers import PretrainedConfig
13
+
14
+
15
+ class Resolution(NamedTuple):
16
+ height: int
17
+ width: int
18
+
19
+
20
+ @dataclass
21
+ class RadioResource:
22
+ url: str
23
+ patch_size: int
24
+ max_resolution: int
25
+ preferred_resolution: Resolution
26
+ vitdet_num_windowed: Optional[int] = None
27
+ vitdet_num_global: Optional[int] = None
28
+
29
+
30
+ RESOURCE_MAP = {
31
+ # RADIOv2.5
32
+ "radio_v2.5-b": RadioResource(
33
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio-v2.5-b_half.pth.tar?download=true",
34
+ patch_size=16,
35
+ max_resolution=2048,
36
+ preferred_resolution=(768, 768),
37
+ vitdet_num_global=4,
38
+ ),
39
+ "radio_v2.5-l": RadioResource(
40
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio-v2.5-l_half.pth.tar?download=true",
41
+ patch_size=16,
42
+ max_resolution=2048,
43
+ preferred_resolution=(768, 768),
44
+ vitdet_num_global=4,
45
+ ),
46
+ "radio_v2.5-h": RadioResource(
47
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.5-h.pth.tar?download=true",
48
+ patch_size=16,
49
+ max_resolution=2048,
50
+ preferred_resolution=(768, 768),
51
+ vitdet_num_global=4,
52
+ ),
53
+ "radio_v2.5-h-norm": RadioResource(
54
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.5-h-norm.pth.tar?download=true",
55
+ patch_size=16,
56
+ max_resolution=2048,
57
+ preferred_resolution=(768, 768),
58
+ vitdet_num_global=4,
59
+ ),
60
+ "radio_v2.5-g": RadioResource(
61
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.5-g.pth.tar?download=true",
62
+ patch_size=14,
63
+ max_resolution=1792,
64
+ preferred_resolution=(896, 896),
65
+ vitdet_num_global=8,
66
+ ),
67
+ # RADIO
68
+ "radio_v2.1": RadioResource(
69
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.1_bf16.pth.tar?download=true",
70
+ patch_size=16,
71
+ max_resolution=2048,
72
+ preferred_resolution=Resolution(432, 432),
73
+ vitdet_num_windowed=5,
74
+ ),
75
+ "radio_v2": RadioResource(
76
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.pth.tar?download=true",
77
+ patch_size=16,
78
+ max_resolution=2048,
79
+ preferred_resolution=Resolution(432, 432),
80
+ vitdet_num_windowed=5,
81
+ ),
82
+ "radio_v1": RadioResource(
83
+ "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v1.pth.tar?download=true",
84
+ patch_size=14,
85
+ max_resolution=1050,
86
+ preferred_resolution=Resolution(378, 378),
87
+ ),
88
+ # E-RADIO
89
+ "e-radio_v2": RadioResource(
90
+ "https://huggingface.co/nvidia/RADIO/resolve/main/eradio_v2.pth.tar?download=true",
91
+ patch_size=16,
92
+ max_resolution=2048,
93
+ preferred_resolution=Resolution(512, 512),
94
+ ),
95
+ # C-RADIO
96
+ "c-radio_v2.5-g": RadioResource(
97
+ "https://huggingface.co/nvidia/C-RADIOv2-g/resolve/main/c-radio_v2-g_half.pth.tar",
98
+ patch_size=16,
99
+ max_resolution=2048,
100
+ preferred_resolution=(768, 768),
101
+ vitdet_num_global=8,
102
+ ),
103
+ "c-radio_v3-l": RadioResource(
104
+ # NOTE: Currently, this model cannot be loaded via TorchHub. Instead, use the transformers API at https://huggingface.co/nvidia/C-RADIOv3-L
105
+ # and accept the license terms.
106
+ "https://huggingface.co/nvidia/C-RADIOv3-L/resolve/main/c-radio-v3_l_half.pth.tar?download=true",
107
+ patch_size=16,
108
+ max_resolution=2048,
109
+ preferred_resolution=Resolution(512, 512),
110
+ ),
111
+ }
112
+
113
+ DEFAULT_VERSION = "radio_v2.5-h"
114
+
115
+
116
+ class RADIOConfig(PretrainedConfig):
117
+ """Pretrained Hugging Face configuration for RADIO models."""
118
+
119
+ def __init__(
120
+ self,
121
+ args: Optional[dict] = None,
122
+ version: Optional[str] = DEFAULT_VERSION,
123
+ patch_size: Optional[int] = None,
124
+ max_resolution: Optional[int] = None,
125
+ preferred_resolution: Optional[Resolution] = None,
126
+ adaptor_names: Union[str, List[str]] = None,
127
+ adaptor_configs: Dict[str, Dict[str, int]] = None,
128
+ vitdet_window_size: Optional[int] = None,
129
+ feature_normalizer_config: Optional[dict] = None,
130
+ inter_feature_normalizer_config: Optional[dict] = None,
131
+ **kwargs,
132
+ ):
133
+ self.args = args
134
+ for field in ["dtype", "amp_dtype"]:
135
+ if self.args is not None and field in self.args:
136
+ # Convert to a string in order to make it serializable.
137
+ # For example for torch.float32 we will store "float32",
138
+ # for "bfloat16" we will store "bfloat16".
139
+ self.args[field] = str(args[field]).split(".")[-1]
140
+ self.version = version
141
+ resource = RESOURCE_MAP[version]
142
+ self.patch_size = patch_size or resource.patch_size
143
+ self.max_resolution = max_resolution or resource.max_resolution
144
+ self.preferred_resolution = (
145
+ preferred_resolution or resource.preferred_resolution
146
+ )
147
+ self.adaptor_names = adaptor_names
148
+ self.adaptor_configs = adaptor_configs
149
+ self.vitdet_window_size = vitdet_window_size
150
+ self.feature_normalizer_config = feature_normalizer_config
151
+ self.inter_feature_normalizer_config = inter_feature_normalizer_config
152
+ super().__init__(**kwargs)
evs.py ADDED
@@ -0,0 +1,73 @@
1
+ import torch
2
+ from typing import Tuple
3
+
4
+ class EfficientVideoSampling:
5
+ @staticmethod
6
+ def compute_retention_mask(
7
+ *,
8
+ video_embeds: torch.FloatTensor,
9
+ thw: torch.LongTensor,
10
+ spatial_merge_size: int,
11
+ q: float,
12
+ ):
13
+ """
14
+ Computes the retention mask for video embeddings based on the grid dimensions.
15
+
16
+ Args:
17
+ video_embeds (`torch.FloatTensor` of shape `(T * H * W, hidden_size)`):
18
+ The video embeddings to compute the retention mask for.
19
+ thw (`torch.LongTensor` of shape `(3)`):
20
+ The temporal, height and width of feature shape of each video in LLM.
21
+ spatial_merge_size (`int`): The spatial merge size of the video embeddings.
22
+ If embeddings will be downsampled *later*, this should be the downsampling factor.
23
+ q: (`float`): Pruning rate factor, indicating number of tokens to prune (remove)
24
+
25
+ Returns:
26
+ `torch.Tensor`: The retention mask for the video embeddings (T * H * W).
27
+ 1 for tokens to keep, 0 for tokens to prune.
28
+ """
29
+ T, H, W = thw
30
+
31
+ # video_embeds = einops.rearrange(
32
+ # video_embeds,
33
+ # "(T H W) C -> T H W C",
34
+ # T=T,
35
+ # H=H // spatial_merge_size,
36
+ # W=W // spatial_merge_size,
37
+ # )
38
+ # Use reshape instead of einops to avoid graph breaks
39
+ video_embeds = video_embeds.reshape(
40
+ T, H // spatial_merge_size, W // spatial_merge_size, video_embeds.size(-1)
41
+ )
42
+
43
+ # Core EVS
44
+ similarity = torch.nn.functional.cosine_similarity(
45
+ video_embeds[1:, ...], video_embeds[:-1, ...], dim=-1
46
+ )
47
+ dissimilarity = 1 - similarity
48
+
49
+ # Always ensure we include all tokens from the first frame
50
+ dissimilarity = torch.cat(
51
+ [255 * torch.ones_like(video_embeds[:1, :, :, 0]), dissimilarity], dim=0
52
+ )
53
+ dissimilarity_flat = dissimilarity.view(-1)
54
+
55
+ min_num_tokens = (H // spatial_merge_size) * (W // spatial_merge_size) # a single frame
56
+ evs_num_tokens = int(T * min_num_tokens * (1 - q))
57
+ num_tokens_to_keep = max(min_num_tokens, evs_num_tokens)
58
+
59
+ order = torch.argsort(dissimilarity_flat,
60
+ dim=-1,
61
+ descending=True,
62
+ stable=True)
63
+ topk_indices = order[:num_tokens_to_keep]
64
+
65
+ retention_mask = torch.zeros_like(dissimilarity_flat, dtype=torch.bool)
66
+ retention_mask[topk_indices] = True
67
+ retention_mask = retention_mask.reshape(dissimilarity.size())
68
+
69
+ # print(
70
+ # f"Computed retention mask of shape {retention_mask.shape=} with sparsity {retention_mask.float().mean().item():.4f} for {q=}",
71
+ # )
72
+ mask = retention_mask.view(-1) # "T H W -> (T H W)"
73
+ return mask
explainability.md ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Field | Response
2
+ :------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
3
+ Intended Task/Domain: | Visual Question Answering
4
+ Model Type: | Transformer
5
+ Intended Users: | Individuals and businesses that need to process documents such as invoices, receipts, and manuals. Also, users who are building multi-modal agents and RAG systems.
6
+ Output: | Text
7
+ Tools used to evaluate datasets to identify synthetic data and ensure data authenticity. | We used a Gemma-3 4B-based filtering model fine-tuned on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) to ensure the quality of synthetic data.
8
+ Describe how the model works: | Vision Encoder and a Nemotron 5.5H -12B Language Encoder. It processes multiple input modalities, including text, multiple images, and video. It fuses these inputs and uses its large language model backbone with a 128K context length to perform visual Q&A, summarization, and data extraction.
9
+ Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
10
+ Technical Limitations & Mitigation: | The model has a limited maximum resolution determined by a 12-tile layout constraint, where each tile is 512x512 pixels. It also supports a limited number of input images (up to 4) and has a maximum context length of 128K tokens for combined input and output.
11
+ Verified to have met prescribed NVIDIA quality standards: | Yes
12
+ Performance Metrics: | Accuracy (Visual Question Answering), Latency, Throughput
13
+ Potential Known Risks: | The Model may produce output that is biased, toxic, or incorrect responses. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The Model may also generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control.
14
+ Licensing: | Governing Terms: Use of this model is governed by the [ NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
15
+
generation_config.json ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": [
5
+ 2,
6
+ 11,
7
+ 12
8
+ ],
9
+ "pad_token_id": 0,
10
+ "transformers_version": "4.51.3"
11
+ }
hf_quant_config.json ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "producer": {
3
+ "name": "modelopt",
4
+ "version": "0.37.0.dev5+g76fb12d47.d20250905"
5
+ },
6
+ "quantization": {
7
+ "quant_algo": "FP8",
8
+ "kv_cache_quant_algo": null,
9
+ "exclude_modules": [
10
+ "model.layers.language_model.lm_head",
11
+ "model.layers.mlp1*",
12
+ "model.layers.*.conv1d*",
13
+ "model.layers.vision_model*",
14
+ "lm_head"
15
+ ]
16
+ }
17
+ }
image_processing.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List, Optional, Union, Any, Dict
2
+
3
+ from PIL import Image
4
+ import torch
5
+ from transformers.image_processing_base import BatchFeature
6
+ from transformers.image_processing_utils_fast import BaseImageProcessorFast, divide_to_patches
7
+ from transformers.image_utils import (make_list_of_images, get_image_size,
8
+ get_image_type, ImageInput, ImageType, ChannelDimension)
9
+ from transformers.utils import TensorType
10
+ import torchvision.transforms as T
11
+
12
+
13
+
14
+ class NemotronNanoVLV2ImageProcessor(BaseImageProcessorFast):
15
+ model_input_names = ["pixel_values"]
16
+
17
+ def __init__(self, image_size=512, max_num_tiles=12, use_thumbnail=True, norm_mean=None, norm_std=None, do_rescale=True, patch_size=16, downsample_ratio=0.5, **kwargs):
18
+ super().__init__(**kwargs)
19
+ self.image_size = image_size
20
+ self.max_num_tiles = max_num_tiles
21
+ self.use_thumbnail = use_thumbnail
22
+ self.norm_mean = norm_mean
23
+ self.norm_std = norm_std
24
+ self.do_rescale = do_rescale
25
+ self.num_image_token = int((image_size // patch_size) ** 2 * (downsample_ratio ** 2))
26
+
27
+ def _process_image(
28
+ self,
29
+ image: ImageInput,
30
+ **kwargs,
31
+ ) -> torch.Tensor:
32
+ image_type = get_image_type(image)
33
+ if image_type == ImageType.PIL:
34
+ if image.mode != 'RGB':
35
+ image = image.convert('RGB')
36
+ image = T.ToTensor()(image)
37
+ return image
38
+
39
+ def _preprocess(
40
+ self,
41
+ images: List[torch.Tensor],
42
+ image_size: int = None,
43
+ max_num_tiles: int = None,
44
+ use_thumbnail: bool = None,
45
+ do_rescale: bool = None,
46
+ return_tensors: Optional[Union[str, TensorType]] = None,
47
+ **kwargs,
48
+ ) -> List[torch.Tensor]:
49
+ image_size = image_size if image_size is not None else self.image_size
50
+ max_num_tiles = max_num_tiles if max_num_tiles is not None else self.max_num_tiles
51
+ use_thumbnail = use_thumbnail if use_thumbnail is not None else self.use_thumbnail
52
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
53
+
54
+ images = make_list_of_images(images)
55
+
56
+ all_patches = []
57
+ num_patches = []
58
+ for image in images:
59
+ patches = dynamic_preprocess(image, image_size, max_num_tiles, use_thumbnail)
60
+ all_patches.extend(patches)
61
+ num_patches.append(len(patches))
62
+
63
+ pixel_values = torch.stack(all_patches, dim=0)
64
+ norm_mean = torch.Tensor(self.norm_mean).view(1, 3, 1, 1)
65
+ norm_std = torch.Tensor(self.norm_std).view(1, 3, 1, 1)
66
+ pixel_values = (pixel_values - norm_mean) / norm_std
67
+ return BatchFeature(data={"pixel_values": pixel_values, "num_patches": num_patches}, tensor_type=return_tensors)
68
+
69
+
70
+ def get_internvl_target_ratios(
71
+ min_num: int,
72
+ max_num: int,
73
+ ) -> list[tuple[int, int]]:
74
+ target_ratios = {(i, j)
75
+ for n in range(min_num, max_num + 1)
76
+ for i in range(1, n + 1)
77
+ for j in range(1, n + 1) if min_num <= i * j <= max_num}
78
+ return sorted(target_ratios, key=lambda x: x[0] * x[1])
79
+
80
+
81
+ # From https://github.com/OpenGVLab/InternVL/blob/c62fa4f7c850165d7386bdc48ac6bc5a6fab0864/internvl_chat/internvl/train/dataset.py#L685
82
+ # Copyright (c) 2023 OpenGVLab.
83
+ def find_closest_aspect_ratio(
84
+ aspect_ratio: float,
85
+ target_ratios: list[tuple[int, int]],
86
+ width: int,
87
+ height: int,
88
+ image_size: int,
89
+ ) -> tuple[int, int]:
90
+ best_ratio_diff = float("inf")
91
+ best_ratio = (1, 1)
92
+ area = width * height
93
+ for ratio in target_ratios:
94
+ target_aspect_ratio = ratio[0] / ratio[1]
95
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
96
+ if ratio_diff < best_ratio_diff:
97
+ best_ratio_diff = ratio_diff
98
+ best_ratio = ratio
99
+ elif ratio_diff == best_ratio_diff:
100
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
101
+ best_ratio = ratio
102
+ return best_ratio
103
+
104
+
105
+ def calculate_targets(
106
+ orig_width: int,
107
+ orig_height: int,
108
+ target_ratios: list[tuple[int, int]],
109
+ image_size: int,
110
+ ) -> tuple[int, int, int]:
111
+ aspect_ratio = orig_width / orig_height
112
+
113
+ # find the closest aspect ratio to the target
114
+ target_aspect_ratio = find_closest_aspect_ratio(
115
+ aspect_ratio,
116
+ target_ratios,
117
+ width=orig_width,
118
+ height=orig_height,
119
+ image_size=image_size,
120
+ )
121
+
122
+ # calculate the target width and height
123
+ target_width = image_size * target_aspect_ratio[0]
124
+ target_height = image_size * target_aspect_ratio[1]
125
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
126
+
127
+ return blocks, target_width, target_height
128
+
129
+
130
+ def dynamic_preprocess(image, image_size=512, max_num_tiles=12, use_thumbnail=True):
131
+ orig_height, orig_width = get_image_size(image, channel_dim=ChannelDimension.FIRST)
132
+ target_ratios = get_internvl_target_ratios(1, max_num_tiles)
133
+
134
+ blocks, target_width, target_height = calculate_targets(
135
+ orig_width,
136
+ orig_height,
137
+ target_ratios,
138
+ image_size
139
+ )
140
+ # resize the image
141
+ resized_img = T.Resize((target_height, target_width), interpolation=T.InterpolationMode.BICUBIC)(image)
142
+ patches = divide_to_patches(resized_img, image_size)
143
+ assert len(patches) == blocks
144
+ if use_thumbnail and len(patches) != 1:
145
+ thumbnail_img = T.Resize((image_size, image_size), interpolation=T.InterpolationMode.BICUBIC)(image)
146
+ patches.append(thumbnail_img)
147
+
148
+ return patches
llama_nemotron_toolcall_parser_no_streaming.py ADDED
@@ -0,0 +1,470 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SPDX-License-Identifier: Apache-2.0
2
+
3
+ import ast
4
+ import json
5
+ import re
6
+ from collections.abc import Sequence
7
+ from typing import Union
8
+
9
+ import partial_json_parser
10
+ from partial_json_parser.core.options import Allow
11
+
12
+ from vllm.entrypoints.openai.protocol import (
13
+ ChatCompletionRequest,
14
+ DeltaFunctionCall, DeltaMessage,
15
+ DeltaToolCall,
16
+ ExtractedToolCallInformation,
17
+ FunctionCall,
18
+ ToolCall,
19
+ )
20
+ from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import (
21
+ ToolParser,
22
+ ToolParserManager,
23
+ )
24
+ from vllm.logger import init_logger
25
+ from vllm.transformers_utils.tokenizer import AnyTokenizer
26
+ from vllm.utils import random_uuid
27
+
28
+ logger = init_logger(__name__)
29
+
30
+
31
+ @ToolParserManager.register_module("llama_nemotron_xml")
32
+ class LlamaNemotronXMLToolParser(ToolParser):
33
+
34
+ def __init__(self, tokenizer: AnyTokenizer):
35
+ super().__init__(tokenizer)
36
+
37
+ self.current_tool_name_sent: bool = False
38
+ self.prev_tool_call_arr: list[dict] = []
39
+ self.current_tool_id: int = -1 # Potentially for streaming
40
+ self.streamed_args_for_tool: list[str] = [] # Potentially for streaming
41
+
42
+ self.tool_call_start_token: str = "<tool_call>"
43
+ self.tool_call_end_token: str = "</tool_call>"
44
+
45
+ # Regex to find full <tool_call>...</tool_call> blocks and capture their content
46
+ self.tool_call_block_regex = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
47
+ # Regex to find <tool>...</tool> within a tool_call block content
48
+ self.name_regex = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)
49
+ # Regex to find <key>value</key> pairs within the tool_call block content (excluding <tool> tags)
50
+ self.param_regex = re.compile(r"<([^/>\s]+)>(.*?)</\1>", re.DOTALL)
51
+
52
+ def extract_tool_calls(
53
+ self,
54
+ model_output: str,
55
+ request: ChatCompletionRequest,
56
+ ) -> ExtractedToolCallInformation:
57
+
58
+ tool_call_start_index = model_output.find(self.tool_call_start_token)
59
+
60
+ if tool_call_start_index == -1:
61
+ return ExtractedToolCallInformation(
62
+ tools_called=False,
63
+ tool_calls=[],
64
+ content=model_output,
65
+ )
66
+
67
+ content = model_output[:tool_call_start_index].strip()
68
+ tool_calls_str_content = model_output[tool_call_start_index:]
69
+
70
+ parsed_tool_calls = []
71
+
72
+ try:
73
+ # Find all occurrences of <tool_call>...</tool_call>
74
+ xml_tool_call_contents = self.tool_call_block_regex.findall(tool_calls_str_content)
75
+
76
+ for tool_content_str in xml_tool_call_contents:
77
+ name_match = self.name_regex.search(tool_content_str)
78
+ if not name_match:
79
+ logger.warning(f"Could not find tool name in XML block: {tool_content_str}")
80
+ continue
81
+ tool_name = name_match.group(1).strip()
82
+
83
+ parsed_arguments = {}
84
+
85
+ # Find all parameter tags in the tool_call content, excluding the <tool> tag
86
+ param_matches = self.param_regex.finditer(tool_content_str)
87
+
88
+ for match in param_matches:
89
+ param_name = match.group(1).strip()
90
+ param_value_str = match.group(2).strip()
91
+
92
+ # Skip the <tool> tag since it's not a parameter
93
+ if param_name == "tool":
94
+ continue
95
+
96
+ target_type = None
97
+ # Try to get type from request.tools schema
98
+ if request.tools:
99
+ for tool_def in request.tools:
100
+ if tool_def.function.name == tool_name:
101
+ if tool_def.function.parameters and \
102
+ isinstance(tool_def.function.parameters, dict) and \
103
+ "properties" in tool_def.function.parameters and \
104
+ isinstance(tool_def.function.parameters["properties"], dict) and \
105
+ param_name in tool_def.function.parameters["properties"] and \
106
+ isinstance(tool_def.function.parameters["properties"][param_name], dict):
107
+ target_type = tool_def.function.parameters["properties"][param_name].get("type")
108
+ break
109
+
110
+ typed_param_value = param_value_str # Default to string
111
+ if target_type:
112
+ try:
113
+ if target_type == "string":
114
+ typed_param_value = param_value_str
115
+ elif target_type == "integer":
116
+ typed_param_value = int(param_value_str)
117
+ elif target_type == "number":
118
+ typed_param_value = float(param_value_str)
119
+ elif target_type == "boolean":
120
+ typed_param_value = param_value_str.lower() == 'true'
121
+ elif target_type in ["object", "array"]:
122
+ try:
123
+ typed_param_value = json.loads(param_value_str)
124
+ except json.JSONDecodeError:
125
+ # Fallback for non-strict JSON like Python dict/list string
126
+ typed_param_value = ast.literal_eval(param_value_str)
127
+ else: # Unknown type, keep as string
128
+ typed_param_value = param_value_str
129
+ except (ValueError, SyntaxError, json.JSONDecodeError) as e:
130
+ logger.warning(
131
+ f"Could not convert param '{param_name}' with value '{param_value_str}' "
132
+ f"to type '{target_type}'. Error: {e}. Using string value."
133
+ )
134
+ typed_param_value = param_value_str
135
+ else: # No schema type, try ast.literal_eval
136
+ try:
137
+ # For values like "true", "123", "['a', 'b']"
138
+ # ast.literal_eval('some_string_without_quotes') will raise SyntaxError
139
+ if (param_value_str.startswith("'") and param_value_str.endswith("'")) or \
140
+ (param_value_str.startswith('"') and param_value_str.endswith('"')) or \
141
+ (param_value_str.startswith('[') and param_value_str.endswith(']')) or \
142
+ (param_value_str.startswith('{') and param_value_str.endswith('}')) or \
143
+ param_value_str.lower() in ['true', 'false', 'none'] or \
144
+ param_value_str.replace('.', '', 1).isdigit() or \
145
+ (param_value_str.startswith('-') and param_value_str[1:].replace('.', '', 1).isdigit()):
146
+ typed_param_value = ast.literal_eval(param_value_str)
147
+ else: # It's likely a plain string not meant for ast.literal_eval
148
+ typed_param_value = param_value_str
149
+ except (ValueError, SyntaxError):
150
+ typed_param_value = param_value_str # Keep as string if ast.literal_eval fails
151
+
152
+ parsed_arguments[param_name] = typed_param_value
153
+
154
+ parsed_tool_calls.append(ToolCall(
155
+ id=f"call_{random_uuid()}",
156
+ type="function",
157
+ function=FunctionCall(
158
+ name=tool_name,
159
+ arguments=json.dumps(parsed_arguments, ensure_ascii=False),
160
+ ),
161
+ ))
162
+
163
+ return ExtractedToolCallInformation(
164
+ tools_called=len(parsed_tool_calls) > 0,
165
+ tool_calls=parsed_tool_calls,
166
+ content=content if content else None,
167
+ )
168
+
169
+ except Exception:
170
+ logger.exception(f"Error in extracting XML tool call from response. Response: {model_output}")
171
+ # Fallback to original model output if parsing fails catastrophically
172
+ return ExtractedToolCallInformation(
173
+ tools_called=False,
174
+ tool_calls=[],
175
+ content=model_output,
176
+ )
177
+
178
+ def extract_tool_calls_streaming(
179
+ self,
180
+ previous_text: str,
181
+ current_text: str,
182
+ delta_text: str,
183
+ previous_token_ids: Sequence[int],
184
+ current_token_ids: Sequence[int],
185
+ delta_token_ids: Sequence[int],
186
+ request: ChatCompletionRequest,
187
+ ) -> Union[DeltaMessage, None]:
188
+
189
+ raise NotImplementedError("Tool calling is not supported in streaming mode!")
190
+
191
+
192
+ @ToolParserManager.register_module("llama_nemotron_json")
193
+ class LlamaNemotronJSONToolParser(ToolParser):
194
+
195
+ def __init__(self, tokenizer: AnyTokenizer):
196
+ super().__init__(tokenizer)
197
+
198
+ self.current_tool_name_sent: bool = False
199
+ self.prev_tool_call_arr: list[dict] = []
200
+ self.current_tool_id: int = -1
201
+ self.streamed_args_for_tool: list[str] = []
202
+
203
+ self.tool_call_start_token: str = "<TOOLCALL>"
204
+ self.tool_call_end_token: str = "</TOOLCALL>"
205
+
206
+ self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)
207
+
208
+ def extract_tool_calls(
209
+ self,
210
+ model_output: str,
211
+ request: ChatCompletionRequest,
212
+ ) -> ExtractedToolCallInformation:
213
+
214
+ if self.tool_call_start_token not in model_output:
215
+ return ExtractedToolCallInformation(
216
+ tools_called=False,
217
+ tool_calls=[],
218
+ content=model_output,
219
+ )
220
+
221
+ else:
222
+
223
+ try:
224
+ str_tool_calls = self.tool_call_regex.findall(model_output)[0].strip()
225
+ if not str_tool_calls.startswith("["):
226
+ str_tool_calls = "[" + str_tool_calls
227
+ if not str_tool_calls.endswith("]"):
228
+ str_tool_calls = str_tool_calls + "]"
229
+ json_tool_calls = json.loads(str_tool_calls)
230
+ tool_calls = []
231
+ for tool_call in json_tool_calls:
232
+ try:
233
+ tool_calls.append(ToolCall(
234
+ type="function",
235
+ function=FunctionCall(
236
+ name=tool_call["name"],
237
+ arguments=json.dumps(tool_call["arguments"], ensure_ascii=False) \
238
+ if isinstance(tool_call["arguments"], dict) else tool_call["arguments"],
239
+ ),
240
+ ))
241
+ except:
242
+ continue
243
+
244
+ content = model_output[:model_output.rfind(self.tool_call_start_token)]
245
+
246
+ return ExtractedToolCallInformation(
247
+ tools_called=True,
248
+ tool_calls=tool_calls,
249
+ content=content if content else None,
250
+ )
251
+
252
+ except Exception:
253
+ logger.exception(f"Error in extracting tool call from response. Response: {model_output}")
254
+ return ExtractedToolCallInformation(
255
+ tools_called=False,
256
+ tool_calls=[],
257
+ content=model_output,
258
+ )
259
+
260
+ def extract_tool_calls_streaming(
261
+ self,
262
+ previous_text: str,
263
+ current_text: str,
264
+ delta_text: str,
265
+ previous_token_ids: Sequence[int],
266
+ current_token_ids: Sequence[int],
267
+ delta_token_ids: Sequence[int],
268
+ request: ChatCompletionRequest,
269
+ ) -> Union[DeltaMessage, None]:
270
+
271
+ raise NotImplementedError("Tool calling is not supported in streaming mode!")
272
+
273
+
274
+ @ToolParserManager.register_module("llama_nemotron_pythonic")
275
+ class LlamaNemotronPythonicToolParser(ToolParser):
276
+
277
+ def __init__(self, tokenizer: AnyTokenizer):
278
+ super().__init__(tokenizer)
279
+
280
+ self.current_tool_name_sent: bool = False
281
+ self.prev_tool_call_arr: list[dict] = []
282
+ self.current_tool_id: int = -1
283
+ self.streamed_args_for_tool: list[str] = []
284
+
285
+ self.tool_call_start_token: str = "<TOOLCALL>"
286
+ self.tool_call_end_token: str = "</TOOLCALL>"
287
+
288
+ self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)
289
+ # Regex to parse pythonic function calls: function_name(arg1="value1", arg2=123, arg3=True)
290
+ self.function_call_regex = re.compile(r"(\w+)\((.*?)\)$", re.DOTALL)
291
+
292
+ def parse_function_arguments(self, args_str: str) -> dict:
293
+ """Parse pythonic function arguments string into a dictionary"""
294
+ if not args_str.strip():
295
+ return {}
296
+
297
+ # Use ast.parse to safely parse the function call arguments
298
+ # We'll construct a temporary function call and parse it
299
+ try:
300
+ # Create a dummy function call to parse arguments
301
+ dummy_code = f"dummy_func({args_str})"
302
+ parsed = ast.parse(dummy_code, mode='eval')
303
+
304
+ # Extract arguments from the AST
305
+ call_node = parsed.body
306
+ if not isinstance(call_node, ast.Call):
307
+ return {}
308
+
309
+ arguments = {}
310
+
311
+ # Handle keyword arguments
312
+ for keyword in call_node.keywords:
313
+ if keyword.arg is None: # **kwargs
314
+ continue
315
+
316
+ # Convert AST value to Python value
317
+ try:
318
+ value = ast.literal_eval(keyword.value)
319
+ arguments[keyword.arg] = value
320
+ except (ValueError, TypeError):
321
+ # If literal_eval fails, try to get the raw value
322
+ if isinstance(keyword.value, ast.Name):
323
+ arguments[keyword.arg] = keyword.value.id
324
+ elif isinstance(keyword.value, ast.Constant):
325
+ arguments[keyword.arg] = keyword.value.value
326
+ else:
327
+ # Fallback: convert to string
328
+ arguments[keyword.arg] = ast.unparse(keyword.value)
329
+
330
+ # Handle positional arguments (less common in tool calls but supported)
331
+ for i, arg in enumerate(call_node.args):
332
+ try:
333
+ value = ast.literal_eval(arg)
334
+ arguments[f"arg_{i}"] = value
335
+ except (ValueError, TypeError):
336
+ if isinstance(arg, ast.Name):
337
+ arguments[f"arg_{i}"] = arg.id
338
+ elif isinstance(arg, ast.Constant):
339
+ arguments[f"arg_{i}"] = arg.value
340
+ else:
341
+ arguments[f"arg_{i}"] = ast.unparse(arg)
342
+
343
+ return arguments
344
+
345
+ except (SyntaxError, ValueError) as e:
346
+ logger.warning(f"Failed to parse function arguments '{args_str}': {e}")
347
+ return {}
348
+
349
+ def extract_tool_calls(
350
+ self,
351
+ model_output: str,
352
+ request: ChatCompletionRequest,
353
+ ) -> ExtractedToolCallInformation:
354
+
355
+ if self.tool_call_start_token not in model_output:
356
+ return ExtractedToolCallInformation(
357
+ tools_called=False,
358
+ tool_calls=[],
359
+ content=model_output,
360
+ )
361
+
362
+ tool_call_start_index = model_output.find(self.tool_call_start_token)
363
+ content = model_output[:tool_call_start_index].strip()
364
+
365
+ try:
366
+ # Extract content between <TOOLCALL> tags
367
+ tool_call_matches = self.tool_call_regex.findall(model_output)
368
+ if not tool_call_matches:
369
+ return ExtractedToolCallInformation(
370
+ tools_called=False,
371
+ tool_calls=[],
372
+ content=model_output,
373
+ )
374
+
375
+ tool_calls_content = tool_call_matches[0].strip()
376
+
377
+ # Split by lines to get individual function calls
378
+ function_lines = [line.strip() for line in tool_calls_content.split('\n') if line.strip()]
379
+
380
+ parsed_tool_calls = []
381
+
382
+ for func_line in function_lines:
383
+ # Parse each function call
384
+ match = self.function_call_regex.match(func_line)
385
+ if not match:
386
+ logger.warning(f"Could not parse function call: {func_line}")
387
+ continue
388
+
389
+ function_name = match.group(1)
390
+ args_str = match.group(2)
391
+
392
+ # Parse arguments
393
+ parsed_arguments = self.parse_function_arguments(args_str)
394
+
395
+ # Apply type conversion based on schema if available
396
+ if request.tools:
397
+ for tool_def in request.tools:
398
+ if tool_def.function.name == function_name:
399
+ schema_properties = {}
400
+ if (tool_def.function.parameters and
401
+ isinstance(tool_def.function.parameters, dict) and
402
+ "properties" in tool_def.function.parameters and
403
+ isinstance(tool_def.function.parameters["properties"], dict)):
404
+ schema_properties = tool_def.function.parameters["properties"]
405
+
406
+ # Convert arguments based on schema types
407
+ for arg_name, arg_value in parsed_arguments.items():
408
+ if arg_name in schema_properties:
409
+ param_info = schema_properties[arg_name]
410
+ target_type = param_info.get("type")
411
+
412
+ try:
413
+ if target_type == "string" and not isinstance(arg_value, str):
414
+ parsed_arguments[arg_name] = str(arg_value)
415
+ elif target_type == "integer" and not isinstance(arg_value, int):
416
+ parsed_arguments[arg_name] = int(arg_value)
417
+ elif target_type == "number" and not isinstance(arg_value, (int, float)):
418
+ parsed_arguments[arg_name] = float(arg_value)
419
+ elif target_type == "boolean" and not isinstance(arg_value, bool):
420
+ if isinstance(arg_value, str):
421
+ parsed_arguments[arg_name] = arg_value.lower() in ['true', '1', 'yes']
422
+ else:
423
+ parsed_arguments[arg_name] = bool(arg_value)
424
+ elif target_type in ["object", "array"]:
425
+ if isinstance(arg_value, str):
426
+ try:
427
+ parsed_arguments[arg_name] = json.loads(arg_value)
428
+ except json.JSONDecodeError:
429
+ # Keep as string if JSON parsing fails
430
+ pass
431
+ except (ValueError, TypeError) as e:
432
+ logger.warning(f"Type conversion failed for {arg_name}: {e}")
433
+ # Keep original value if conversion fails
434
+ break
435
+
436
+ parsed_tool_calls.append(ToolCall(
437
+ id=f"call_{random_uuid()}",
438
+ type="function",
439
+ function=FunctionCall(
440
+ name=function_name,
441
+ arguments=json.dumps(parsed_arguments, ensure_ascii=False),
442
+ ),
443
+ ))
444
+
445
+ return ExtractedToolCallInformation(
446
+ tools_called=len(parsed_tool_calls) > 0,
447
+ tool_calls=parsed_tool_calls,
448
+ content=content if content else None,
449
+ )
450
+
451
+ except Exception:
452
+ logger.exception(f"Error in extracting pythonic tool call from response. Response: {model_output}")
453
+ return ExtractedToolCallInformation(
454
+ tools_called=False,
455
+ tool_calls=[],
456
+ content=model_output,
457
+ )
458
+
459
+ def extract_tool_calls_streaming(
460
+ self,
461
+ previous_text: str,
462
+ current_text: str,
463
+ delta_text: str,
464
+ previous_token_ids: Sequence[int],
465
+ current_token_ids: Sequence[int],
466
+ delta_token_ids: Sequence[int],
467
+ request: ChatCompletionRequest,
468
+ ) -> Union[DeltaMessage, None]:
469
+
470
+ raise NotImplementedError("Tool calling is not supported in streaming mode!")
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ae4461eba110cf3251a45c60a3a71e27c587f7df8d2bfccecd8c4aeb63298914
3
+ size 4999458416
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20552cb6e7fb2d94e7976f2b60aaba540695498cd3582aba31a82569ba3a8b0f
3
+ size 4990786200
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bc5884919e5d9af689f2908cd97963f8f8ddbbbfbe05ec0796fcd344210d3bdd
3
+ size 4988690304
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e2c02d5ea330834358c3031121cad2d3a9b5071fb589d8b85b88e06cd9658a1
3
+ size 419430616
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling.py ADDED
@@ -0,0 +1,287 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import warnings
3
+ from typing import List, Optional, Tuple, Union
4
+
5
+ import torch
6
+ import transformers
7
+ from torch import nn
8
+ from torch.nn import CrossEntropyLoss
9
+ from transformers import AutoModel, AutoModelForCausalLM, GenerationConfig
10
+ from transformers.modeling_outputs import CausalLMOutputWithPast
11
+ from transformers.modeling_utils import PreTrainedModel
12
+ from transformers.utils import logging
13
+
14
+ from .configuration import NemotronH_Nano_VL_V2_Config
15
+ from .modeling_nemotron_h import NemotronHForCausalLM
16
+ from .evs import EfficientVideoSampling
17
+
18
+ logger = logging.get_logger(__name__)
19
+
20
+
21
+ """
22
+ The following code is adapted from the
23
+ https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B/blob/main/modeling_internvl_chat.py repository
24
+
25
+ The chat function is adapted to handle NVLM 1-D tile-tagging design for dynamic high-resolution images.
26
+ """
27
+
28
+
29
+ class SquaredReLU(nn.Module):
30
+ def forward(self, x):
31
+ return torch.pow(torch.nn.functional.relu(x), 2)
32
+
33
+
34
+ class RMSNorm(nn.Module):
35
+ def __init__(self, hidden_size, eps=1e-5):
36
+ super().__init__()
37
+ self.weight = nn.Parameter(torch.ones(hidden_size))
38
+ self.eps = eps
39
+
40
+ def forward(self, hidden_states):
41
+ input_dtype = hidden_states.dtype
42
+ hidden_states = hidden_states.to(torch.float32)
43
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
44
+ hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
45
+ return (self.weight.to(torch.float32) * hidden_states).to(input_dtype)
46
+
47
+
48
+ def version_cmp(v1, v2, op='eq'):
49
+ import operator
50
+
51
+ from packaging import version
52
+ op_func = getattr(operator, op)
53
+ return op_func(version.parse(v1), version.parse(v2))
54
+
55
+
56
+ class NemotronH_Nano_VL_V2(PreTrainedModel):
57
+ config_class = NemotronH_Nano_VL_V2_Config
58
+ main_input_name = 'pixel_values'
59
+ _supports_flash_attn_2 = True
60
+ _no_split_modules = ['NemotronHBlock']
61
+
62
+ def __init__(self, config: NemotronH_Nano_VL_V2_Config):
63
+ super().__init__(config)
64
+
65
+ assert version_cmp(transformers.__version__, '4.36.2', 'ge')
66
+ image_size = config.force_image_size
67
+ patch_size = config.patch_size
68
+ self.patch_size = patch_size
69
+ self.template = config.template
70
+ self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
71
+ self.downsample_ratio = config.downsample_ratio
72
+ self.ps_version = config.ps_version
73
+ self.image_tag_type = config.image_tag_type
74
+ self.img_context_token_id = config.img_context_token_id
75
+ self.video_context_token_id = config.video_context_token_id
76
+
77
+ logger.info(f'num_image_token: {self.num_image_token}')
78
+ logger.info(f'ps_version: {self.ps_version}')
79
+
80
+ self.language_model = AutoModelForCausalLM.from_config(config.llm_config, trust_remote_code=True)
81
+ self.vision_model = AutoModel.from_config(config.vision_config, trust_remote_code=True)
82
+ self.vision_model.model._initialize_weights = self.vision_model.model._init_weights # WAR for transformers issue 38358
83
+ self.vision_model.radio_model.make_preprocessor_external()
84
+ self.vision_model = self.vision_model.to(self.language_model.config.torch_dtype)
85
+
86
+ self.drop_vision_class_token = True
87
+
88
+ # Construct the vision projection.
89
+ # Default
90
+ vit_hidden_size = config.vit_hidden_size
91
+ vision_projection_hidden_size = config.projector_hidden_size
92
+ llm_hidden_size = config.llm_config.hidden_size
93
+
94
+ self.video_pruning_rate = config.video_pruning_rate
95
+
96
+ self.mlp1 = nn.Sequential(
97
+ RMSNorm(vit_hidden_size * int(1 / self.downsample_ratio) ** 2, eps=1e-5),
98
+ nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio) ** 2, vision_projection_hidden_size, bias=False),
99
+ SquaredReLU(),
100
+ nn.Linear(vision_projection_hidden_size, llm_hidden_size, bias=False)
101
+ )
102
+ self.mlp1 = self.mlp1.to(self.language_model.config.torch_dtype)
103
+
104
+ def forward(
105
+ self,
106
+ pixel_values: torch.FloatTensor,
107
+ input_ids: torch.LongTensor = None,
108
+ attention_mask: Optional[torch.Tensor] = None,
109
+ position_ids: Optional[torch.LongTensor] = None,
110
+ image_flags: Optional[torch.LongTensor] = None,
111
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
112
+ labels: Optional[torch.LongTensor] = None,
113
+ inputs_embeds = None,
114
+ use_cache: Optional[bool] = None,
115
+ output_attentions: Optional[bool] = None,
116
+ output_hidden_states: Optional[bool] = None,
117
+ return_dict: Optional[bool] = None,
118
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
119
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
120
+
121
+ if inputs_embeds is None:
122
+ inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
123
+
124
+ image_flags = image_flags.squeeze(-1)
125
+
126
+ B, N, C = inputs_embeds.shape
127
+ inputs_embeds = inputs_embeds.reshape(B * N, C)
128
+
129
+ input_ids = input_ids.reshape(B * N)
130
+ selected = (input_ids == self.img_context_token_id)
131
+
132
+ vit_batch_size = pixel_values.shape[0]
133
+ vit_embeds = self.extract_feature(pixel_values)
134
+
135
+ del pixel_values
136
+
137
+ if torch.distributed.get_rank() == 0:
138
+ print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')
139
+
140
+ vit_embeds = vit_embeds[image_flags == 1]
141
+ try:
142
+ inputs_embeds[selected] = inputs_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
143
+ except Exception as e:
144
+ vit_embeds = vit_embeds.reshape(-1, C)
145
+ print(f'warning: {e}, inputs_embeds[selected].shape={inputs_embeds[selected].shape}, '
146
+ f'vit_embeds.shape={vit_embeds.shape}')
147
+ n_token = selected.sum()
148
+ inputs_embeds[selected] = inputs_embeds[selected] * 0.0 + vit_embeds[:n_token]
149
+
150
+ del vit_embeds
151
+
152
+ inputs_embeds = inputs_embeds.reshape(B, N, C)
153
+
154
+ outputs = self.language_model(
155
+ inputs_embeds=inputs_embeds,
156
+ attention_mask=attention_mask,
157
+ position_ids=position_ids,
158
+ past_key_values=past_key_values,
159
+ use_cache=use_cache,
160
+ output_attentions=output_attentions,
161
+ output_hidden_states=output_hidden_states,
162
+ return_dict=return_dict,
163
+ )
164
+ logits = outputs.logits
165
+
166
+ loss = None
167
+ if labels is not None:
168
+ # Shift so that tokens < n predict n
169
+ shift_logits = logits[..., :-1, :].contiguous()
170
+ shift_labels = labels[..., 1:].contiguous()
171
+ # Flatten the tokens
172
+ loss_fct = CrossEntropyLoss()
173
+ shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
174
+ shift_labels = shift_labels.view(-1)
175
+ # Enable model parallelism
176
+ shift_labels = shift_labels.to(shift_logits.device)
177
+ loss = loss_fct(shift_logits, shift_labels)
178
+
179
+ if not return_dict:
180
+ output = (logits,) + outputs[1:]
181
+ return (loss,) + output if loss is not None else output
182
+
183
+ return CausalLMOutputWithPast(
184
+ loss=loss,
185
+ logits=logits,
186
+ past_key_values=outputs.past_key_values,
187
+ hidden_states=outputs.hidden_states,
188
+ attentions=outputs.attentions,
189
+ )
190
+
191
+ def pixel_shuffle(self, x, scale_factor=0.5):
192
+ n, w, h, c = x.size()
193
+ # N, W, H, C --> N, W, H * scale, C // scale
194
+ x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
195
+ # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
196
+ x = x.permute(0, 2, 1, 3).contiguous()
197
+ # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
198
+ x = x.view(n, int(h * scale_factor), int(w * scale_factor),
199
+ int(c / (scale_factor * scale_factor)))
200
+ if self.ps_version == 'v1':
201
+ warnings.warn("In ps_version 'v1', the height and width have not been swapped back, "
202
+ 'which results in a transposed image.')
203
+ else:
204
+ x = x.permute(0, 2, 1, 3).contiguous()
205
+ return x
206
+
207
+ def extract_feature(self, pixel_values):
208
+ vit_embeds = self.vision_model(pixel_values).features
209
+ vit_embeds = vit_embeds.to(dtype=torch.bfloat16)
210
+ h = w = int(vit_embeds.shape[1] ** 0.5)
211
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
212
+ vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
213
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
214
+ vit_embeds = self.mlp1(vit_embeds)
215
+ return vit_embeds
216
+
217
+ @torch.no_grad()
218
+ def generate(
219
+ self,
220
+ pixel_values: Optional[torch.FloatTensor] = None,
221
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
222
+ input_ids: Optional[torch.FloatTensor] = None,
223
+ attention_mask: Optional[torch.LongTensor] = None,
224
+ generation_config: Optional[GenerationConfig] = None,
225
+ output_hidden_states: Optional[bool] = None,
226
+ return_dict: Optional[bool] = None,
227
+ **generate_kwargs,
228
+ ) -> torch.LongTensor:
229
+ assert self.img_context_token_id is not None
230
+ if pixel_values is not None or pixel_values_videos is not None:
231
+ image_vit_embeds, video_vit_embeds = None, None
232
+ if pixel_values is not None:
233
+ pixel_values = pixel_values.to(dtype=self.vision_model.config.torch_dtype)
234
+ image_vit_embeds = self.extract_feature(pixel_values)
235
+ if pixel_values_videos is not None:
236
+ pixel_values_videos = pixel_values_videos.to(dtype=self.vision_model.config.torch_dtype)
237
+ video_vit_embeds = self.extract_feature(pixel_values_videos)
238
+ inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
239
+ B, N, C = inputs_embeds.shape
240
+ inputs_embeds = inputs_embeds.reshape(B * N, C)
241
+ input_ids_copy = input_ids.reshape(B * N)
242
+ if image_vit_embeds is not None:
243
+ image_mask = (input_ids_copy == self.img_context_token_id)
244
+ assert image_mask.sum() != 0
245
+ inputs_embeds[image_mask] = image_vit_embeds.reshape(-1, C).to(inputs_embeds.device, inputs_embeds.dtype)
246
+ if video_vit_embeds is not None:
247
+ if B > 1:
248
+ raise NotImplementedError("Video is not supported for batch size > 1")
249
+ video_mask = (input_ids_copy == self.video_context_token_id)
250
+ assert video_mask.sum() != 0
251
+ inputs_embeds[video_mask] = video_vit_embeds.reshape(-1, C).to(inputs_embeds.device, inputs_embeds.dtype)
252
+ if video_vit_embeds is not None and self.video_pruning_rate > 0: # EVS
253
+ h = w = int(video_vit_embeds.shape[1] ** 0.5) # assumption here (and everywhere else) is that shape is square
254
+ evs_mask = EfficientVideoSampling.compute_retention_mask(
255
+ video_embeds=video_vit_embeds,
256
+ thw=(video_vit_embeds.shape[0], h, w),
257
+ spatial_merge_size=1, # we already work on vision embeddings, so no downsampling to follow
258
+ q=self.video_pruning_rate,
259
+ )
260
+ print(f"pruning rate: {self.video_pruning_rate}, EVS mask: {evs_mask.sum().item()} tokens retained out of {evs_mask.numel()} total video tokens ({evs_mask.sum().item() / evs_mask.numel() * 100:.2f}%)")
261
+
262
+ retention_mask = torch.ones_like(input_ids_copy, dtype=torch.bool)
263
+ retention_mask[video_mask] = evs_mask.view(-1)
264
+ inputs_embeds = inputs_embeds[retention_mask].unsqueeze(0) # adding batch=1
265
+ if attention_mask is not None:
266
+ attention_mask = attention_mask[:, retention_mask].contiguous()
267
+ if input_ids is not None:
268
+ input_ids = input_ids[:, retention_mask].contiguous()
269
+ else:
270
+ inputs_embeds = inputs_embeds.reshape(B, N, C)
271
+ else:
272
+ inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
273
+ # print(f"DEBUG: input_ids shape: {input_ids.shape}")
274
+ # print(f"DEBUG: input text: {self._tokenizer.decode(input_ids[0])}")
275
+ outputs = self.language_model.generate(
276
+ input_ids=input_ids,
277
+ inputs_embeds=inputs_embeds,
278
+ attention_mask=attention_mask,
279
+ generation_config=generation_config,
280
+ output_hidden_states=output_hidden_states,
281
+ use_cache=True,
282
+ # return_dict_in_generate=True,
283
+ # output_scores=True,
284
+ **generate_kwargs,
285
+ )
286
+
287
+ return outputs
modeling_nemotron_h.py ADDED
@@ -0,0 +1,1636 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 HuggingFace Inc. team.
3
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NemotronH model."""
17
+
18
+ import math
19
+ from dataclasses import dataclass
20
+ from typing import Any, Dict, Optional, Tuple, Union
21
+
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+ from torch.nn import CrossEntropyLoss
26
+
27
+ from transformers.activations import ACT2FN
28
+ from transformers.cache_utils import DynamicCache # we need __iter__ and __len__ of pkv
29
+ from transformers.generation import GenerationMixin
30
+ from transformers.modeling_attn_mask_utils import (
31
+ AttentionMaskConverter,
32
+ )
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.utils import (
35
+ ModelOutput,
36
+ add_code_sample_docstrings,
37
+ add_start_docstrings,
38
+ add_start_docstrings_to_model_forward,
39
+ logging,
40
+ )
41
+ from transformers.utils.import_utils import (
42
+ is_causal_conv1d_available,
43
+ is_flash_attn_2_available,
44
+ is_flash_attn_greater_or_equal_2_10,
45
+ is_mamba_2_ssm_available,
46
+ )
47
+ from .configuration_nemotron_h import NemotronHConfig
48
+
49
+
50
+ logger = logging.get_logger(__name__)
51
+
52
+
53
+ # Copied from transformers.models.mamba.modeling_mamba2.modeling_mamba2.py with MAMBA2->NEMOTRONH,Mamba2->NemotronH
54
+ # For Mamba2 components Mamba2->NemotronHMamba2
55
+ if is_mamba_2_ssm_available():
56
+ from mamba_ssm.ops.triton.selective_state_update import selective_state_update
57
+ from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined, mamba_split_conv1d_scan_combined
58
+ else:
59
+ mamba_chunk_scan_combined, mamba_split_conv1d_scan_combined, selective_state_update = None, None, None
60
+
61
+ try:
62
+ #from mamba_ssm.ops.triton.layernorm_gated import RMSNorm as RMSNormGated
63
+ from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn
64
+ except ImportError:
65
+ raise ImportError("mamba-ssm is required by the Mamba model but cannot be imported")
66
+
67
+ if is_causal_conv1d_available():
68
+ from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
69
+ else:
70
+ causal_conv1d_update, causal_conv1d_fn = None, None
71
+
72
+ if is_flash_attn_2_available():
73
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
74
+
75
+ is_fast_path_available = all(
76
+ (
77
+ selective_state_update,
78
+ mamba_chunk_scan_combined,
79
+ mamba_split_conv1d_scan_combined,
80
+ causal_conv1d_fn,
81
+ causal_conv1d_update,
82
+ )
83
+ )
84
+
85
+
86
+ _CHECKPOINT_FOR_DOC = "nvidia/Nemotron-H-56B-Base-8K"
87
+ _CONFIG_FOR_DOC = "NemotronHConfig"
88
+
89
+
90
+ # Helper methods for segment sum computation
91
+
92
+
93
+ def pad_tensor_by_size(input_tensor: torch.Tensor, pad_size: int):
94
+ """
95
+ Padding x tensor with `pad_size` on the seq_len dim (dim=1)
96
+
97
+ Assumes that we only have tensors of either size 4 or 3
98
+ """
99
+ pad_shape = (0, 0, 0, 0, 0, pad_size, 0, 0) if len(input_tensor.shape) == 4 else (0, 0, 0, pad_size, 0, 0)
100
+
101
+ return torch.nn.functional.pad(input_tensor, pad_shape, mode="constant", value=0)
102
+
103
+
104
+ def reshape_into_chunks(input_tensor, pad_size, chunk_size):
105
+ """
106
+ Padding input_tensor with `pad_size` on the seq_len dim (dim=1) and
107
+ simultaneously splitting it into chunk sequences.
108
+
109
+ Assumes that we only have tensors of either size 4 or 3
110
+ """
111
+ # [bsz, seq_len, ...] -> [bsz, seq_len multiple of chunk_size, ...]
112
+ input_tensor = pad_tensor_by_size(input_tensor, pad_size)
113
+
114
+ if len(input_tensor.shape) == 3:
115
+ # [bsz, seq_len multiple of chunk_size, num_heads] -> [bsz, -1, chunk_size, num_heads]
116
+ return input_tensor.reshape(input_tensor.shape[0], -1, chunk_size, input_tensor.shape[2])
117
+ else:
118
+ # [bsz, seq_len multiple of chunk_size, num_heads, head_dim or state_size] -> [bsz, -1, chunk_size, num_heads, head_dim or state_size]
119
+ return input_tensor.reshape(
120
+ input_tensor.shape[0], -1, chunk_size, input_tensor.shape[2], input_tensor.shape[3]
121
+ )
122
+
123
+
124
+ def segment_sum(input_tensor):
125
+ """
126
+ More stable segment sum calculation. Uses cumulative sums and masking instead of direct subtractions.
127
+ """
128
+ chunk_size = input_tensor.size(-1)
129
+ # 1. expand input tensor to have an additional dimension and repeat along that dimension
130
+ # [..., chunk_size] -> [..., chunk_size, chunk_size]
131
+ input_tensor = input_tensor[..., None].expand(*input_tensor.size(), chunk_size)
132
+ # 2. create a lower triangular mask with the diagonal set to 0 to 0 out elements above diag
133
+ mask = torch.tril(torch.ones(chunk_size, chunk_size, device=input_tensor.device, dtype=torch.bool), diagonal=-1)
134
+ input_tensor = input_tensor.masked_fill(~mask, 0)
135
+ # 3. compute actual cumsum
136
+ tensor_segsum = torch.cumsum(input_tensor, dim=-2)
137
+
138
+ # 4. apply mask to keep only the lower triangular part of the cumulative sum result (incl diagonal this time)
139
+ mask = torch.tril(torch.ones(chunk_size, chunk_size, device=input_tensor.device, dtype=torch.bool), diagonal=0)
140
+ tensor_segsum = tensor_segsum.masked_fill(~mask, -torch.inf)
141
+ return tensor_segsum
142
+
143
+
144
+ def apply_mask_to_padding_states(hidden_states, attention_mask):
145
+ """
146
+ Tunes out the hidden states for padding tokens, see https://github.com/state-spaces/mamba/issues/66
147
+ """
148
+ if attention_mask is not None and attention_mask.shape[1] > 1 and attention_mask.shape[0] > 1:
149
+ dtype = hidden_states.dtype
150
+ hidden_states = (hidden_states * attention_mask[:, :, None]).to(dtype)
151
+
152
+ return hidden_states
153
+
154
+ # Copied from https://github.com/huggingface/transformers/blob/main/src/transformers/models/jamba/modeling_jamba.py
155
+ class HybridMambaAttentionDynamicCache(DynamicCache):
156
+ """
157
+ A dynamic cache that can handle both the attention cache (which has a seq_len dimension) and the mamba cache
158
+ (which has a constant shape regardless of seq_len).
159
+
160
+ This cache has two sets of lists of tensors: `key_cache` and `value_cache` for attention cache and `conv_states`
161
+ and `ssm_states` for mamba cache. Each of these lists has `num_layers` tensors. The expected shape for each tensor
162
+ For attention layers, `key_cache` and `value_cache` have a shape of `(batch_size, num_heads, seq_len, head_dim)`,
163
+ while `conv_states` and `ssm_states` have a shape of `(batch_size, 0)` (empty tensors).
164
+ For mamba layers, `key_cache` and `value_cache` have a shape of `(batch_size, 0)` (empty tensors),
165
+ while `conv_states` represents the convolution state and has a shape of `(batch_size, d_inner, d_conv)`,
166
+ and `ssm_states` represents the ssm state and has a shape of `(batch_size, d_inner, d_state)`.
167
+ """
168
+
169
+ def __init__(self, config, batch_size, dtype=torch.float16, device=None):
170
+ super().__init__()
171
+ self.dtype = dtype
172
+ self.hybrid_override_pattern = config.hybrid_override_pattern
173
+ self.has_previous_state = False # only used by mamba
174
+ #intermediate_size = config.expand * config.hidden_size
175
+ intermediate_size = config.mamba_num_heads * config.mamba_head_dim
176
+ ssm_state_size = config.ssm_state_size
177
+ conv_kernel_size = config.conv_kernel
178
+ self.conv_states = []
179
+ self.ssm_states = []
180
+ self.transformer_layers = []
181
+ for i in range(config.num_hidden_layers):
182
+ if self.hybrid_override_pattern[i] == "M":
183
+ # Mamba layer
184
+ self.conv_states += [
185
+ torch.zeros(batch_size, intermediate_size, conv_kernel_size, device=device, dtype=dtype)
186
+ ]
187
+ self.ssm_states += [
188
+ torch.zeros(batch_size, intermediate_size, ssm_state_size, device=device, dtype=torch.float32)
189
+ ]
190
+ else:
191
+ # Attention or MLP layer
192
+ self.conv_states += [torch.tensor([[]] * batch_size, device=device)]
193
+ self.ssm_states += [torch.tensor([[]] * batch_size, device=device)]
194
+ self.transformer_layers.append(i)
195
+
196
+ self.key_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]
197
+ self.value_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]
198
+
199
+ def update(
200
+ self,
201
+ key_states: torch.Tensor,
202
+ value_states: torch.Tensor,
203
+ layer_idx: int,
204
+ cache_kwargs: Optional[Dict[str, Any]] = None,
205
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
206
+ # Update the cache
207
+ if self.key_cache[layer_idx].shape[-1] == 0:
208
+ self.key_cache[layer_idx] = key_states
209
+ self.value_cache[layer_idx] = value_states
210
+ else:
211
+ self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=2)
212
+ self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=2)
213
+
214
+ return self.key_cache[layer_idx], self.value_cache[layer_idx]
215
+
216
+ def reorder_cache(self, beam_idx: torch.LongTensor):
217
+ """Reorders the cache for beam search, given the selected beam indices."""
218
+ for layer_idx in range(len(self.key_cache)):
219
+ device = self.key_cache[layer_idx].device
220
+ self.key_cache[layer_idx] = self.key_cache[layer_idx].index_select(0, beam_idx.to(device))
221
+ device = self.value_cache[layer_idx].device
222
+ self.value_cache[layer_idx] = self.value_cache[layer_idx].index_select(0, beam_idx.to(device))
223
+
224
+ device = self.conv_states[layer_idx].device
225
+ self.conv_states[layer_idx] = self.conv_states[layer_idx].index_select(0, beam_idx.to(device))
226
+ device = self.ssm_states[layer_idx].device
227
+ self.ssm_states[layer_idx] = self.ssm_states[layer_idx].index_select(0, beam_idx.to(device))
228
+
229
+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
230
+ """Returns the sequence length of the cached states. A layer index can be optionally passed."""
231
+ # fall back to a layer that actually holds an attention cache (mamba layers only store empty tensors here)
232
+ layer_idx = self.transformer_layers[0] if layer_idx not in self.transformer_layers else layer_idx
233
+ if len(self.key_cache) <= layer_idx:
234
+ return 0
235
+ return self.key_cache[layer_idx].shape[-2]
236
+
237
+ def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
238
+ raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")
239
+
240
+ @classmethod
241
+ def from_legacy_cache(cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None) -> "DynamicCache":
242
+ raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")
243
+
244
+ # Copied from modeling_mamba2.py
245
+ def update_conv_state(
246
+ self, layer_idx: int, new_conv_state: torch.Tensor, cache_init: bool = False
247
+ ) -> torch.Tensor:
248
+ if cache_init:
249
+ self.conv_states[layer_idx] = new_conv_state.to(self.conv_states[layer_idx].device)
250
+ else:
251
+ self.conv_states[layer_idx] = self.conv_states[layer_idx].roll(shifts=-1, dims=-1)
252
+ self.conv_states[layer_idx][:, :, -1] = new_conv_state[:, 0, :].to(self.conv_states[layer_idx].device)
253
+ return self.conv_states[layer_idx]
254
+
255
+ def update_ssm_state(self, layer_idx: int, new_ssm_state: torch.Tensor):
256
+ self.ssm_states[layer_idx] = new_ssm_state.to(self.ssm_states[layer_idx].device)
257
+ return self.ssm_states[layer_idx]
258
+
259
+ def reset(self):
260
+ for state in self.conv_states + self.ssm_states:  # the caches are lists of per-layer tensors, so zero each one
261
+ state.zero_()
262
+
263
+ class MambaRMSNormGated(torch.nn.Module):
264
+ def __init__(self, hidden_size, group_size, eps=1e-5):
265
+ super().__init__()
266
+ self.weight = nn.Parameter(torch.ones(hidden_size))
267
+ self.variance_epsilon = eps
268
+ self.group_size = group_size
269
+
270
+ # jan28b version
271
+ def forward(self, hidden_states, gate=None):
272
+ return rmsnorm_fn(x=hidden_states,
273
+ weight=self.weight,
274
+ bias=None, # No bias
275
+ z=gate,
276
+ eps=self.variance_epsilon,
277
+ group_size=self.group_size,
278
+ norm_before_gate=False
279
+ )
280
+
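+ # A rough pure-PyTorch reference for the fused `rmsnorm_fn` call above (a sketch, assuming the
+ # norm_before_gate=False semantics of mamba-ssm's gated RMSNorm: gate with SiLU first, then RMS-normalize
+ # each contiguous group of `group_size` channels). Handy for eyeballing outputs without the Triton kernel.
+ def gated_rmsnorm_reference(x, weight, z, eps, group_size):
+     orig_dtype = x.dtype
+     x = x.float() * nn.functional.silu(z.float())  # apply the gate before normalizing
+     grouped = x.reshape(*x.shape[:-1], -1, group_size)  # (..., n_groups, group_size)
+     grouped = grouped * torch.rsqrt(grouped.pow(2).mean(-1, keepdim=True) + eps)
+     return (grouped.reshape(x.shape) * weight.float()).to(orig_dtype)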
281
+ class NemotronHMamba2Mixer(nn.Module):
282
+ """
283
+ Compute ∆, A, B, C, and D the state space parameters and compute the `contextualized_states`.
284
+ A, D are input independent (see Mamba paper [1] Section 3.5.2 "Interpretation of A" for why A isn't selective)
285
+ ∆, B, C are input-dependent (this is a key difference between Mamba and the linear time invariant S4,
286
+ and is why Mamba is called **selective** state spaces)
287
+ """
288
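+ # Schematically, per head and per timestep t (this is the recurrence the cached single-step paths below compute):
+ #     h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t     # A is a learned scalar per head (stored as A_log)
+ #     y_t = C_t * h_t + D * x_t                            # D provides a skip connection from the input
+ # dt_t, B_t and C_t all come out of `in_proj(hidden_states)`, which is what makes the SSM selective.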
+
289
+ def __init__(self, config: NemotronHConfig, layer_idx: int):
290
+ super().__init__()
291
+ self.num_heads = config.mamba_num_heads
292
+ self.hidden_size = config.hidden_size
293
+ self.ssm_state_size = config.ssm_state_size
294
+ self.conv_kernel_size = config.conv_kernel
295
+ self.intermediate_size = config.mamba_num_heads * config.mamba_head_dim
296
+ self.layer_idx = layer_idx
297
+ self.use_conv_bias = config.use_conv_bias
298
+ self.activation = config.mamba_hidden_act
299
+ self.act = ACT2FN[config.mamba_hidden_act]
300
+
301
+ self.layer_norm_epsilon = config.layer_norm_epsilon
302
+
303
+ self.n_groups = config.n_groups
304
+ self.head_dim = config.mamba_head_dim
305
+ self.chunk_size = config.chunk_size
306
+
307
+ self.time_step_limit = config.time_step_limit
308
+ self.time_step_min = config.time_step_min
309
+ self.time_step_max = config.time_step_max
310
+
311
+ self.conv_dim = self.intermediate_size + 2 * self.n_groups * self.ssm_state_size
312
+ self.conv1d = nn.Conv1d(
313
+ in_channels=self.conv_dim,
314
+ out_channels=self.conv_dim,
315
+ bias=config.use_conv_bias,
316
+ kernel_size=config.conv_kernel,
317
+ groups=self.conv_dim,
318
+ padding=config.conv_kernel - 1,
319
+ )
320
+
321
+ # projection of the input hidden states
322
+ projection_size = self.intermediate_size + self.conv_dim + self.num_heads
323
+ self.in_proj = nn.Linear(
324
+ self.hidden_size,
325
+ projection_size,
326
+ bias=config.use_bias,
327
+ )
328
+ # selective projection used to make dt, B and C input-dependent
329
+
330
+ # time step projection (discretization)
331
+ # instantiate once and copy inv_dt in init_weights of PretrainedModel
332
+ self.dt_bias = nn.Parameter(torch.ones(self.num_heads))
333
+
334
+ # S4D real initialization. These are not discretized!
335
+ # The core is to load them, compute the discrete states, then write the updated state. Keeps the memory bounded
336
+ A = torch.arange(1, self.num_heads + 1)
337
+ self.A_log = nn.Parameter(torch.log(A))
338
+ self.A_log._no_weight_decay = True
339
+ self.norm = MambaRMSNormGated(self.intermediate_size, eps=self.layer_norm_epsilon, group_size=self.intermediate_size // self.n_groups)
340
+ self.D = nn.Parameter(torch.ones(self.num_heads))
341
+ self.D._no_weight_decay = True
342
+
343
+ self.out_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.use_bias)
344
+ self.use_bias = config.use_bias
345
+
346
+ if not is_fast_path_available:
347
+ logger.warning_once(
348
+ "The fast path is not available because on of `(selective_state_update, causal_conv1d_fn, causal_conv1d_update)`"
349
+ " is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation and"
350
+ " https://github.com/Dao-AILab/causal-conv1d"
351
+ )
352
+
353
+ def cuda_kernels_forward(
354
+ self,
355
+ hidden_states: torch.Tensor,
356
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
357
+ cache_position: Optional[torch.LongTensor] = None,
358
+ attention_mask: Optional[torch.Tensor] = None,
359
+ ):
360
+ # 1. Gated MLP's linear projection
361
+ hidden_states = apply_mask_to_padding_states(hidden_states, attention_mask)
362
+ projected_states = self.in_proj(hidden_states)
363
+
364
+ # Set up dimensions for reshapes later
365
+ batch_size, seq_len, _ = hidden_states.shape
366
+ groups_time_state_size = self.n_groups * self.ssm_state_size
367
+ d_mlp = (
368
+ projected_states.shape[-1]
369
+ - 2 * self.intermediate_size
370
+ - 2 * self.n_groups * self.ssm_state_size
371
+ - self.num_heads
372
+ ) // 2
373
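+ # Note: with `projection_size = intermediate_size + conv_dim + num_heads` (see __init__), `d_mlp` works out
+ # to 0 for this layer, so the two leading chunks of the splits below are empty; the split layout is seemingly
+ # kept as-is for parity with the upstream Mamba2 code.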
+
374
+ # Single step calculations via cache
375
+ if cache_params is not None and cache_position is not None and cache_position[0] > 0:
376
+ _, _, gate, hidden_states_B_C, dt = projected_states.squeeze(1).split(
377
+ [d_mlp, d_mlp, self.intermediate_size, self.conv_dim, self.num_heads], dim=-1
378
+ )
379
+
380
+ # 2. Convolution sequence transformation
381
+ hidden_states_B_C = causal_conv1d_update(
382
+ hidden_states_B_C,
383
+ cache_params.conv_states[self.layer_idx],
384
+ self.conv1d.weight.squeeze(1),
385
+ self.conv1d.bias,
386
+ self.activation,
387
+ )
388
+
389
+ hidden_states, B, C = torch.split(
390
+ hidden_states_B_C,
391
+ [self.intermediate_size, groups_time_state_size, groups_time_state_size],
392
+ dim=-1,
393
+ )
394
+
395
+ # 3. SSM transformation
396
+ A = -torch.exp(self.A_log.float()) # (nheads,)
397
+ A = A[:, None, ...][:, :, None].expand(-1, self.head_dim, self.ssm_state_size).to(dtype=torch.float32)
398
+ dt = dt[:, :, None].expand(-1, -1, self.head_dim)
399
+ dt_bias = self.dt_bias[:, None, ...].expand(-1, self.head_dim)
400
+ D = self.D[:, None, ...].expand(-1, self.head_dim)
401
+ B = B.view(batch_size, self.n_groups, B.shape[1] // self.n_groups)
402
+ C = C.view(batch_size, self.n_groups, C.shape[1] // self.n_groups)
403
+ hidden_states_reshaped = hidden_states.view(batch_size, self.num_heads, self.head_dim)
404
+ hidden_states = selective_state_update(
405
+ cache_params.ssm_states[self.layer_idx],
406
+ hidden_states_reshaped,
407
+ dt,
408
+ A,
409
+ B,
410
+ C,
411
+ D,
412
+ z=None,
413
+ dt_bias=dt_bias,
414
+ dt_softplus=True,
415
+ )
416
+ hidden_states = hidden_states.view(batch_size, self.num_heads * self.head_dim)
417
+ hidden_states = self.norm(hidden_states, gate)
418
+
419
+ # 4. Final linear projection
420
+ out = self.out_proj(hidden_states)[:, None, ...]
421
+
422
+ # Fused calculations or step by step if no initialized cache is found
423
+ else:
424
+ A = -torch.exp(self.A_log.float()) # (num_heads) or (intermediate_size, state_size)
425
+ dt_limit_kwargs = {} if self.time_step_limit == (0.0, float("inf")) else {"dt_limit": self.time_step_limit}
426
+
427
+ # 2-4. Fused kernel for conv1d, SSM, and the final projection
428
+ if self.training and cache_params is None:
429
+ out = mamba_split_conv1d_scan_combined(
430
+ projected_states,
431
+ self.conv1d.weight.squeeze(1),
432
+ self.conv1d.bias,
433
+ self.dt_bias,
434
+ A,
435
+ D=self.D,
436
+ chunk_size=self.chunk_size,
437
+ seq_idx=None, # was seq_idx
438
+ activation=self.activation,
439
+ rmsnorm_weight=self.norm.weight,
440
+ rmsnorm_eps=self.norm.variance_epsilon,
441
+ outproj_weight=self.out_proj.weight,
442
+ outproj_bias=self.out_proj.bias,
443
+ headdim=self.head_dim,
444
+ ngroups=self.n_groups,
445
+ norm_before_gate=False,
446
+ return_final_states=False,
447
+ **dt_limit_kwargs,
448
+ )
449
+
450
+ else:
451
+ _, _, gate, hidden_states_B_C, dt = projected_states.split(
452
+ [d_mlp, d_mlp, self.intermediate_size, self.conv_dim, self.num_heads], dim=-1
453
+ )
454
+
455
+ # 2. Convolution sequence transformation
456
+ # Init cache
457
+ if cache_params is not None:
458
+ hidden_states_B_C_transposed = hidden_states_B_C.transpose(1, 2)
459
+ conv_states = nn.functional.pad(
460
+ hidden_states_B_C_transposed,
461
+ (cache_params.conv_kernel_size - hidden_states_B_C_transposed.shape[-1], 0),
462
+ )
463
+ cache_params.update_conv_state(
464
+ layer_idx=self.layer_idx, new_conv_state=conv_states, cache_init=True
465
+ )
466
+
467
+ if self.activation not in ["silu", "swish"]:
468
+ hidden_states_B_C = self.act(
469
+ self.conv1d(hidden_states_B_C.transpose(1, 2))[..., :seq_len].transpose(1, 2)
470
+ )
471
+ else:
472
+ hidden_states_B_C = causal_conv1d_fn(
473
+ x=hidden_states_B_C.transpose(1, 2),
474
+ weight=self.conv1d.weight.squeeze(1),
475
+ bias=self.conv1d.bias,
476
+ activation=self.activation,
477
+ ).transpose(1, 2)
478
+ hidden_states_B_C = apply_mask_to_padding_states(hidden_states_B_C, attention_mask)
479
+ hidden_states, B, C = torch.split(
480
+ hidden_states_B_C,
481
+ [self.intermediate_size, groups_time_state_size, groups_time_state_size],
482
+ dim=-1,
483
+ )
484
+
485
+ # 3. SSM transformation
486
+ scan_output, ssm_state = mamba_chunk_scan_combined(
487
+ hidden_states.view(batch_size, seq_len, -1, self.head_dim),
488
+ dt,
489
+ A,
490
+ B.view(batch_size, seq_len, self.n_groups, -1),
491
+ C.view(batch_size, seq_len, self.n_groups, -1),
492
+ chunk_size=self.chunk_size,
493
+ D=self.D,
494
+ z=None,
495
+ seq_idx=None,
496
+ return_final_states=True,
497
+ dt_bias=self.dt_bias,
498
+ dt_softplus=True,
499
+ **dt_limit_kwargs,
500
+ )
501
+
502
+ # Init cache
503
+ if ssm_state is not None and cache_params is not None:
504
+ cache_params.update_ssm_state(layer_idx=self.layer_idx, new_ssm_state=ssm_state)
505
+
506
+ scan_output = scan_output.view(batch_size, seq_len, -1)
507
+
508
+ # Multiply "gate" branch and apply extra normalization layer
509
+ scan_output = self.norm(scan_output, gate)
510
+
511
+ # 4. Final linear projection
512
+ out = self.out_proj(scan_output)
513
+ return out
514
+
515
+ # fmt: off
516
+ def torch_forward(self, input_states, cache_params: Optional[HybridMambaAttentionDynamicCache]=None, cache_position:Optional[torch.LongTensor]=None, attention_mask: Optional[torch.Tensor]=None):
517
+ batch_size, seq_len, _ = input_states.shape
518
+ dtype = input_states.dtype
519
+
520
+ # 1. Gated MLP's linear projection
521
+ input_states = apply_mask_to_padding_states(input_states, attention_mask)
522
+ projected_states = self.in_proj(input_states)
523
+ d_mlp = (projected_states.shape[-1] - 2 * self.intermediate_size - 2 * self.n_groups * self.ssm_state_size - self.num_heads) // 2
524
+ _, _, gate, hidden_states_B_C, dt = projected_states.split(
525
+ [d_mlp, d_mlp, self.intermediate_size, self.conv_dim, self.num_heads], dim=-1
526
+ )
527
+
528
+ # 2. Convolution sequence transformation
529
+ if cache_params is not None and cache_position is not None and cache_position[0] > 0:
530
+ cache_params.update_conv_state(layer_idx=self.layer_idx, new_conv_state=hidden_states_B_C, cache_init=False)
531
+
532
+ # We need to guarantee that anything regarding the cache is on the same device
533
+ conv_states = cache_params.conv_states[self.layer_idx].to(device=self.conv1d.weight.device)
534
+
535
+ hidden_states_B_C = torch.sum(
536
+ conv_states * self.conv1d.weight.squeeze(1), dim=-1
537
+ )
538
+ if self.use_conv_bias:
539
+ hidden_states_B_C = hidden_states_B_C + self.conv1d.bias
540
+ hidden_states_B_C = self.act(hidden_states_B_C)
541
+ else:
542
+ # Init cache
543
+ if cache_params is not None:
544
+ hidden_states_B_C_transposed = hidden_states_B_C.transpose(1, 2)
545
+ conv_states = nn.functional.pad(
546
+ hidden_states_B_C_transposed, (cache_params.conv_kernel_size - hidden_states_B_C_transposed.shape[-1], 0)
547
+ )
548
+ cache_params.update_conv_state(layer_idx=self.layer_idx, new_conv_state=conv_states, cache_init=True)
549
+
550
+ hidden_states_B_C = self.act(self.conv1d(hidden_states_B_C.transpose(1, 2))[..., :seq_len].transpose(1, 2))
551
+
552
+ hidden_states_B_C = apply_mask_to_padding_states(hidden_states_B_C, attention_mask)
553
+ hidden_states, B, C = torch.split(
554
+ hidden_states_B_C,
555
+ [self.intermediate_size, self.n_groups * self.ssm_state_size, self.n_groups * self.ssm_state_size],
556
+ dim=-1
557
+ )
558
+
559
+ # 3. SSM transformation
560
+ A = -torch.exp(self.A_log.float()) # [num_heads]
561
+ if cache_params is not None and cache_position is not None and cache_position[0] > 0:
562
+ # We need to guarantee that anything regarding the cache is on the same device
563
+ cache_device = cache_params.ssm_states[self.layer_idx].device
564
+
565
+ # Note: there is no need to pad parameter matrices here, as there is just one new token
566
+ # for batched generation
567
+ dt = dt[:, 0, :][:, None, ...]
568
+ dt = dt.transpose(1, 2).expand(batch_size, dt.shape[-1], self.head_dim)
569
+ # [num_heads] -> [num_heads, head_dim]
570
+ dt_bias = self.dt_bias[..., None].expand(self.dt_bias.shape[0], self.head_dim)
571
+
572
+ dt = torch.nn.functional.softplus(dt + dt_bias.to(dt.dtype))
573
+ dt = torch.clamp(dt, self.time_step_limit[0], self.time_step_limit[1])
574
+ A = A[..., None, None].expand(self.num_heads, self.head_dim, self.ssm_state_size).to(dtype=torch.float32)
575
+ # [bsz, num_heads, head_dim, state_size]
576
+ dA = (torch.exp(dt[..., None] * A)).to(device=cache_device)
577
+
578
+ # Discretize B
579
+ # [bsz, n_groups * state_size] -> [bsz, n_groups, 1, state_size] ->
580
+ # -> [bsz, n_groups, group to head repetition factor, state_size] -> [bsz, num_heads, state_size]
581
+ B = B.reshape(batch_size, self.n_groups, -1)[..., None, :]
582
+ B = B.expand(batch_size, self.n_groups, self.num_heads // self.n_groups, B.shape[-1]).contiguous()
583
+ B = B.reshape(batch_size, -1, B.shape[-1])
584
+ # [bsz, num_heads, head_dim, state_size]
585
+ dB = dt[..., None] * B[..., None, :]
586
+
587
+ # Discretize x into dB
588
+ # [bsz, intermediate_size] -> [bsz, num_heads, head_dim]
589
+ hidden_states = hidden_states.reshape(batch_size, -1, self.head_dim)
590
+ dBx = (dB * hidden_states[..., None]).to(device=cache_device)
591
+
592
+ # State calculation
593
+ cache_params.update_ssm_state(
594
+ layer_idx=self.layer_idx,
595
+ new_ssm_state=cache_params.ssm_states[self.layer_idx] * dA + dBx
596
+ )
597
+
598
+ # Subsequent output
599
+ # [bsz, n_groups * state_size] -> [bsz, num_heads, state_size]
600
+ C = C.reshape(batch_size, self.n_groups, -1)[..., None, :]
601
+ C = C.expand(batch_size, self.n_groups, self.num_heads // self.n_groups, C.shape[-1]).contiguous()
602
+ C = C.reshape(batch_size, -1, C.shape[-1])
603
+ # [bsz, num_heads, head_dim]
604
+
605
+ ssm_states = cache_params.ssm_states[self.layer_idx].to(device=C.device, dtype=C.dtype) # Shape: [b, h, d, n]
606
+ # Reshape ssm_states to merge the first two dimensions
607
+ ssm_states_reshaped = ssm_states.view(batch_size * self.num_heads, self.head_dim, self.ssm_state_size) # Shape: [b*h, d, n]
608
+ C_reshaped = C.view(batch_size * self.num_heads, self.ssm_state_size, 1) # Shape: [b*h, n, 1]
609
+ y = torch.bmm(ssm_states_reshaped, C_reshaped)
610
+ y = y.view(batch_size, self.num_heads, self.head_dim)
611
+
612
+ # D skip connection
613
+ # [num_heads] -> [num_heads, head_dim]
614
+ D = self.D[..., None].expand(self.D.shape[0], self.head_dim)
615
+ y = (y + hidden_states * D).to(y.dtype)
616
+
617
+ # [bsz, num_heads, head_dim] -> [bsz, 1, intermediate_size]
618
+ y = y.reshape(batch_size, -1)[:, None, ...]
619
+ else:
620
+ # begin ssd naive implementation without einsums
621
+ dt = nn.functional.softplus(dt + self.dt_bias)
622
+ dt = torch.clamp(dt, self.time_step_limit[0], self.time_step_limit[1])
623
+ hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
624
+ B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
625
+ C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
626
+ B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
627
+ C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
628
+ pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
629
+
630
+ D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
631
+
632
+ # Discretize x and A
633
+ hidden_states = hidden_states * dt[..., None]
634
+ A = A.to(hidden_states.dtype) * dt
635
+
636
+ # Rearrange into blocks/chunks
637
+ hidden_states, A, B, C = [reshape_into_chunks(t, pad_size, self.chunk_size) for t in (hidden_states, A, B, C)]
638
+
639
+ # [bsz, -1, chunk_size, num_heads] -> [bsz, num_heads, -1, chunk_size]
640
+ A = A.permute(0, 3, 1, 2)
641
+ A_cumsum = torch.cumsum(A, dim=-1)
642
+
643
+ # 1. Compute the output for each intra-chunk (diagonal blocks)
644
+ # This is the analog of a causal mask
645
+ L = torch.exp(segment_sum(A))
646
+
647
+ # Contraction of C and B to get G (attention-weights like)
648
+ G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :] # shape: (b, c, l, s, h, n)
649
+ G = G_intermediate.sum(dim=-1) # shape: (b, c, l, s, h)
650
+
651
+ # Compute M, equivalent to applying attention mask to weights
652
+ M_intermediate = G[..., None] * L.permute(0, 2, 3, 4, 1)[..., None]
653
+ M = M_intermediate.sum(dim=-1)
654
+
655
+ # Compute Y_diag (apply to values)
656
+ Y_diag = (M[..., None] * hidden_states[:, :, None]).sum(dim=3)
657
+
658
+ # 2. Compute the state for each intra-chunk
659
+ # (right term of low-rank factorization of off-diagonal blocks; B terms)
660
+ decay_states = torch.exp((A_cumsum[:, :, :, -1:] - A_cumsum))
661
+ B_decay = B * decay_states.permute(0, -2, -1, 1)[..., None]
662
+ states = (B_decay[..., None, :] * hidden_states[..., None]).sum(dim=2)
663
+
664
+ # 3. Compute the inter-chunk SSM recurrence; produces correct SSM states at chunk boundaries
665
+ # (middle term of factorization of off-diag blocks; A terms)
666
+ if cache_params is not None and cache_position is not None and cache_position[0] > 0:
667
+ previous_states = cache_params.ssm_states[self.layer_idx][:, None, ...].to(device=states.device)
668
+ else:
669
+ previous_states = torch.zeros_like(states[:, :1])
670
+ states = torch.cat([previous_states, states], dim=1)
671
+ decay_chunk = torch.exp(segment_sum(nn.functional.pad(A_cumsum[:, :, :, -1], (1, 0))))
672
+ decay_chunk = decay_chunk.transpose(1, 3)
673
+ new_states = (decay_chunk[..., None, None] * states[:, :, None, ...]).sum(dim=1)
674
+ states, ssm_state = new_states[:, :-1], new_states[:, -1]
675
+
676
+ # 4. Compute state -> output conversion per chunk
677
+ # (left term of low-rank factorization of off-diagonal blocks; C terms)
678
+ state_decay_out = torch.exp(A_cumsum)
679
+ C_times_states = (C[..., None, :] * states[:, :, None, ...])
680
+ state_decay_out_permuted = state_decay_out.permute(0, 2, 3, 1)
681
+ Y_off = (C_times_states.sum(-1) * state_decay_out_permuted[..., None])
682
+
683
+ # Add output of intra-chunk and inter-chunk terms (diagonal and off-diagonal blocks)
684
+ y = Y_diag + Y_off
685
+ # [bsz, -1, self.chunk_size, num_heads, head_dim] -> [bsz, (padded) seq_len, num_heads, head_dim]
686
+ y = y.reshape(batch_size, -1, self.num_heads, self.head_dim)
687
+
688
+ y = y + D_residual
689
+ # Cutting off padded chunks
690
+ if pad_size > 0:
691
+ y = y[:, :seq_len, :, :]
692
+ y = y.reshape(batch_size, seq_len, -1)
693
+
694
+ # Init cache
695
+ if ssm_state is not None and cache_params is not None:
696
+ cache_params.update_ssm_state(layer_idx=self.layer_idx, new_ssm_state=ssm_state)
697
+
698
+ scan_output = self.norm(y, gate)
699
+
700
+ # end ssd naive
701
+
702
+ # 4. Final linear projection
703
+ contextualized_states = self.out_proj(scan_output.to(dtype)) # [batch, seq_len, hidden_size]
704
+ return contextualized_states
705
+ # fmt: on
706
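+ # Reading aid for the naive SSD branch above (mirrors its numbered comments, adds nothing new):
+ #   1. intra-chunk (diagonal-block) outputs via the attention-like weights M built from C, B and the decay mask L
+ #   2. per-chunk end states accumulated from B-weighted inputs with exp(A_cumsum) decay
+ #   3. inter-chunk recurrence that carries states across chunk boundaries (seeded from the cache when decoding)
+ #   4. state-to-output conversion with C plus the D*x skip, then gated RMSNorm and the final out_proj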
+
707
+ def forward(
708
+ self,
709
+ hidden_states,
710
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
711
+ cache_position: Optional[torch.LongTensor] = None,
712
+ attention_mask: Optional[torch.Tensor] = None,
713
+ ):
714
+ if is_fast_path_available and "cuda" in self.in_proj.weight.device.type:
715
+ return self.cuda_kernels_forward(hidden_states, cache_params, cache_position, attention_mask)
716
+ dtype = hidden_states.dtype
717
+ if attention_mask is not None and attention_mask.shape[1] > 1 and attention_mask.shape[0] > 1:
718
+ # tune out hidden states for pad tokens, see https://github.com/state-spaces/mamba/issues/66
719
+ hidden_states = (hidden_states * attention_mask[:, :, None]).to(dtype)
720
+
721
+ return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
722
+
723
+
724
+ class NemotronHRMSNorm(nn.Module):
725
+ def __init__(self, hidden_size, eps=1e-6):
726
+ """
727
+ NemotronHRMSNorm is equivalent to T5LayerNorm and LlamaRMSNorm
728
+ """
729
+ super().__init__()
730
+ self.weight = nn.Parameter(torch.ones(hidden_size))
731
+ self.variance_epsilon = eps
732
+
733
+ def forward(self, hidden_states):
734
+ input_dtype = hidden_states.dtype
735
+ hidden_states = hidden_states.to(torch.float32)
736
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
737
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
738
+ # Weights are in float32
739
+ return (self.weight.to(torch.float32) * hidden_states).to(input_dtype)
740
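+ # i.e. y = weight * x / sqrt(mean(x**2) + eps), with the reduction done in float32 and the result cast back
+ # to the input dtype.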
+
741
+ class NemotronHBlock(nn.Module):
742
+ def __init__(self, config, layer_idx):
743
+ super().__init__()
744
+ self.config = config
745
+ self.layer_idx = layer_idx
746
+ self.residual_in_fp32 = config.residual_in_fp32
747
+ self.norm = NemotronHRMSNorm(config.hidden_size, eps=config.layer_norm_epsilon)
748
+
749
+ # M: Mamba2, *: Attention, -: MLP
750
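+ # e.g. a (hypothetical) pattern "M-M-M*M-M-" assigns one character per layer; `config.layers_block_type`
+ # carries the same information spelled out as "mamba" / "attention" / "mlp".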
+ self.block_type = config.layers_block_type[layer_idx]
751
+ if self.block_type == "mamba":
752
+ self.mixer = NemotronHMamba2Mixer(config, layer_idx=layer_idx)
753
+ elif self.block_type == "attention":
754
+ self.mixer = NEMOTRONH_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
755
+ elif self.block_type == "mlp":
756
+ self.mixer = NemotronHMLP(config, layer_idx=layer_idx)
757
+ else:
758
+ raise ValueError(f"Invalid layer pattern {config.hybrid_override_pattern[layer_idx]}")
759
+
760
+ def forward(
761
+ self,
762
+ hidden_states,
763
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
764
+ cache_position: Optional[torch.LongTensor] = None,
765
+ attention_mask: Optional[torch.Tensor] = None,
766
+ ):
767
+ with torch.cuda.stream(torch.cuda.default_stream(hidden_states.device)):
768
+ # * Use torch.cuda.stream() to avoid NaN issues when using multiple GPUs
769
+ residual = hidden_states
770
+ hidden_states = self.norm(hidden_states.to(dtype=self.norm.weight.dtype))
771
+ if self.residual_in_fp32:
772
+ residual = residual.to(torch.float32)
773
+
774
+ if self.block_type == "mamba":
775
+ hidden_states = self.mixer(
776
+ hidden_states, cache_params=cache_params, cache_position=cache_position
777
+ )
778
+ elif self.block_type == "attention":
779
+ hidden_states = self.mixer(
780
+ hidden_states, cache_position=cache_position
781
+ )
782
+ hidden_states = hidden_states[0]
783
+ elif self.block_type == "mlp":
784
+ hidden_states = self.mixer(
785
+ hidden_states
786
+ )
787
+ else:
788
+ raise ValueError(f"Invalid block_type: {self.block_type}")
789
+
790
+ hidden_states = residual + hidden_states
791
+ return hidden_states
792
+
793
+
794
+ # Copied from transformers.models.nemotron.modeling_nemotron Nemotron->NemotronH
795
+ class NemotronHMLP(nn.Module):
796
+ def __init__(self, config, layer_idx: Optional[int] = None):
797
+ super().__init__()
798
+ self.config = config
799
+ self.layer_idx = layer_idx
800
+ if layer_idx is None:
801
+ logger.warning_once(
802
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
803
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
804
+ "when creating this class."
805
+ )
806
+ self.hidden_size = config.hidden_size
807
+ #intermediate_size = config.expand * config.hidden_size
808
+ self.intermediate_size = config.intermediate_size
809
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
810
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
811
+ self.act_fn = ACT2FN[config.mlp_hidden_act]
812
+
813
+ def forward(self, x):
814
+ return self.down_proj(self.act_fn(self.up_proj(x)))
815
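+ # Note: this is a plain two-layer MLP, down_proj(act_fn(up_proj(x))); unlike Llama-style MLPs there is no
+ # separate gate projection.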
+
816
+
817
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
818
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
819
+ """
820
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
821
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
822
+ """
823
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
824
+ if n_rep == 1:
825
+ return hidden_states
826
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
827
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
828
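+ # Example with illustrative numbers: for 8 KV heads and 40 attention heads, n_rep = 40 // 8 = 5, so a
+ # (batch, 8, seq_len, head_dim) key/value tensor becomes (batch, 40, seq_len, head_dim), each KV head repeated 5x.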
+
829
+
830
+ class NemotronHAttention(nn.Module):
831
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
832
+
833
+ def __init__(self, config: NemotronHConfig, layer_idx: Optional[int] = None):
834
+ super().__init__()
835
+ self.config = config
836
+ self.layer_idx = layer_idx
837
+ if layer_idx is None:
838
+ logger.warning_once(
839
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
840
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
841
+ "when creating this class."
842
+ )
843
+
844
+ self.attention_dropout = config.attention_dropout
845
+ self.hidden_size = config.hidden_size
846
+ self.num_heads = config.num_attention_heads
847
+ if config.head_dim is not None:
848
+ self.head_dim = config.head_dim
849
+ else:
850
+ self.head_dim = config.hidden_size // config.num_attention_heads
851
+ self.num_key_value_heads = config.num_key_value_heads
852
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
853
+ self.max_position_embeddings = config.max_position_embeddings
854
+ self.is_causal = True
855
+
856
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
857
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
858
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
859
+ self.o_proj = nn.Linear(self.head_dim * self.num_heads, self.hidden_size, bias=config.attention_bias)
860
+
861
+ def forward(
862
+ self,
863
+ hidden_states: torch.Tensor,
864
+ # position_embeddings: Tuple[torch.Tensor, torch.Tensor], #TODO
865
+ attention_mask: Optional[torch.Tensor] = None,
866
+ position_ids: Optional[torch.LongTensor] = None,
867
+ past_key_value: Optional[HybridMambaAttentionDynamicCache] = None,
868
+ output_attentions: bool = False,
869
+ use_cache: bool = False,
870
+ cache_position: Optional[torch.LongTensor] = None,
871
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
872
+ bsz, q_len, _ = hidden_states.size()
873
+
874
+ query_states = self.q_proj(hidden_states)
875
+ key_states = self.k_proj(hidden_states)
876
+ value_states = self.v_proj(hidden_states)
877
+
878
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
879
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
880
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
881
+
882
+ if past_key_value is not None:
883
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx)
884
+
885
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
886
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
887
+
888
+ causal_mask = attention_mask
889
+ if attention_mask is not None: # no matter the length, we just slice it
890
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
891
+
892
+ if query_states.device.type == "cuda" and attention_mask is not None:
893
+ query_states = query_states.contiguous()
894
+ key_states = key_states.contiguous()
895
+ value_states = value_states.contiguous()
896
+
897
+ is_causal = True if causal_mask is None and q_len > 1 else False
898
+
899
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
900
+ query_states,
901
+ key_states,
902
+ value_states,
903
+ attn_mask=causal_mask,
904
+ dropout_p=self.attention_dropout if self.training else 0.0,
905
+ is_causal=is_causal,
906
+ )
907
+ attn_output = attn_output.transpose(1, 2).contiguous()
908
+ #attn_output = attn_output.view(bsz, q_len, self.hidden_size)
909
+ attn_output = attn_output.view(bsz, q_len, self.num_heads * self.head_dim)
910
+
911
+ attn_output = self.o_proj(attn_output)
912
+
913
+ return attn_output, None, past_key_value
914
+
915
+
916
+ # Adapted from transformers.models.jamba.modeling_jamba.JambaFlashAttention2 with Jamba->NemotronH
917
+ # (itself adapted from transformers.models.mistral.modeling_mistral.MistralFlashAttention2)
918
+ class NemotronHFlashAttention2(NemotronHAttention):
919
+ """
920
+ NemotronH flash attention module. This module inherits from `NemotronHAttention` as the weights of the module stay
921
+ untouched. The only required change is on the forward pass, where it needs to correctly call the public API of
922
+ flash attention and deal with padding tokens in case the input contains any of them.
923
+ """
924
+ def __init__(self, *args, **kwargs):
925
+ super().__init__(*args, **kwargs)
926
+
927
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
928
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
929
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
930
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
931
+
932
+ def forward(
933
+ self,
934
+ hidden_states: torch.Tensor,
935
+ attention_mask: Optional[torch.Tensor] = None,
936
+ position_ids: Optional[torch.LongTensor] = None,
937
+ past_key_value: Optional[HybridMambaAttentionDynamicCache] = None,
938
+ output_attentions: bool = False,
939
+ use_cache: bool = False,
940
+ cache_position: Optional[torch.LongTensor] = None,
941
+ **kwargs,
942
+ ):
943
+ bsz, q_len, _ = hidden_states.size()
944
+
945
+ query_states = self.q_proj(hidden_states)
946
+ key_states = self.k_proj(hidden_states)
947
+ value_states = self.v_proj(hidden_states)
948
+
949
+ # Flash attention requires the input to have the shape
950
+ # batch_size x seq_length x num_heads x head_dim
951
+ # therefore we just need to keep the original shape
952
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim)
953
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
954
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
955
+
956
+ if past_key_value is not None:
957
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx)
958
+
959
+ # repeat k/v heads if n_kv_heads < n_heads
960
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
961
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
962
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
963
+
964
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
965
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
966
+ # cast them back in float16 just to be sure everything works as expected.
967
+ input_dtype = query_states.dtype
968
+ if input_dtype == torch.float32:
969
+ if torch.is_autocast_enabled():
970
+ target_dtype = torch.get_autocast_gpu_dtype()
971
+ # Handle the case where the model is quantized
972
+ elif hasattr(self.config, "_pre_quantization_dtype"):
973
+ target_dtype = self.config._pre_quantization_dtype
974
+ else:
975
+ target_dtype = self.q_proj.weight.dtype
976
+
977
+ logger.warning_once(
978
+ f"The input hidden states seems to be silently casted in float32, this might be related to"
979
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
980
+ f" {target_dtype}."
981
+ )
982
+
983
+ query_states = query_states.to(target_dtype)
984
+ key_states = key_states.to(target_dtype)
985
+ value_states = value_states.to(target_dtype)
986
+
987
+ # Reshape to the expected shape for Flash Attention
988
+ key_states = key_states.transpose(1, 2)
989
+ value_states = value_states.transpose(1, 2)
990
+
991
+ attn_output = _flash_attention_forward(
992
+ query_states,
993
+ key_states,
994
+ value_states,
995
+ attention_mask,
996
+ q_len,
997
+ dropout=dropout_rate,
998
+ sliding_window=getattr(self.config, "sliding_window", None),
999
+ is_causal=self.is_causal,
1000
+ use_top_left_mask=self._flash_attn_uses_top_left_mask,
1001
+ )
1002
+
1003
+ #attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
1004
+ attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.head_dim).contiguous()
1005
+ attn_output = self.o_proj(attn_output)
1006
+
1007
+ if not output_attentions:
1008
+ attn_weights = None
1009
+
1010
+ return attn_output, attn_weights, past_key_value
1011
+
1012
+
1013
+ # Adapted from transformers.models.jamba.modeling_jamba.JambaSdpaAttention with Jamba->NemotronH
1014
+ # (itself adapted from transformers.models.mistral.modeling_mistral.MistralSdpaAttention)
1015
+ class NemotronHSdpaAttention(NemotronHAttention):
1016
+ """
1017
+ NemotronH attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
1018
+ `NemotronHAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
1019
+ the SDPA API.
1020
+ """
1021
+
1022
+ # Adapted from NemotronHAttention.forward
1023
+ def forward(
1024
+ self,
1025
+ hidden_states: torch.Tensor,
1026
+ attention_mask: Optional[torch.Tensor] = None,
1027
+ position_ids: Optional[torch.LongTensor] = None,
1028
+ past_key_value: Optional[HybridMambaAttentionDynamicCache] = None,
1029
+ output_attentions: bool = False,
1030
+ use_cache: bool = False,
1031
+ cache_position: Optional[torch.LongTensor] = None,
1032
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
1033
+ if output_attentions:
1034
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
1035
+ logger.warning_once(
1036
+ "NemotronHModel is using NemotronHSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
1037
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
1038
+ )
1039
+ return super().forward(
1040
+ hidden_states=hidden_states,
1041
+ attention_mask=attention_mask,
1042
+ position_ids=position_ids,
1043
+ past_key_value=past_key_value,
1044
+ output_attentions=output_attentions,
1045
+ use_cache=use_cache,
1046
+ )
1047
+
1048
+ bsz, q_len, _ = hidden_states.size()
1049
+
1050
+ query_states = self.q_proj(hidden_states)
1051
+ key_states = self.k_proj(hidden_states)
1052
+ value_states = self.v_proj(hidden_states)
1053
+
1054
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
1055
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1056
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1057
+
1058
+ if past_key_value is not None:
1059
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx)
1060
+
1061
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
1062
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
1063
+
1064
+ causal_mask = attention_mask
1065
+ if attention_mask is not None:
1066
+ causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]
1067
+
1068
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
1069
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
1070
+ if query_states.device.type == "cuda" and attention_mask is not None:
1071
+ query_states = query_states.contiguous()
1072
+ key_states = key_states.contiguous()
1073
+ value_states = value_states.contiguous()
1074
+
1075
+ # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
1076
+ # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
1077
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
1078
+ is_causal = True if self.is_causal and causal_mask is None and q_len > 1 else False
1079
+
1080
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
1081
+ query_states,
1082
+ key_states,
1083
+ value_states,
1084
+ attn_mask=causal_mask,
1085
+ dropout_p=self.attention_dropout if self.training else 0.0,
1086
+ is_causal=is_causal,
1087
+ )
1088
+
1089
+ attn_output = attn_output.transpose(1, 2).contiguous()
1090
+ attn_output = attn_output.view(bsz, q_len, self.num_heads * self.head_dim)  # keep consistent with the eager/flash paths (head_dim may differ from hidden_size // num_heads)
1091
+
1092
+ attn_output = self.o_proj(attn_output)
1093
+
1094
+ return attn_output, None, past_key_value
1095
+
1096
+
1097
+ NEMOTRONH_ATTENTION_CLASSES = {
1098
+ "eager": NemotronHAttention,
1099
+ "flash_attention_2": NemotronHFlashAttention2,
1100
+ "sdpa": NemotronHSdpaAttention,
1101
+ }
1102
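+ # NemotronHBlock instantiates one of these via `NEMOTRONH_ATTENTION_CLASSES[config._attn_implementation](...)`,
+ # so loading the model with e.g. `attn_implementation="sdpa"` routes every attention layer through
+ # NemotronHSdpaAttention.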
+
1103
+ # Copied from transformers.models.mamba.modeling_mamba2.Mamba2PreTrainedModel
1104
+ class NemotronHPreTrainedModel(PreTrainedModel):
1105
+ """
1106
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
1107
+ models.
1108
+ """
1109
+
1110
+ config_class = NemotronHConfig
1111
+ base_model_prefix = "backbone"
1112
+ _no_split_modules = ["NemotronHBlock"]
1113
+ supports_gradient_checkpointing = True
1114
+ _is_stateful = True
1115
+
1116
+ def _init_weights(self, module):
1117
+ """Initialize the weights."""
1118
+ if isinstance(module, NemotronHMamba2Mixer):
1119
+ module.A_log._no_weight_decay = True
1120
+ module.D._no_weight_decay = True
1121
+
1122
+ dt = torch.exp(
1123
+ torch.rand(self.config.mamba_num_heads)
1124
+ * (math.log(self.config.time_step_max) - math.log(self.config.time_step_min))
1125
+ + math.log(self.config.time_step_min)
1126
+ ).clamp(min=self.config.time_step_floor)
1127
+
1128
+ # Inverse of softplus: https://github.com/pytorch/pytorch/issues/72759
1129
+ inv_dt = dt + torch.log(-torch.expm1(-dt))
1130
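+ # sanity check: softplus(inv_dt) = log(1 + exp(dt) * (1 - exp(-dt))) = log(exp(dt)) = dt,
+ # so at initialization softplus(dt_bias) reproduces the sampled dt exactly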
+ with torch.no_grad():
1131
+ module.dt_bias.copy_(inv_dt)
1132
+ module.dt_bias._no_reinit = True
1133
+
1134
+ if isinstance(module, nn.Linear):
1135
+ if module.bias is not None:
1136
+ if not getattr(module.bias, "_no_reinit", False):
1137
+ nn.init.zeros_(module.bias)
1138
+ elif isinstance(module, nn.Embedding):
1139
+ nn.init.normal_(module.weight, std=self.config.initializer_range)
1140
+
1141
+ # TODO: Check
1142
+ if self.config.rescale_prenorm_residual:
1143
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
1144
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
1145
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
1146
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
1147
+ #
1148
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
1149
+ for name, p in module.named_parameters():
1150
+ if name in ["out_proj.weight"]:
1151
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
1152
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
1153
+ # We need to reinit p since this code could be called multiple times
1154
+ # Having just p *= scale would repeatedly scale it down
1155
+ nn.init.kaiming_uniform_(p, a=math.sqrt(5))
1156
+ with torch.no_grad():
1157
+ p /= math.sqrt(self.config.num_hidden_layers)
1158
+
1159
+
1160
+ @dataclass
1161
+ # Copied from transformers.models.mamba.modeling_mamba2.Mamba2Output with MAMBA2->NemotronH,Mamba2->NemotronH
1162
+ class NemotronHOutput(ModelOutput):
1163
+ """
1164
+ Class for the NemotronH model outputs.
1165
+
1166
+ Args:
1167
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
1168
+ Sequence of hidden-states at the output of the last layer of the model.
1169
+ cache_params (`HybridMambaAttentionDynamicCache`):
1170
+ The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
1171
+ avoid providing the old `input_ids`.
1172
+
1173
+ Includes both the State space model state matrices after the selective scan, and the Convolutional states
1174
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
1175
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
1176
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
1177
+
1178
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
1179
+ """
1180
+
1181
+ last_hidden_state: Optional[torch.FloatTensor] = None
1182
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None
1183
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
1184
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
1185
+
1186
+
1187
+ @dataclass
1188
+ # Copied from transformers.models.mamba2.modeling_mamba2.MambaCausalLMOutput with Mamba2->NemotronH
1189
+ class NemotronHCausalLMOutput(ModelOutput):
1190
+ """
1191
+ Base class for causal language model (or autoregressive) outputs.
1192
+
1193
+ Args:
1194
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
1195
+ Language modeling loss (for next-token prediction).
1196
+ logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
1197
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
1198
+ cache_params (`HybridMambaAttentionDynamicCache`):
1199
+ The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
1200
+ avoid providing the old `input_ids`.
1201
+
1202
+ Includes both the State space model state matrices after the selective scan, and the Convolutional states
1203
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
1204
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
1205
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
1206
+
1207
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
1208
+ """
1209
+
1210
+ loss: Optional[torch.FloatTensor] = None
1211
+ logits: Optional[torch.FloatTensor] = None
1212
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None
1213
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
1214
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
1215
+
1216
+
1217
+ NEMOTRONH_START_DOCSTRING = r"""
1218
+
1219
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1220
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
1221
+ etc.)
1222
+
1223
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1224
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
1225
+ and behavior.
1226
+
1227
+ Parameters:
1228
+ config ([`NemotronHConfig`]): Model configuration class with all the parameters of the model.
1229
+ Initializing with a config file does not load the weights associated with the model, only the
1230
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1231
+ """
1232
+
1233
+ NEMOTRONH_INPUTS_DOCSTRING = r"""
1234
+ Args:
1235
+ input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*):
1236
+ Indices of input sequence tokens in the vocabulary.
1237
+
1238
+ If `cache_params.seqlen_offset>0`, only `input_ids` that do not have their past calculated should be passed as
1239
+ `input_ids`.
1240
+
1241
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1242
+ [`PreTrainedTokenizer.__call__`] for details.
1243
+
1244
+ [What are input IDs?](../glossary#input-ids)
1245
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1246
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1247
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1248
+ model's internal embedding lookup matrix.
1249
+ position_ids (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1250
+ Indices of positions of each input sequence tokens in the position embeddings.
1251
+ cache_params (`HybridMambaAttentionDynamicCache`, *optional*):
1252
+ If passed along, the model uses the previous state in all the blocks (which will give the output for the
1253
+ `input_ids` provided as if the model adds `state_input_ids + input_ids` as context).
1254
+ use_cache (`bool`, *optional*):
1255
+ If set to `True`, the `cache_params` is returned and can be used to quickly generate the next logits.
1256
+ output_attentions (`bool`, *optional*):
1257
+ Whether or not to return the attentions tensors of all attention layers.
1258
+ output_hidden_states (`bool`, *optional*):
1259
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1260
+ more detail.
1261
+ return_dict (`bool`, *optional*):
1262
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1263
+ cache_position (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1264
+ The position of the current input in the cache. This is used to ensure that the cache is correctly updated.
1265
+ If `cache_params` is passed, `cache_position` should also be passed.
1266
+ attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
1267
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1268
+
1269
+ - 1 for tokens that are **not masked**,
1270
+ - 0 for tokens that are **masked**.
1271
+
1272
+ [What are attention masks?](../glossary#attention-mask)
1273
+ """
1274
+
1275
+
1276
+ @add_start_docstrings(
1277
+ "The bare NemotronH Model transformer outputting raw hidden-states without any specific head on top.",
1278
+ NEMOTRONH_START_DOCSTRING,
1279
+ )
1280
+ class NemotronHModel(NemotronHPreTrainedModel):
1281
+ def __init__(self, config):
1282
+ super().__init__(config)
1283
+
1284
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
1285
+ self.layers = nn.ModuleList([NemotronHBlock(config, layer_idx=idx) for idx in range(config.num_hidden_layers)])
1286
+
1287
+ self.gradient_checkpointing = False
1288
+ self.norm_f = NemotronHRMSNorm(config.hidden_size, eps=config.layer_norm_epsilon)
1289
+ # Initialize weights and apply final processing
1290
+ self._register_load_state_dict_pre_hook(self.load_hook)
1291
+ self.post_init()
1292
+
1293
+ def load_hook(self, state_dict, prefix, *args):
1294
+ for k in state_dict:
1295
+ if "embedding." in k:
1296
+ state_dict[k.replace("embedding.", "embeddings.")] = state_dict.pop(k)
1297
+ break
1298
+
1299
+ def get_input_embeddings(self):
1300
+ return self.embeddings
1301
+
1302
+ def set_input_embeddings(self, new_embeddings):
1303
+ self.embeddings = new_embeddings
1304
+
1305
+ @add_start_docstrings_to_model_forward(NEMOTRONH_INPUTS_DOCSTRING)
1306
+ @add_code_sample_docstrings(
1307
+ checkpoint=_CHECKPOINT_FOR_DOC,
1308
+ output_type=NemotronHOutput,
1309
+ config_class=_CONFIG_FOR_DOC,
1310
+ )
1311
+ def forward(
1312
+ self,
1313
+ input_ids: Optional[torch.LongTensor] = None,
1314
+ inputs_embeds: Optional[torch.LongTensor] = None,
1315
+ position_ids: Optional[torch.LongTensor] = None,
1316
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
1317
+ use_cache: Optional[bool] = None,
1318
+ output_attentions: Optional[bool] = None,
1319
+ output_hidden_states: Optional[bool] = None,
1320
+ return_dict: Optional[bool] = None,
1321
+ cache_position: Optional[torch.LongTensor] = None,
1322
+ attention_mask: Optional[torch.Tensor] = None,
1323
+ **kwargs,
1324
+ ) -> Union[Tuple, NemotronHOutput]:
1325
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1326
+ output_hidden_states = (
1327
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1328
+ )
1329
+ # use_cache = use_cache if use_cache is not None else self.config.use_cache
1330
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
1331
+
1332
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1333
+
1334
+ if (input_ids is None) ^ (inputs_embeds is not None): # ^ is python for xor
1335
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
1336
+
1337
+ if inputs_embeds is None:
1338
+ inputs_embeds = self.embeddings(input_ids)
1339
+
1340
+ if self.gradient_checkpointing and self.training and use_cache:
1341
+ logger.warning_once(
1342
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
1343
+ )
1344
+ use_cache = False
1345
+
1346
+ # From zamba_modeling.py
1347
+ if use_cache and cache_params is None:
1348
+ logger.warning_once(
1349
+ "NemotronH requires an initialized `NemotronHHybridDynamicCache` to return a cache. None was "
1350
+ "provided, so no cache will be returned."
1351
+ )
1352
+
1353
+ hidden_states = inputs_embeds
1354
+
1355
+ if cache_position is None:
1356
+ cache_position = torch.arange(hidden_states.shape[1], device=hidden_states.device)
1357
+ if position_ids is None:
1358
+ position_ids = cache_position.unsqueeze(0)
1359
+
1360
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
1361
+ mamba_mask = self._update_mamba_mask(attention_mask, cache_position)
1362
+
1363
+ all_hidden_states = () if output_hidden_states else None
1364
+ all_self_attns = () if output_attentions else None
1365
+ # Until HERE
1366
+
1367
+ for layer_idx, mixer_block in enumerate(self.layers):
1368
+ # Depending on the layer type we opt for 2D base attention mask (Mamba) or 4D causal mask (Attention)
1369
+ if mixer_block.block_type == "mamba":
1370
+ layer_mask = mamba_mask
1371
+ elif mixer_block.block_type == "attention":
1372
+ layer_mask = causal_mask
1373
+ elif mixer_block.block_type == "mlp":
1374
+ layer_mask = None
1375
+ else:
1376
+ raise ValueError(f"Invalid block_type: {self.block_type}")
1377
+
1378
+ if output_hidden_states:
1379
+ all_hidden_states += (hidden_states,)
1380
+
1381
+ if self.gradient_checkpointing and self.training:
1382
+ hidden_states = self._gradient_checkpointing_func(
1383
+ mixer_block.__call__, hidden_states, cache_params, cache_position, layer_mask
1384
+ )
1385
+ else:
1386
+ hidden_states = mixer_block(
1387
+ hidden_states,
1388
+ cache_params=cache_params,
1389
+ cache_position=cache_position,
1390
+ attention_mask=layer_mask,
1391
+ )
1392
+
1393
+ # TODO: Store attentions
1394
+ # if output_attentions:
1395
+ # if layer_outputs[1] is not None:
1396
+ # # append attentions only of attention layers. Mamba layers return `None` as the attention weights
1397
+ # all_self_attns += (layer_outputs[1],)
1398
+
1399
+ # TODO (Check): should it happen before the forward pass?
1400
+ # if output_hidden_states:
1401
+ # all_hidden_states = all_hidden_states + (hidden_states,)
1402
+
1403
+ hidden_states = self.norm_f(hidden_states)
1404
+
1405
+ if output_hidden_states:
1406
+ all_hidden_states = all_hidden_states + (hidden_states,)
1407
+
1408
+ if not return_dict:
1409
+ return tuple(v for v in [hidden_states, cache_params, all_hidden_states] if v is not None)
1410
+
1411
+ return NemotronHOutput(
1412
+ last_hidden_state=hidden_states,
1413
+ cache_params=cache_params if use_cache else None,
1414
+ hidden_states=all_hidden_states,
1415
+ attentions=all_self_attns,
1416
+ )
1417
+
1418
+ # Copied from transformers.models.jamba.modeling_jamba.JambaModel._update_causal_mask
1419
+ def _update_causal_mask(self, attention_mask, input_tensor, cache_position):
1420
+ if self.config._attn_implementation == "flash_attention_2":
1421
+ if attention_mask is not None and 0.0 in attention_mask:
1422
+ return attention_mask
1423
+ return None
1424
+
1425
+ dtype, device = input_tensor.dtype, input_tensor.device
1426
+ min_dtype = torch.finfo(dtype).min
1427
+ sequence_length = input_tensor.shape[1]
1428
+ target_length = cache_position[-1] + 1
1429
+
1430
+ causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
1431
+ if sequence_length != 1:
1432
+ causal_mask = torch.triu(causal_mask, diagonal=1)
1433
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1434
+ causal_mask = causal_mask[None, None, :, :].expand(input_tensor.shape[0], 1, -1, -1)
1435
+ if attention_mask is not None:
1436
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1437
+ if attention_mask.dim() == 2:
1438
+ mask_length = attention_mask.shape[-1]
1439
+ padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
1440
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
1441
+
1442
+ if (
1443
+ self.config._attn_implementation == "sdpa"
1444
+ and attention_mask is not None
1445
+ and attention_mask.device.type == "cuda"
1446
+ ):
1447
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1448
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1449
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1450
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1451
+
1452
+ return causal_mask
1453
+
1454
+ def _update_mamba_mask(self, attention_mask, cache_position):
1455
+ """
1456
+ No need for zeroing states when
1457
+ 1. Cached forward
1458
+ 2. Attending to all inputs
1459
+ """
1460
+ mamba_mask = attention_mask
1461
+ if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
1462
+ mamba_mask = None
1463
+ return mamba_mask
1464
+
1465
+
1466
+ @add_start_docstrings(
1467
+ """
1468
+ The NEMOTRONH Model transformer with a language modeling head on top (linear layer with weights not tied to the input
1469
+ embeddings).
1470
+ """,
1471
+ NEMOTRONH_START_DOCSTRING,
1472
+ )
1473
+ class NemotronHForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
1474
+ _tied_weights_keys = ["lm_head.weight"]
1475
+
1476
+ def __init__(self, config):
1477
+ super().__init__(config)
1478
+ self.backbone = NemotronHModel(config)
1479
+ self.vocab_size = config.vocab_size
1480
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1481
+
1482
+ # Initialize weights and apply final processing
1483
+ self.post_init()
1484
+
1485
+ def get_input_embeddings(self):
1486
+ return self.backbone.get_input_embeddings()
1487
+
1488
+ def set_input_embeddings(self, new_embeddings):
1489
+ return self.backbone.set_input_embeddings(new_embeddings)
1490
+
1491
+ def get_output_embeddings(self):
1492
+ return self.lm_head
1493
+
1494
+ def set_output_embeddings(self, new_embeddings):
1495
+ self.lm_head = new_embeddings
1496
+
1497
+ def get_decoder(self):
1498
+ return self.backbone
1499
+
1500
+ def set_decoder(self, decoder):
1501
+ self.backbone = decoder
1502
+
1503
+ def prepare_inputs_for_generation(
1504
+ self,
1505
+ input_ids,
1506
+ past_key_values=None,
1507
+ attention_mask=None,
1508
+ inputs_embeds=None,
1509
+ cache_position=None,
1510
+ position_ids=None,
1511
+ use_cache=True,
1512
+ **kwargs,
1513
+ ):
1514
+ # Copy from https://github.com/huggingface/transformers/blob/main/src/transformers/models/jamba/modeling_jamba.py
1515
+ # Overwritten -- uses `cache_params` as opposed to `past_key_values`
1516
+ empty_past_kv = past_key_values is None
1517
+
1518
+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
1519
+ # Exception 1: when passing input_embeds, input_ids may be missing entries
1520
+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
1521
+ # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case.
1522
+ # (we can't check exception 3 while compiling)
1523
+ if not empty_past_kv:
1524
+ if (
1525
+ inputs_embeds is not None # Exception 1
1526
+ or cache_position[-1] >= input_ids.shape[1] # Exception 3
1527
+ ):
1528
+ input_ids = input_ids[:, -cache_position.shape[0] :]
1529
+ elif input_ids.shape[1] != cache_position.shape[0]: # Default case (the "else", a no op, is Exception 2)
1530
+ input_ids = input_ids[:, cache_position]
1531
+ else:
1532
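+ # First generation step: allocate a fresh hybrid cache (Mamba conv/SSM states plus attention key/value states).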
+ past_key_values = HybridMambaAttentionDynamicCache(
1533
+ self.config, input_ids.shape[0], self.dtype, device=self.device
1534
+ )
1535
+
1536
+ if attention_mask is not None and position_ids is None:
1537
+ # create position_ids on the fly for batch generation
1538
+ position_ids = attention_mask.long().cumsum(-1) - 1
1539
+ position_ids.masked_fill_(attention_mask == 0, 1)
1540
+ if not empty_past_kv:
1541
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1542
+
1543
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1544
+ if inputs_embeds is not None and empty_past_kv:
1545
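+ # If the provided embeddings cover only a prefix of `input_ids` (e.g. pre-computed multimodal embeddings),
+ # embed the remaining text tokens and append them so `inputs_embeds` covers the full prompt.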
+ if input_ids is not None and inputs_embeds.shape[1] < input_ids.shape[1]:
1546
+ new_token_embeds = self.get_input_embeddings()(input_ids[:,inputs_embeds.shape[1]:])
1547
+ inputs_embeds = torch.cat([inputs_embeds, new_token_embeds], dim=1)
1548
+ model_inputs = {"inputs_embeds": inputs_embeds}
1549
+ else:
1550
+ model_inputs = {"input_ids": input_ids.contiguous()} # `contiguous()` needed for compilation use cases
1551
+
1552
+ model_inputs.update(
1553
+ {
1554
+ "position_ids": position_ids,
1555
+ "past_key_values": past_key_values,
1556
+ "use_cache": use_cache,
1557
+ "attention_mask": attention_mask,
1558
+ "logits_to_keep": self.config.num_logits_to_keep,
1559
+ "cache_position": cache_position,
1560
+ }
1561
+ )
1562
+ return model_inputs
1563
+
1564
+ @add_start_docstrings_to_model_forward(NEMOTRONH_INPUTS_DOCSTRING)
1565
+ @add_code_sample_docstrings(
1566
+ checkpoint=_CHECKPOINT_FOR_DOC,
1567
+ output_type=NemotronHCausalLMOutput,
1568
+ config_class=_CONFIG_FOR_DOC,
1569
+ )
1570
+ def forward(
1571
+ self,
1572
+ input_ids: Optional[torch.LongTensor] = None,
1573
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1574
+ position_ids: Optional[torch.LongTensor] = None,
1575
+ cache_params: Optional[HybridMambaAttentionDynamicCache] = None,
1576
+ labels: Optional[torch.LongTensor] = None,
1577
+ output_attentions: Optional[bool] = None,
1578
+ output_hidden_states: Optional[bool] = None,
1579
+ return_dict: Optional[bool] = None,
1580
+ use_cache: Optional[bool] = None,
1581
+ cache_position: Optional[torch.Tensor] = None,
1582
+ attention_mask: Optional[torch.Tensor] = None,
1583
+ **kwargs, # for now we need this for generation
1584
+ ) -> Union[Tuple, NemotronHCausalLMOutput]:
1585
+ r"""
1586
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1587
+ Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
1588
+ `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to `-100`
1589
+ are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size]`.
1590
+ """
1591
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1592
+
1593
+ output_hidden_states = (
1594
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1595
+ )
1596
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1597
+
1598
+ nemotron_h_outputs = self.backbone(
1599
+ input_ids,
1600
+ cache_params=cache_params,
1601
+ inputs_embeds=inputs_embeds,
1602
+ output_attentions=output_attentions,
1603
+ output_hidden_states=output_hidden_states,
1604
+ return_dict=return_dict,
1605
+ use_cache=use_cache,
1606
+ cache_position=cache_position,
1607
+ attention_mask=attention_mask,
1608
+ )
1609
+ hidden_states = nemotron_h_outputs[0]
1610
+
1611
+ # TODO: Check zamba_modeling.py: https://github.com/huggingface/transformers/blob/d7188ba600e36d3fd191b12e19f1b3bb81a8404f/src/transformers/models/zamba/modeling_zamba.py#L1284C1-L1286C2
1612
+ #logits = self.lm_head(hidden_states.to(self.lm_head.weight.dtype)).float()
1613
+ logits = self.lm_head(hidden_states.to(self.lm_head.weight.dtype)).float()
1614
+
1615
+ loss = None
1616
+ if labels is not None:
1617
+ # move labels to correct device to enable model parallelism
1618
+ labels = labels.to(logits.device)
1619
+ # Shift so that tokens < n predict n
1620
+ shift_logits = logits[..., :-1, :].contiguous()
1621
+ shift_labels = labels[..., 1:].contiguous()
1622
+ # Flatten the tokens
1623
+ loss_fct = CrossEntropyLoss()
1624
+ loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
1625
+
1626
+ if not return_dict:
1627
+ output = (logits,) + nemotron_h_outputs[1:]
1628
+ return ((loss,) + output) if loss is not None else output
1629
+
1630
+ return NemotronHCausalLMOutput(
1631
+ loss=loss,
1632
+ logits=logits,
1633
+ cache_params=nemotron_h_outputs.cache_params,
1634
+ hidden_states=nemotron_h_outputs.hidden_states,
1635
+ attentions=nemotron_h_outputs.attentions,
1636
+ )
nano_v2_inference_chat_template.jinja ADDED
@@ -0,0 +1,125 @@
1
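+ {#- Pass 1: scan user and system turns for /think or /no_think to decide whether the generation prompt opens a <think> block. -#}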
+ {%- set ns = namespace(enable_thinking=true) -%}
2
+ {%- for message in messages -%}
3
+ {%- if message['content'] is string -%}
4
+ {%- if message['role'] == 'user' or message['role'] == 'system' -%}
5
+ {%- if '/think' in message['content'] -%}
6
+ {%- set ns.enable_thinking = true -%}
7
+ {%- elif '/no_think' in message['content'] -%}
8
+ {%- set ns.enable_thinking = false -%}
9
+ {%- endif -%}
10
+ {%- endif -%}
11
+ {%- else -%}
12
+ {%- for content in message['content'] -%}
13
+ {%- if content['type'] == 'text' -%}
14
+ {%- if message['role'] == 'user' or message['role'] == 'system' -%}
15
+ {%- if '/think' in content['text'] -%}
16
+ {%- set ns.enable_thinking = true -%}
17
+ {%- elif '/no_think' in content['text'] -%}
18
+ {%- set ns.enable_thinking = false -%}
19
+ {%- endif -%}
20
+ {%- endif -%}
21
+ {%- endif -%}
22
+ {%- endfor -%}
23
+ {%- endif -%}
24
+ {%- endfor -%}
25
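+ {#- Pass 2: render the conversation; /think and /no_think markers are stripped from system and user text. -#}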
+ {%- for message in messages -%}
26
+ {%- if loop.first -%}
27
+ {%- if message['role'] != 'system' -%}
28
+ {{- '<SPECIAL_10>System\n\n' }}
29
+ {%- endif -%}
30
+ {%- endif -%}
31
+
32
+ {%- if message['role'] == 'system' -%}
33
+ {{- '<SPECIAL_10>System\n' }}
34
+ {%- if message['content'] is string -%}
35
+ {{- message['content'].replace('/think', '').replace('/no_think', '').strip() }}
36
+ {%- else -%}
37
+ {%- for content in message['content'] -%}
38
+ {%- if content['type'] == 'image' -%}
39
+ {{- '' }}
40
+ {%- elif content['type'] == 'text' -%}
41
+ {{- content['text'].replace('/think', '').replace('/no_think', '').strip() }}
42
+ {%- endif -%}
43
+ {%- endfor -%}
44
+ {%- endif -%}
45
+
46
+ {%- if tools -%}
47
+ {%- if message['content'].replace('/think', '').replace('/no_think', '').strip() != '' -%}
48
+ {{- '\n\n' }}
49
+ {%- endif -%}
50
+
51
+ {{- 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' }}
52
+ {%- for tool in tools -%}
53
+ {{- (tool.function if tool.function is defined else tool) | tojson -}}
54
+ {{- ', ' if not loop.last else '' -}}
55
+ {%- endfor -%}
56
+ {{- ']</AVAILABLE_TOOLS>\n\nIf you decide to call any tool(s), use the following format:\n<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, {{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\nThe user will execute tool-calls and return responses from tool(s) in this format:\n<TOOL_RESPONSE>[{{\"tool_response1\"}}, {{\"tool_response2\"}}]</TOOL_RESPONSE>\n\nBased on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
57
+ {%- endif -%}
58
+ {{- '\n' -}}
59
+
60
+ {%- elif message['role'] == 'user' -%}
61
+ {{- '<SPECIAL_11>User\n' }}
62
+ {%- if message['content'] is string -%}
63
+ {{- message['content'].replace('/think', '').replace('/no_think', '').strip() }}
64
+ {%- else -%}
65
+ {%- for content in message['content'] -%}
66
+ {%- if content['type'] == 'image' -%}
67
+ {{- '' }}
68
+ {%- elif content['type'] == 'text' -%}
69
+ {{- content['text'].replace('/think', '').replace('/no_think', '').strip() }}
70
+ {%- endif -%}
71
+ {%- endfor -%}
72
+ {%- endif -%}
73
+ {{- '\n' -}}
74
+
75
+ {%- elif message['role'] == 'tool' -%}
76
+ {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
77
+ {{- '<SPECIAL_11>User\n<TOOL_RESPONSE>[' }}
78
+ {%- endif -%}
79
+
80
+ {{- message.content }}
81
+ {{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
82
+
83
+ {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
84
+ {{- ']</TOOL_RESPONSE>\n' -}}
85
+ {%- endif -%}
86
+
87
+ {%- elif message['role'] == 'assistant' -%}
88
+ {%- set content = message['content'] -%}
+ {%- if '</think>' in content -%}
89
+ {%- set content = content.split('</think>')[1].strip() -%}
90
+ {%- endif -%}
91
+
92
+ {{- '<SPECIAL_11>Assistant\n' + content.strip() }}
93
+
94
+ {%- if message.tool_calls -%}
95
+ {%- if content.strip() != '' -%}
96
+ {{- '\n\n' -}}
97
+ {%- endif -%}
98
+
99
+ {{- '<TOOLCALL>[' -}}
100
+ {%- for call in message.tool_calls -%}
101
+ {%- set fn = call.function if call.function is defined else call -%}
102
+ {{- '{"name": "' + fn.name + '", "arguments": ' -}}
103
+ {%- if fn.arguments is string -%}
104
+ {{- fn.arguments -}}
105
+ {%- else -%}
106
+ {{- fn.arguments | tojson -}}
107
+ {%- endif -%}
108
+ {{- '}' + (', ' if not loop.last else '') -}}
109
+ {%- endfor -%}
110
+ {{- ']</TOOLCALL>' -}}
111
+ {%- endif -%}
112
+ {{- '\n<SPECIAL_12>\n' -}}
113
+ {%- endif -%}
114
+ {%- endfor -%}
115
+ {%- if not add_generation_prompt is defined -%}
116
+ {%- set add_generation_prompt = false -%}
117
+ {%- endif -%}
118
+ {%- if add_generation_prompt -%}
119
+ {{- '<SPECIAL_11>Assistant\n' }}
120
+ {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
121
+ {{- '<think></think>' }}
122
+ {%- else -%}
123
+ {{- '<think>\n' }}
124
+ {%- endif -%}
125
+ {%- endif -%}
nano_v2_llm_template.jinja ADDED
@@ -0,0 +1 @@
1
+ {%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'system' %}{{- '<SPECIAL_10>System\n' + content.replace('/think', '').replace('/no_think', '').strip() }}{%- if tools -%}{%- if content.replace('/think', '').replace('/no_think', '').strip() != '' -%}{{- '\n\n' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\nIf you decide to call any tool(s), use the following format:\n<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, {{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\nThe user will execute tool-calls and return responses from tool(s) in this format:\n<TOOL_RESPONSE>[{{\"tool_response1\"}}, {{\"tool_response2\"}}]</TOOL_RESPONSE>\n\nBased on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '\n' -}}{%- elif message['role'] == 'user' %}{{- '<SPECIAL_11>User\n' + content.replace('/think', '').replace('/no_think', '').strip() + '\n' }}{%- elif message['role'] == 'tool' %}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message.content -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>\n' -}}{%- endif -%}{%- elif message['role'] == 'assistant' %}{%- if '</think>' in content %}{%- set content = content.split('</think>')[1].strip() %}{%- endif %}{{- '<SPECIAL_11>Assistant\n' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '\n\n' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif %}{%- endfor %}{%- set ns = namespace(enable_thinking=true) %}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' or message['role'] == 'system' %}{%- if '/think' in content %}{%- set ns.enable_thinking = true %}{%- elif '/no_think' in content %}{%- set ns.enable_thinking = false %}{%- endif %}{%- endif %}{%- endfor %}{%- if add_generation_prompt %}{{- '<SPECIAL_11>Assistant\n' }}{%- if ns.enable_thinking is defined and ns.enable_thinking is false %}{{- '<think></think>' }}{%- else %}{{- '<think>\n' }}{%- endif %}{%- endif %}
non_reasoning_nano_v2_inference_chat_template.jinja ADDED
@@ -0,0 +1,118 @@
1
+ {%- for message in messages -%}
2
+ {%- if loop.first -%}
3
+ {%- if message['role'] != 'system' -%}
4
+ {{ '<SPECIAL_10>System\n\n' }}
5
+ {%- endif -%}
6
+ {%- endif -%}
7
+
8
+ {%- if message['role'] == 'system' -%}
9
+ {{ '<SPECIAL_10>System\n' }}
10
+ {%- if message['content'] is string -%}
11
+ {{ message['content'].replace('/think', '').replace('/no_think', '').strip() }}
12
+ {%- else -%}
13
+ {%- for content in message['content'] -%}
14
+ {%- if content['type'] == 'image' -%}
15
+ {{ '' }}
16
+ {%- elif content['type'] == 'text' -%}
17
+ {{ content['text'].replace('/think', '').replace('/no_think', '').strip() }}
18
+ {%- endif -%}
19
+ {%- endfor -%}
20
+ {%- endif -%}
21
+ {{ '\n' }}
22
+
23
+ {%- if tools -%}
24
+ {%- if message['content'].replace('/think', '').replace('/no_think', '').strip() != '' -%}
25
+ {{ '\n\n' }}
26
+ {%- endif -%}
27
+
28
+ {{ 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' }}
29
+ {%- for tool in tools -%}
30
+ {{- (tool.function if tool.function is defined else tool) | tojson -}}
31
+ {{ ', ' if not loop.last else '' }}
32
+ {%- endfor -%}
33
+ {{ ']</AVAILABLE_TOOLS>\n\n' }}
34
+
35
+ {{ 'If you decide to call any tool(s), use the following format:\n' }}
36
+ {{ '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, {{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>\n\n' }}
37
+
38
+ {{ 'The user will execute tool-calls and return responses from tool(s) in this format:\n' }}
39
+ {{ '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>\n\n' }}
40
+
41
+ {{ 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.\n' }}
42
+ {%- endif -%}
43
+
44
+ {%- elif message['role'] == 'user' -%}
45
+ {{ '<SPECIAL_11>User\n' }}
46
+ {%- if message['content'] is string -%}
47
+ {{ message['content'].replace('/think', '').replace('/no_think', '').strip() }}
48
+ {%- else -%}
49
+ {%- for content in message['content'] -%}
50
+ {%- if content['type'] == 'image' -%}
51
+ {{ '' }}
52
+ {%- elif content['type'] == 'text' -%}
53
+ {{ content['text'].replace('/think', '').replace('/no_think', '').strip() }}
54
+ {%- endif -%}
55
+ {%- endfor -%}
56
+ {%- endif -%}
57
+ {{ '\n' }}
58
+
59
+ {%- elif message['role'] == 'tool' -%}
60
+ {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
61
+ {{ '<SPECIAL_11>User\n<TOOL_RESPONSE>[' }}
62
+ {%- endif -%}
63
+
64
+ {{ message.content }}
65
+ {{ ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' }}
66
+
67
+ {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
68
+ {{ ']</TOOL_RESPONSE>\n' }}
69
+ {%- endif -%}
70
+
71
+ {%- elif message['role'] == 'assistant' -%}
72
+ {%- if '</think>' in content -%}
73
+ {%- set content = content.split('</think>')[1].strip() -%}
74
+ {%- endif -%}
75
+
76
+ {{ '<SPECIAL_11>Assistant\n' + content.strip() }}
77
+
78
+ {%- if message.tool_calls -%}
79
+ {%- if content.strip() != '' -%}
80
+ {{ '\n\n' }}
81
+ {%- endif -%}
82
+
83
+ {{ '<TOOLCALL>[' }}
84
+ {%- for call in message.tool_calls -%}
85
+ {%- set fn = call.function if call.function is defined else call -%}
86
+ {{ '{"name": "' + fn.name + '", "arguments": ' }}
87
+ {%- if fn.arguments is string -%}
88
+ {{- fn.arguments -}}
89
+ {%- else -%}
90
+ {{- fn.arguments | tojson -}}
91
+ {%- endif -%}
92
+ {{ '}' + (', ' if not loop.last else '') }}
93
+ {%- endfor -%}
94
+ {{ ']</TOOLCALL>' }}
95
+ {%- endif -%}
96
+ {{ '\n<SPECIAL_12>\n' }}
97
+ {%- endif -%}
98
+ {%- endfor -%}
99
+
100
+ {%- set ns = namespace(enable_thinking=true) -%}
101
+ {%- for message in messages -%}
102
+ {%- set content = message['content'] -%}
103
+ {%- if message['role'] == 'user' or message['role'] == 'system' -%}
104
+ {%- if '/think' in content -%}
105
+ {%- set ns.enable_thinking = true -%}
106
+ {%- elif '/no_think' in content -%}
107
+ {%- set ns.enable_thinking = false -%}
108
+ {%- endif -%}
109
+ {%- endif -%}
110
+ {%- endfor -%}
111
+
112
+ {%- if not add_generation_prompt is defined -%}
113
+ {%- set add_generation_prompt = false -%}
114
+ {%- endif -%}
115
+ {%- if add_generation_prompt -%}
116
+ {{ '<SPECIAL_11>Assistant\n' }}
117
+ {{ '<think></think>' }}
118
+ {%- endif -%}
preprocessor_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "image_processor_type": "NemotronNanoVLV2ImageProcessor",
3
+ "auto_map": {
4
+ "AutoImageProcessor": "image_processing.NemotronNanoVLV2ImageProcessor",
5
+ "AutoVideoProcessor": "video_processing.NemotronNanoVLV2VideoProcessor",
6
+ "AutoProcessor": "processing.NemotronNanoVLV2Processor"
7
+ },
8
+ "image_size": 512,
9
+ "patch_size": 16,
10
+ "downsample_ratio": 0.5,
11
+ "max_num_tiles": 12,
12
+ "use_thumbnail": true,
13
+ "norm_mean": [0.48145466, 0.4578275, 0.40821073],
14
+ "norm_std": [0.26862954, 0.26130258, 0.27577711]
15
+ }
privacy.md ADDED
@@ -0,0 +1,13 @@
1
+ Field | Response
2
+ :----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
3
+ Generatable or reverse engineerable personal data? | No
4
+ Personal data used to create this model? | No
5
+ Was consent obtained for any personal data used? | Not Applicable
6
+ A description of any methods implemented in data acquisition or processing, if any, to address the prevalence of personal data in the training data, where relevant and applicable. | We used only prompts that do not contain any personal data for synthetic data generation.
7
+ How often is dataset reviewed? | Before release and during dataset creation and model training <br><br>
8
+ Is there provenance for all datasets used in training? | Yes
9
+ Does data labeling (annotation, metadata) comply with privacy laws? | Yes
10
+ Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
11
+ Applicable Privacy Policy | [Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/)
12
+ During AI model development, strict adherence to copyright policy ensured compliance through risk mitigation and legal reviews. Post-data collection, reserved rights content is identified and removed, with verified opt-out processes for rightsholders. Detailed records document due diligence and transparency.
13
+ We employ automated tools and data processing techniques to scan for Personally Identifiable Information (PII) during pre-training to identify and filter certain categories of personal information, including public-facing contact details such as email addresses and phone numbers. Scans of Common Crawl, CC-News, and Wikimedia datasets did not detect PII in the majority of samples. However, Microsoft Presidio indicated potential findings including business contact information embedded in natural language, such as email addresses and phone numbers. Verified instances of PII were removed through a combination of automated filtering and human-in-the-loop validation.
processing.py ADDED
@@ -0,0 +1,261 @@
1
+ from typing import Optional, Union, List
2
+
3
+ import numpy as np
4
+
5
+ from transformers.feature_extraction_utils import BatchFeature
6
+ from transformers.image_utils import ImageInput
7
+ from transformers.processing_utils import ImagesKwargs, MultiModalData, ProcessingKwargs, ProcessorMixin, Unpack, VideosKwargs
8
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
9
+ from transformers.video_utils import VideoInput
10
+
11
+
12
+ class NemotronNanoVLV2ImagesKwargs(ImagesKwargs):
13
+ min_pixels: Optional[int]
14
+ max_pixels: Optional[int]
15
+ patch_size: Optional[int]
16
+ temporal_patch_size: Optional[int]
17
+ merge_size: Optional[int]
18
+
19
+
20
+ class NemotronNanoVLV2ProcessorKwargs(ProcessingKwargs, total=False):
21
+ images_kwargs: NemotronNanoVLV2ImagesKwargs
22
+ videos_kwargs: VideosKwargs
23
+ _defaults = {
24
+ "text_kwargs": {
25
+ "padding": False,
26
+ },
27
+ }
28
+
29
+
30
+ class NemotronNanoVLV2Processor(ProcessorMixin):
31
+ r"""
32
+ Constructs a Nemotron Nano VL V2 processor which wraps an image processor and a tokenizer into a single processor.
33
+ [`NemotronNanoVLV2Processor`] offers all the functionalities of the image processor and tokenizer. See the
34
+ [`~NemotronNanoVLV2Processor.__call__`] and [`~NemotronNanoVLV2Processor.decode`] for more information.
35
+ Args:
36
+ image_processor ([`AutoImageProcessor`], *optional*):
37
+ The image processor is a required input.
38
+ tokenizer ([`AutoTokenizer`], *optional*):
39
+ The tokenizer is a required input.
40
+ chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
41
+ in a chat into a tokenizable string.
42
+ """
43
+
44
+ attributes = ["image_processor", "tokenizer"]
45
+
46
+ image_processor_class = "AutoImageProcessor"
47
+ video_processor_class = "AutoVideoProcessor"
48
+ tokenizer_class = ("AutoTokenizer")
49
+
50
+ def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
51
+ self.image_token = "<image>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
52
+ self.video_token = "<video>" if not hasattr(tokenizer, "video_token") else tokenizer.video_token
53
+ self.image_start_token = "<img>" if not hasattr(tokenizer, "image_start_token") else tokenizer.image_start_token
54
+ self.image_end_token = "</img>" if not hasattr(tokenizer, "image_end_token") else tokenizer.image_end_token
55
+ self.image_token_id = (
56
+ tokenizer.image_token_id
57
+ if getattr(tokenizer, "image_token_id", None)
58
+ else tokenizer.convert_tokens_to_ids(self.image_token)
59
+ )
60
+ self.video_token_id = (
61
+ tokenizer.video_token_id
62
+ if getattr(tokenizer, "video_token_id", None)
63
+ else tokenizer.convert_tokens_to_ids(self.video_token)
64
+ )
65
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
66
+
67
+ def __call__(
68
+ self,
69
+ images: ImageInput = None,
70
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
71
+ videos: VideoInput = None,
72
+ **kwargs: Unpack[NemotronNanoVLV2ProcessorKwargs],
73
+ ) -> BatchFeature:
74
+ """
75
+ Main method to prepare multimodal inputs (text, images, videos) for the model. This method processes text by
76
+ replacing image/video tokens with appropriate placeholder sequences, processes images and videos through the
77
+ image processor, and tokenizes the final text.
78
+
79
+ The method performs the following key operations:
80
+ 1. Processes images using the image processor to get pixel values and patch counts
81
+ 2. Processes videos using the image processor with max_num_tiles=1 to get video pixel values
82
+ 3. Replaces `<image>` tokens in text with `<img>` + image tokens + `</img>` sequences
83
+ 4. Replaces `<video>` tokens in text with frame-by-frame descriptions including timestamps (if metadata provided)
84
+ 5. Tokenizes the processed text and combines all outputs
85
+
86
+ Args:
87
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`, *optional*):
88
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
89
+ tensor. Both channels-first and channels-last formats are supported.
90
+ text (`str`, `List[str]`, *optional*):
91
+ The sequence or batch of sequences to be encoded. Each sequence should be a string. The text can contain
92
+ special tokens `<image>` and `<video>` that will be replaced with appropriate token sequences.
93
+ videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`, *optional*):
94
+ The video or batch of videos to be prepared. Each video should be a 4D NumPy array or PyTorch
95
+ tensor with shape (num_frames, channels, height, width). Both channels-first and channels-last formats
96
+ are supported. Note: Currently only supports batch size of 1 for videos.
97
+ images_kwargs (`Dict`, *optional*):
98
+ Additional keyword arguments for image processing, including:
99
+ - `min_pixels` (`int`, *optional*): Minimum number of pixels for image processing
100
+ - `max_pixels` (`int`, *optional*): Maximum number of pixels for image processing
101
+ - `patch_size` (`int`, *optional*): Size of patches for image processing
102
+ - `temporal_patch_size` (`int`, *optional*): Size of temporal patches
103
+ - `merge_size` (`int`, *optional*): Size for merging patches
104
+ videos_kwargs (`Dict`, *optional*):
105
+ Additional keyword arguments for video processing, including:
106
+ - `video_metadata` (`VideoMetadata`, *optional*): Metadata containing fps information for timestamp calculation
107
+ text_kwargs (`Dict`, *optional*):
108
+ Additional keyword arguments for text tokenization, including:
109
+ - `return_tensors` (`str` or [`~utils.TensorType`], *optional*): Framework for returned tensors ('tf', 'pt', 'np', 'jax')
110
+ - `padding` (`bool`, *optional*): Whether to pad sequences (defaults to False)
111
+
112
+ Returns:
113
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
114
+
115
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
116
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
117
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
118
+ `None`).
119
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
120
+ - **num_patches** -- Number of patches per image. Returned when `images` is not `None`.
121
+ - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
122
+
123
+ Raises:
124
+ AssertionError: If videos are provided with batch size > 1 (not currently supported).
125
+
126
+ Note:
127
+ - Image tokens `<image>` in text are replaced with `<img>` + repeated image tokens + `</img>`
128
+ - Video tokens `<video>` in text are replaced with frame-by-frame descriptions
129
+ - When video metadata with fps is provided, frame descriptions include timestamps
130
+ - Videos are processed with max_num_tiles=1 regardless of the images setting
131
+ """
132
+ output_kwargs = self._merge_kwargs(
133
+ NemotronNanoVLV2ProcessorKwargs,
134
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
135
+ **kwargs,
136
+ )
137
+ image_inputs = videos_inputs = {}
138
+ if images is not None:
139
+ image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
140
+ image_num_patches = image_inputs["num_patches"]
141
+
142
+ if videos is not None:
143
+ orig_tiles = self.image_processor.max_num_tiles
144
+ self.image_processor.max_num_tiles = 1
145
+ videos_inputs = self.image_processor(images=videos, **output_kwargs["images_kwargs"])
146
+ self.image_processor.max_num_tiles = orig_tiles
147
+ video_num_patches = [sum(videos_inputs["num_patches"])]
148
+ videos_inputs["pixel_values_videos"] = videos_inputs["pixel_values"]
149
+ del videos_inputs["pixel_values"]
150
+
151
+ if not isinstance(text, list):
152
+ text = [text]
153
+
154
+ text = text.copy() # below lines change text in-place
155
+ if images is not None:
156
+ index = 0
157
+ for i in range(len(text)):
158
+ while self.image_token in text[i]:
159
+ text[i] = text[i].replace(self.image_token, self.image_start_token + "<|placeholder|>" * image_num_patches[index] * self.image_processor.num_image_token + self.image_end_token, 1)
160
+ index += 1
161
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
162
+ if videos is not None:
163
+ assert len(text) == 1, "Video is not supported for batch size > 1"
164
+ video_metadata = output_kwargs.get("videos_kwargs", {}).get("video_metadata", None)
165
+ i = 0
166
+ index = 0
167
+ if self.video_token in text[i]:
168
+ each_frame = self.image_start_token + "<|placeholder|>" * self.image_processor.num_image_token + self.image_end_token
169
+ video_prompt = "This is a video:\n"
170
+ for j in range(video_num_patches[index]):
171
+ if video_metadata is not None and video_metadata.fps is not None:
172
+ timestamp = j / video_metadata.fps
173
+ video_prompt += f"Frame {j+1} sampled at {timestamp:.2f} seconds: {each_frame}\n"
174
+ else:
175
+ # Fallback to original format without timestamps
176
+ video_prompt += f"Frame {j+1}: {each_frame}\n"
177
+
178
+ text[i] = text[i].replace(self.video_token, video_prompt, 1)
179
+ text[i] = text[i].replace("<|placeholder|>", self.video_token)
180
+
181
+ return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
182
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
183
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
184
+
185
+ def _get_num_multimodal_tokens(self, image_sizes=None, video_sizes=None, **kwargs):
186
+ """
187
+ Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.
188
+ Args:
189
+ image_sizes (`list[list[int]]`, *optional*):
190
+ The input sizes formatted as (height, width) per each image.
191
+ video_sizes (`list[list[int]]`, *optional*):
192
+ The input sizes formatted as (num_frames, height, width) per each video.
193
+ Returns:
194
+ `MultiModalData`: A `MultiModalData` object holding number of tokens per each of the provided
195
+ input modalities, along with other useful data.
196
+ """
197
+
198
+ vision_data = {}
199
+ if image_sizes is not None:
200
+ images_kwargs = NemotronNanoVLV2ProcessorKwargs._defaults.get("images_kwargs", {})
201
+ images_kwargs.update(kwargs)
202
+ merge_size = images_kwargs.get("merge_size", None) or self.image_processor.merge_size
203
+
204
+ num_image_patches = [
205
+ self.image_processor.get_number_of_image_patches(*image_size, images_kwargs)
206
+ for image_size in image_sizes
207
+ ]
208
+ num_image_tokens = [(num_patches // merge_size**2) for num_patches in num_image_patches]
209
+ vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
210
+ return MultiModalData(**vision_data)
211
+
212
+ def batch_decode(self, *args, **kwargs):
213
+ """
214
+ This method forwards all its arguments to the tokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
215
+ refer to the docstring of this method for more information.
216
+ """
217
+ return self.tokenizer.batch_decode(*args, **kwargs)
218
+
219
+ def decode(self, *args, **kwargs):
220
+ """
221
+ This method forwards all its arguments to the tokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
222
+ the docstring of this method for more information.
223
+ """
224
+ return self.tokenizer.decode(*args, **kwargs)
225
+
226
+ def post_process_image_text_to_text(
227
+ self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
228
+ ):
229
+ """
230
+ Post-process the output of the model to decode the text.
231
+
232
+ Args:
233
+ generated_outputs (`torch.Tensor` or `np.ndarray`):
234
+ The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
235
+ or `(sequence_length,)`.
236
+ skip_special_tokens (`bool`, *optional*, defaults to `True`):
237
+ Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
238
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
239
+ Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
240
+ **kwargs:
241
+ Additional arguments to be passed to the tokenizer's `batch_decode method`.
242
+
243
+ Returns:
244
+ `list[str]`: The decoded text.
245
+ """
246
+ return self.tokenizer.batch_decode(
247
+ generated_outputs,
248
+ skip_special_tokens=skip_special_tokens,
249
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
250
+ **kwargs,
251
+ )
252
+
253
+ @property
254
+ def model_input_names(self):
255
+ tokenizer_input_names = self.tokenizer.model_input_names
256
+ image_processor_input_names = self.image_processor.model_input_names
257
+ names_from_processor = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
258
+ return names_from_processor + ["second_per_grid_ts"]
259
+
260
+
261
+ __all__ = ["NemotronNanoVLV2Processor"]
processing_utils.py ADDED
@@ -0,0 +1,83 @@
1
+ from typing import List, Optional, Union, Any, Dict
2
+
3
+ from PIL import Image
4
+ import torch
5
+ from transformers.image_processing_base import BatchFeature
6
+ from transformers.image_processing_utils_fast import BaseImageProcessorFast, divide_to_patches
7
+ from transformers.image_utils import (make_list_of_images, get_image_size,
8
+ get_image_type, ImageInput, ImageType, ChannelDimension)
9
+ from transformers.utils import TensorType
10
+ import torchvision.transforms as T
11
+
12
+
13
+ def get_internvl_target_ratios(
14
+ min_num: int,
15
+ max_num: int,
16
+ ) -> list[tuple[int, int]]:
17
+ target_ratios = {(i, j)
18
+ for n in range(min_num, max_num + 1)
19
+ for i in range(1, n + 1)
20
+ for j in range(1, n + 1) if min_num <= i * j <= max_num}
21
+ return sorted(target_ratios, key=lambda x: x[0] * x[1])
22
+
23
+
24
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
25
+ best_factor = float('-inf')
26
+ best_ratio = (1, 1)
27
+ area = width * height
28
+ for ratio in target_ratios:
29
+ target_aspect_ratio = ratio[0] / ratio[1]
30
+ factor_based_on_area_n_ratio = min(
31
+ (ratio[0]*ratio[1]*image_size*image_size)/ area, 0.6
32
+ )* min(
33
+ target_aspect_ratio/aspect_ratio, aspect_ratio/target_aspect_ratio)
34
+ if factor_based_on_area_n_ratio > best_factor:
35
+ best_factor = factor_based_on_area_n_ratio
36
+ best_ratio = ratio
37
+ return best_ratio
38
+
39
+
40
+ def calculate_targets(
41
+ orig_width: int,
42
+ orig_height: int,
43
+ target_ratios: list[tuple[int, int]],
44
+ image_size: int,
45
+ ) -> tuple[int, int, int]:
46
+ aspect_ratio = orig_width / orig_height
47
+
48
+ # find the closest aspect ratio to the target
49
+ target_aspect_ratio = find_closest_aspect_ratio(
50
+ aspect_ratio,
51
+ target_ratios,
52
+ width=orig_width,
53
+ height=orig_height,
54
+ image_size=image_size,
55
+ )
56
+
57
+ # calculate the target width and height
58
+ target_width = image_size * target_aspect_ratio[0]
59
+ target_height = image_size * target_aspect_ratio[1]
60
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
61
+
62
+ return blocks, target_width, target_height
63
+
64
+
65
+ def dynamic_preprocess(image, image_size=512, max_num_tiles=12, use_thumbnail=True):
66
+ orig_height, orig_width = get_image_size(image, channel_dim=ChannelDimension.FIRST)
67
+ target_ratios = get_internvl_target_ratios(1, max_num_tiles)
68
+
69
+ blocks, target_width, target_height = calculate_targets(
70
+ orig_width,
71
+ orig_height,
72
+ target_ratios,
73
+ image_size
74
+ )
75
+ # resize the image
76
+ resized_img = T.Resize((target_width, target_height), interpolation=T.InterpolationMode.BICUBIC)(image)
77
+ patches = divide_to_patches(resized_img, image_size)
78
+ assert len(patches) == blocks
79
+ if use_thumbnail and len(patches) != 1:
80
+ thumbnail_img = T.Resize((image_size, image_size), interpolation=T.InterpolationMode.BICUBIC)(image)
81
+ patches.append(thumbnail_img)
82
+
83
+ return patches
safety.md ADDED
@@ -0,0 +1,10 @@
1
+ Field | Response
2
+ :---------------------------------------------------|:----------------------------------
3
+ Model Application Field(s): | Customer Service, Media & Entertainment, Enterprise Document Intelligence and Processing & Retail
4
+ Describe the life critical impact (if present). | Not Applicable
5
+ Description of methods implemented in data acquisition or processing, if any, to address other types of potentially harmful data in the training, testing, and validation data: | We used a guard model for content safety to exclude potentially harmful data from training.
6
+ Description of any methods implemented in data acquisition or processing, if any, to address illegal or harmful content in the training data, including, but not limited to, child sexual abuse material (CSAM) and non-consensual intimate imagery (NCII) | We used a Gemma-3 4B-based guard model trained on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) for content safety to exclude potentially illegal or harmful content from the training. We also did CSAM checks on our image datasets for training.
7
+ Use Case Restrictions: | Use of this model is governed by the [ NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
8
+ Model and dataset restrictions: | The principle of least privilege (PoLP) is applied to limit access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to.
9
+ This AI model was developed based on our policies to ensure responsible data handling and risk mitigation. The datasets used for training have been scanned for harmful content and illegal content, consistent with our policies including scanning for Child Sexual Abuse Material (CSAM). Ongoing review and monitoring mechanisms are in place based on our policies and to maintain data integrity.
10
+ Because the model was optimized explicitly for instruction following, it is more susceptible to prompt injection and jailbreaking in various forms. It should therefore be paired with additional rails or system filtering to limit exposure to instructions from malicious sources -- whether received directly or indirectly via retrieval (e.g. by visiting a website) -- as such instructions may yield outputs that lead to harmful, system-level outcomes, up to and including remote code execution in agentic systems, when effective security controls including guardrails are not in place. The model may also generate answers that are inaccurate, omit key information, include irrelevant or redundant text, or contain socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<SPECIAL_12>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "unk_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:db8e35444fca3a2b98e2c8e927a8f1d8b1ba9d4b349e13ce5aafdb11b6404205
3
+ size 17079976
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
video_io.py ADDED
@@ -0,0 +1,176 @@
1
+ import os
2
+ import base64
3
+ import mimetypes
4
+ from PIL import Image
5
+ import io
6
+ from transformers.video_utils import VideoMetadata
7
+
8
+
9
+ def encode_pil_to_jpeg_data_url(pil_image):
10
+ from io import BytesIO
11
+ buf = BytesIO()
12
+ pil_image.save(buf, format="JPEG")
13
+ b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
14
+ return f"data:image/jpeg;base64,{b64}"
15
+
16
+
17
+ def sample_video_frames_to_data_urls(video_path_local, fps=1, nframe=0, nframe_max=-1):
18
+ """
19
+ Sample frames from a video and return base64-encoded data URLs along with metadata.
20
+
21
+ Args:
22
+ video_path_local: Path to the video file
23
+ fps: Target frames per second for sampling (if > 0, uses fps-based sampling)
24
+ nframe: Number of frames to sample (used if fps <= 0)
25
+ nframe_max: Maximum number of frames to sample
26
+
27
+ Returns:
28
+ tuple: (frame_data_urls, metadata)
29
+ - frame_data_urls: List of base64-encoded frame images
30
+ - metadata: VideoMetadata dataclass containing info about the sampled frames:
31
+ - total_num_frames: Number of sampled frames
32
+ - fps: Effective frame rate of the sampled frames
33
+ - duration: Duration covered by the sampled frames (in seconds)
34
+ - video_backend: Backend used for video processing ('decord')
35
+ """
36
+ import numpy as np
37
+ from PIL import Image
38
+ import decord
39
+
40
+ vid = decord.VideoReader(video_path_local)
41
+ total_frames = len(vid)
42
+ video_fps = vid.get_avg_fps()
43
+ total_duration = total_frames / max(1e-6, video_fps)
44
+
45
+ if fps > 0:
46
+ required_frames = int(total_duration * fps)
47
+ desired_frames = max(1, required_frames)
48
+ if nframe_max > 0 and desired_frames > nframe_max:
49
+ desired_frames = nframe_max
50
+ if desired_frames >= total_frames:
51
+ indices = list(range(total_frames))
52
+ elif desired_frames == 1:
53
+ indices = [0] # Always use first frame for single frame sampling
54
+ else:
55
+ # Generate evenly spaced indices and ensure uniqueness
56
+ raw_indices = np.linspace(0, total_frames - 1, desired_frames)
57
+ indices = list(np.unique(np.round(raw_indices).astype(int)))
58
+ else:
59
+ desired_frames = max(1, int(nframe) if nframe and nframe > 0 else 8)
60
+ if nframe_max > 0 and desired_frames > nframe_max:
61
+ desired_frames = nframe_max
62
+ if desired_frames >= total_frames:
63
+ indices = list(range(total_frames))
64
+ elif desired_frames == 1:
65
+ indices = [0] # Always use first frame for single frame sampling
66
+ else:
67
+ # Generate evenly spaced indices and ensure uniqueness
68
+ raw_indices = np.linspace(0, total_frames - 1, desired_frames)
69
+ indices = list(np.unique(np.round(raw_indices).astype(int)))
70
+
71
+ images = [Image.fromarray(vid[i].asnumpy()) for i in indices]
72
+ frame_urls = [encode_pil_to_jpeg_data_url(im) for im in images]
73
+
74
+ # Calculate timestamps for each sampled frame
75
+ timestamps = [float(idx) / video_fps for idx in indices]
76
+
77
+ # Calculate metadata for the sampled frames
78
+ sampled_num_frames = len(indices)
79
+
80
+ # Duration is the time span from first to last frame
81
+ if len(timestamps) > 1:
82
+ sampled_duration = timestamps[-1] - timestamps[0]
83
+ sampled_fps = (sampled_num_frames - 1) / sampled_duration if sampled_duration > 0 else 1.0
84
+ else:
85
+ # Single frame case
86
+ sampled_duration = None
87
+ sampled_fps = None
88
+
89
+ metadata = VideoMetadata(
90
+ total_num_frames=sampled_num_frames,
91
+ fps=sampled_fps,
92
+ duration=sampled_duration,
93
+ video_backend="decord",
94
+ )
95
+
96
+ return frame_urls, metadata
97
+
98
+
99
+ def maybe_path_or_url_to_data_urls(path_or_url, fps=1, nframe=0, nframe_max=-1):
100
+ """
101
+ Convert a path or URL to data URLs, handling videos, images, and remote files.
102
+
103
+ Args:
104
+ path_or_url: Path or URL to the media file
105
+ fps: Target frames per second for video sampling (if > 0, uses fps-based sampling)
106
+ nframe: Number of frames to sample from video (used if fps <= 0)
107
+ nframe_max: Maximum number of frames to sample
108
+
109
+ Returns:
110
+ tuple: (data_urls, metadata)
111
+ - data_urls: List of base64-encoded data URLs
112
+ - metadata: VideoMetadata dataclass with video metadata or None for images
113
+ """
114
+ val = str(path_or_url or "")
115
+ low = val.lower()
116
+
117
+ # Handle data URLs
118
+ if low.startswith("data:"):
119
+ if low.startswith("data:video/mp4"):
120
+ header, _, b64part = val.partition(",")
121
+ if not b64part:
122
+ return [val], None
123
+ import tempfile
124
+ tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
125
+ try:
126
+ tmp.write(base64.b64decode(b64part))
127
+ tmp.flush(); tmp.close()
128
+ return sample_video_frames_to_data_urls(tmp.name, fps=fps, nframe=nframe, nframe_max=nframe_max)
129
+ finally:
130
+ try:
131
+ os.unlink(tmp.name)
132
+ except Exception:
133
+ pass
134
+ return [val], None
135
+
136
+ # Remote URL
137
+ if low.startswith("http://") or low.startswith("https://"):
138
+ if low.endswith(".mp4"):
139
+ try:
140
+ import tempfile, urllib.request
141
+ with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmpf:
142
+ urllib.request.urlretrieve(val, tmpf.name)
143
+ local_path = tmpf.name
144
+ result = sample_video_frames_to_data_urls(local_path, fps=fps, nframe=nframe, nframe_max=nframe_max)
145
+ try:
146
+ os.unlink(local_path)
147
+ except Exception:
148
+ pass
149
+ return result
150
+ except Exception:
151
+ return [val], None
152
+ return [val], None
153
+
154
+ # Local path
155
+ if os.path.exists(val):
156
+ mime, _ = mimetypes.guess_type(val)
157
+ if mime and mime.startswith("image/"):
158
+ with open(val, "rb") as f:
159
+ b64 = base64.b64encode(f.read()).decode("utf-8")
160
+ return [f"data:{mime};base64,{b64}"], None
161
+ if mime == "video/mp4" or (mime is None and val.endswith(".mp4")):
162
+ return sample_video_frames_to_data_urls(val, fps=fps, nframe=nframe, nframe_max=nframe_max)
163
+ # Fallback: treat as binary image
164
+ with open(val, "rb") as f:
165
+ b64 = base64.b64encode(f.read()).decode("utf-8")
166
+ return [f"data:image/jpeg;base64,{b64}"], None
167
+
168
+ return [val], None
169
+
170
+
171
+ def pil_image_from_base64(b64_str: str) -> Image.Image:
172
+ # Handle data URLs like "data:image/png;base64,...."
173
+ if b64_str.startswith('data:'):
174
+ b64_str = b64_str.split(',', 1)[1]
175
+ img_bytes = base64.b64decode(b64_str)
176
+ return Image.open(io.BytesIO(img_bytes))
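+ 
+ # Example usage (illustrative sketch; 'sample.mp4' is an assumed local file):
+ #   frame_urls, metadata = maybe_path_or_url_to_data_urls("sample.mp4", fps=1, nframe_max=16)
+ #   frames = [pil_image_from_base64(url) for url in frame_urls]
+ #   # metadata.fps (when not None) can be forwarded to the processor so video frames get timestamped prompts.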