# OLMo-3-7B-Instruct-NVFP4-1M

NVFP4 quantized version of allenai/Olmo-3-7B-Instruct with extended 1M-token context support via linear RoPE scaling.
## Model Description

This model is the NVFP4 (4-bit floating point) quantized version of OLMo-3-7B-Instruct, optimized for NVIDIA GPUs with FP4 support, such as the Blackwell GB10 in DGX Spark systems, with Ada Lovelace architectures also supported. The quantization uses NVIDIA's ModelOpt library with two-level scaling: an E4M3 FP8 scale per 16-weight block plus an FP32 per-tensor global scale.
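To make the two-level scaling concrete, here is a minimal NumPy round-trip sketch. It is illustrative only, not ModelOpt's implementation: the real kernels pack 4-bit values and store block scales in FP8, while everything here stays in float, and all names are ours.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(w, group_size=16):
    """Quantize and dequantize one tensor with NVFP4-style two-level scaling."""
    blocks = w.reshape(-1, group_size)  # assumes w.size is a multiple of 16
    # Level 2: a single FP32 scale for the whole tensor (448 = E4M3 max).
    global_scale = np.abs(blocks).max() / (6.0 * 448.0) + 1e-12
    # Level 1: one scale per 16-element block, stored as E4M3 FP8 in practice.
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / (6.0 * global_scale) + 1e-12
    scaled = blocks / (block_scale * global_scale)
    # Snap each value to the nearest representable FP4 magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * block_scale * global_scale).reshape(w.shape)
```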
## Key Features
- Base Model: allenai/Olmo-3-7B-Instruct (7.3B parameters)
- Quantization Format: NVFP4 with group_size=16
- Context Length: 1,048,576 tokens (1M) via linear RoPE scaling
- Model Size: 5.30 GB (64% reduction from 14.60 GB)
- GPU Memory: ~5.23 GiB (64% reduction)
## Performance

| Metric | Original | Quantized | Change |
|---|---|---|---|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | 14.6 GB | 5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |
## Usage

**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.
### vLLM Server Deployment

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
  --quantization modelopt \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 200000 \
  --served-model-name 'OLMo-3-7B-NVFP4' \
  --host 0.0.0.0 \
  --port 8000
```
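Once the server is up, any OpenAI-compatible client can talk to it. A minimal request, assuming the default localhost:8000 endpoint and the served model name from the command above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="OLMo-3-7B-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```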
### Python Usage with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; quantization="modelopt" is required.
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
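Since this is an instruct model, chat-style calls are usually preferable. Recent vLLM versions expose `LLM.chat`, which applies the model's chat template; a short sketch reusing the `llm` and `sampling_params` objects from above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain linear RoPE scaling in two sentences."},
]
# LLM.chat applies the tokenizer's chat template before generating.
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)
```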
## Requirements

- GPU: NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- vLLM: Latest version with ModelOpt support
- Dependencies:

```bash
pip install vllm transformers torchao
```
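To confirm a GPU meets the 8.9+ floor before pulling in heavier dependencies, a quick check (assumes PyTorch with CUDA is already installed):

```python
import torch

# Ada Lovelace is compute capability 8.9; Blackwell is 10.x.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor}:",
      "OK for FP4" if (major, minor) >= (8, 9) else "not supported")
```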
## Quantization Details
- Algorithm: NVFP4 (4-bit floating point)
- Calibration Dataset: allenai/c4 (2048 samples)
- Calibration Length: 2048 tokens per sample
- Tool: NVIDIA ModelOpt 0.39.0
- Group Size: 16
- Excluded Layers: lm_head
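For reference, here is a minimal sketch of what such a calibration run looks like with ModelOpt's `mtq.quantize` API. This is not the exact script used for this checkpoint, and `calib_batches` is a hypothetical iterable of tokenized C4 samples:

```python
import copy

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-7B-Instruct", torch_dtype="auto", device_map="cuda"
)

# Start from the stock NVFP4 recipe and keep the output head in high precision.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

def forward_loop(m):
    # Stream calibration data (e.g. 2048 C4 samples, 2048 tokens each) through
    # the model so ModelOpt can collect activation statistics.
    for batch in calib_batches:  # hypothetical: tensors of input_ids on the GPU
        m(batch)

model = mtq.quantize(model, cfg, forward_loop)
```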
## Context Extension

The context was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:

- Scaling Factor: 16x
- rope_theta: 50,000,000
- rope_scaling: `{"type": "linear", "factor": 16.0}`

Note: The actual usable context depends on available GPU memory. With a 120 GB GPU at 95% utilization, approximately 200,000 tokens fit in the KV cache.
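That figure can be sanity-checked with a back-of-the-envelope estimate. The dimensions below (32 layers, 4096 hidden size, FP16 KV cache) are assumptions in line with 7B-class models; check the model's config.json for exact values:

```python
layers, hidden, dtype_bytes = 32, 4096, 2          # assumed model dimensions
kv_per_token = 2 * layers * hidden * dtype_bytes   # K and V caches per token
budget = 120e9 * 0.95 - 5.3e9                      # GPU budget minus weights
print(f"{kv_per_token} B/token -> ~{budget / kv_per_token:,.0f} tokens")
# ~207,000 tokens, consistent with the ~200,000 figure above
```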
## Architecture Compatibility

For vLLM compatibility, the model uses:

- Architecture: `Olmo2ForCausalLM`
- Model Type: `olmo2`

This mapping allows vLLM to properly load the OLMo-3 architecture.
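The mapping can be verified without loading any weights, since the config itself opens with plain transformers:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M", trust_remote_code=True
)
print(cfg.architectures, cfg.model_type)  # expect: ['Olmo2ForCausalLM'] olmo2
```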
## Limitations

- Requires vLLM with the `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context limited by GPU memory for KV cache
## Intended Use
- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes
## License
Apache 2.0 (inherited from base model)
## Citation

```bibtex
@misc{olmo3-nvfp4-1m,
  author       = {Ex0bit},
  title        = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```
## Acknowledgments
- Base model by Allen Institute for AI (Ai2)
- Quantization using NVIDIA ModelOpt
- Inference powered by vLLM