OLMo-3-7B-Instruct-NVFP4-1M

NVFP4 quantized version of allenai/Olmo-3-7B-Instruct with extended 1M token context support via linear RoPE scaling.

Model Description

This model is the NVFP4 (4-bit floating point) quantized version of OLMo-3-7B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs; Ada Lovelace architectures are also supported. The quantization was produced with NVIDIA's ModelOpt library and uses two-level scaling: a per-block E4M3 FP8 scale plus an FP32 per-tensor global scale.
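
To make the two-level scheme concrete, here is a minimal dequantization sketch for a single 16-element block (an illustration of the math only; the values and layout are assumptions, not ModelOpt's actual storage format):

import numpy as np

# One NVFP4 block: 16 weights stored as E2M1 FP4 codes, sharing an
# E4M3 FP8 block scale; a single FP32 scale covers the whole tensor.
fp4_values = np.array([0.5, -1.0, 1.5, 0.0] * 4)  # decoded E2M1 values (assumed)
block_scale = 0.0625    # per-block scale, stored as FP8 E4M3
global_scale = 2.0      # per-tensor scale, stored as FP32

dequantized = fp4_values * block_scale * global_scale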

Key Features

  • Base Model: allenai/Olmo-3-7B-Instruct (7.3B parameters)
  • Quantization Format: NVFP4 with group_size=16
  • Context Length: 1,048,576 tokens (1M) via linear RoPE scaling
  • Model Size: 5.30 GB (64% reduction from 14.60 GB)
  • GPU Memory: ~5.23 GiB (64% reduction)

Performance

Metric             Original     Quantized       Improvement
Model Size         14.60 GB     5.30 GB         64% reduction
GPU Memory         14.6 GB      5.23 GiB        64% reduction
Context Length     65,536       1,048,576       16x increase
Inference Speed    -            31-35 tok/s     -

Usage

Important: This model requires vLLM with ModelOpt quantization support. It cannot be loaded with the standard Hugging Face transformers library.

vLLM Server Deployment

python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
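
Once the server is running, it exposes an OpenAI-compatible API. A minimal client call (the endpoint and served model name match the command above):

import requests

# Query the OpenAI-compatible chat endpoint started above
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "OLMo-3-7B-NVFP4",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.6,
    },
)
print(response.json()["choices"][0]["message"]["content"])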

Python Usage with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
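
Because this is an instruction-tuned model, multi-turn prompts should go through the chat template. Recent vLLM releases expose this via llm.chat (a sketch, reusing the llm and sampling_params objects from above):

messages = [
    {"role": "user", "content": "Explain linear RoPE scaling in two sentences."}
]

# llm.chat applies the model's chat template before generating
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)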

Requirements

  • GPU: NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
  • vLLM: Latest version with ModelOpt support
  • Dependencies: pip install vllm transformers torchao
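
A quick way to verify the compute-capability requirement before loading the model:

import torch

# NVFP4 kernels need compute capability 8.9+ (Ada Lovelace) or newer (Blackwell)
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 9), f"Unsupported GPU: sm_{major}{minor}"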

Quantization Details

  • Algorithm: NVFP4 (4-bit floating point)
  • Calibration Dataset: allenai/c4 (2048 samples)
  • Calibration Length: 2048 tokens per sample
  • Tool: NVIDIA ModelOpt 0.39.0
  • Group Size: 16
  • Excluded Layers: lm_head
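
A quantization run along these lines can be sketched with ModelOpt's post-training quantization API (illustrative only: the config name follows the ModelOpt docs, and the calibration loop here is heavily simplified relative to the 2048-sample C4 run described above):

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")

# Stand-in calibration data; the real run used 2048 C4 samples of 2048 tokens
calib_texts = ["Example calibration text."]
calib_batches = [tokenizer(t, return_tensors="pt") for t in calib_texts]

def forward_loop(model):
    for batch in calib_batches:
        model(**batch)

# Apply NVFP4 post-training quantization (lm_head was excluded in this card's run)
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)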

Context Extension

The context window was extended from the base model's 65,536 tokens to 1,048,576 tokens using linear RoPE scaling:

  • Scaling Factor: 16x
  • rope_theta: 50,000,000
  • rope_scaling: {"type": "linear", "factor": 16.0}

Note: The actual usable context depends on available GPU memory. On a 120 GB GPU at 95% utilization, roughly 200,000 tokens fit in the KV cache.
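
The 200,000-token figure follows from standard KV-cache arithmetic. A back-of-envelope version (the layer and head counts below are illustrative assumptions for a 7B dense model, not values read from this model's config):

# Rough KV-cache budget; dims are assumptions for illustration
num_layers   = 32
num_kv_heads = 32     # assumes full multi-head attention (no GQA)
head_dim     = 128
bytes_per_el = 2      # bf16 K and V entries

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
free_bytes = (120 * 0.95 - 5.23) * 1024**3   # memory budget minus weights
max_tokens = free_bytes / kv_bytes_per_token
print(f"{kv_bytes_per_token / 2**20:.2f} MiB/token -> ~{max_tokens:,.0f} tokens")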

Architecture Compatibility

For vLLM compatibility, the model uses:

  • Architecture: Olmo2ForCausalLM
  • Model Type: olmo2

This mapping lets vLLM load the OLMo-3 checkpoint through its existing Olmo2 implementation.
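
The corresponding config.json entries would look like the following excerpt (illustrative, assembled from the values stated in this card):

{
  "architectures": ["Olmo2ForCausalLM"],
  "model_type": "olmo2",
  "max_position_embeddings": 1048576,
  "rope_theta": 50000000,
  "rope_scaling": {"type": "linear", "factor": 16.0}
}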

Limitations

  • Requires vLLM with --quantization modelopt flag
  • Cannot be loaded with standard transformers
  • Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
  • Maximum usable context limited by GPU memory for KV cache

Intended Use

  • Long-context instruction following and chat
  • Document analysis and summarization
  • Code generation and review
  • Research and educational purposes

License

Apache 2.0 (inherited from base model)

Citation

@misc{olmo3-nvfp4-1m,
  author = {Ex0bit},
  title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
