---
license: apache-2.0
base_model: ibm-granite/granite-8b-code-instruct-4k
tags:
- fp8
- quantized
- code
- granite
- ibm
- llmcompressor
- vllm
library_name: transformers
pipeline_tag: text-generation
---

# granite-8b-code-instruct-4k-FP8

This is an FP8 quantized version of [granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) for efficient inference.

## Model Description

- **Base Model:** [granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k)
- **Quantization:** FP8 (E4M3 format)
- **Quantization Method:** llmcompressor oneshot with FP8 scheme
- **Calibration Dataset:** open_platypus (512 samples)
- **Quantization Time:** 21.6 minutes

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-8b-code-instruct-4k-FP8",
    torch_dtype="auto",       # resolve dtypes from the checkpoint's quantization config
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-8b-code-instruct-4k-FP8")

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With vLLM (recommended for production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="TevunahAi/granite-8b-code-instruct-4k-FP8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Write a Python function to calculate fibonacci numbers:"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Quantization Details

- **Target Layers:** all Linear layers except `lm_head`
- **Precision:** FP8 (E4M3 format)
- **Hardware Requirements:** NVIDIA Ada Lovelace or Hopper GPUs (native FP8), or Ampere with emulation

### Quantization Infrastructure

Quantized on professional hardware to ensure quality and reliability:

- **CPUs:** Dual Intel Xeon Max 9480 (224 threads, 128 GB HBM2e)
- **GPU:** NVIDIA RTX 5000 Ada Generation (32 GB VRAM) with native FP8 support
- **Memory:** 256 GB DDR5 + 128 GB HBM2e = 384 GB total
- **Software:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor

## License

Apache 2.0 (same as the original model)

## Credits

- Original model by [IBM Granite](https://huggingface.co/ibm-granite)
- Quantized by [TevunahAi](https://huggingface.co/TevunahAi)
- Quantization powered by [llm-compressor](https://github.com/vllm-project/llm-compressor)
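
## Reproducing the Quantization (sketch)

The exact quantization script is not included in this repository. The sketch below follows the standard llm-compressor oneshot FP8 example and matches the settings listed under Model Description (FP8 scheme, open_platypus calibration set, 512 samples); `max_seq_length` and the import paths are assumptions based on the llm-compressor documentation and may differ from the script actually used.

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 (E4M3) quantization of all Linear layers, skipping lm_head
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="ibm-granite/granite-8b-code-instruct-4k",
    dataset="open_platypus",            # calibration dataset named in the model card
    recipe=recipe,
    num_calibration_samples=512,        # sample count named in the model card
    max_seq_length=2048,                # assumed calibration sequence length
    output_dir="granite-8b-code-instruct-4k-FP8",
)
```

The resulting `output_dir` can be loaded directly with Transformers or vLLM as shown in the Usage section above.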