🧠 AceNemotron-14B 8-Bit Quantized Model

Welcome to the 8-bit quantized version of the AceNemotron-14B model! This release cuts memory usage and can speed up inference, making the model practical to deploy on lower-resource hardware with only a modest drop in quality.
📦 Model Details

  • Base Model: AceNemotron-14B
  • Quantization: 8-bit
  • Quantized With: bitsandbytes
  • Precision: Int8
  • Use Case: Faster inference, lower memory footprint, efficient fine-tuning
  • Uploader: mr-abhisharma

🛠 Installation

To use this model efficiently, make sure you have the required dependencies installed:

pip install transformers accelerate bitsandbytes
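
If you're unsure whether your environment is ready for 8-bit loading, here is a quick sanity check (a minimal sketch; bitsandbytes needs a CUDA-capable GPU for int8 inference):

# Sanity check: confirm the libraries import and a CUDA GPU is visible.
import torch
import transformers
import bitsandbytes

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())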

🚀 Usage

Here's a basic example using transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mr-abhisharma/AceNemotron-14B-8bit"

# Request 8-bit loading via a BitsAndBytesConfig; passing load_in_8bit
# directly to from_pretrained is deprecated in recent transformers releases.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

# Tokenize a prompt, move it to the model's device, and generate.
inputs = tokenizer("Once upon a time,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
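
To verify the savings once the model is loaded, transformers exposes a get_memory_footprint helper (the exact figure depends on your setup):

# Report roughly how much memory the quantized weights occupy.
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")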

🧠 Why Quantized?

Quantizing to 8-bit drastically reduces the memory needed for the weights and can speed up inference, especially on consumer-grade GPUs (e.g., 16 GB of VRAM or less); a rough estimate follows the list. It's great for:

  • Personal use and experimentation
  • Running on laptops or single-GPU setups
  • Cost-effective inference deployment
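
As a back-of-the-envelope sketch of the savings (weights only; real usage adds overhead for activations and the KV cache):

# Rough weight-memory estimate for a ~14B-parameter model.
params = 14e9
print(f"fp16: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes per parameter
print(f"int8: ~{params * 1 / 1e9:.0f} GB")  # 1 byte per parameter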

🔧 Limitations

  • Expect a slight drop in output quality compared to the full-precision model.
  • This model inherits the limitations of the base AceNemotron-14B model.
  • Use responsibly, and ensure compliance with the base model's license and intended use.

🤝 Acknowledgements

Thanks to the authors of the base AceNemotron-14B model and to the bitsandbytes maintainers for the quantization tooling.