Practical performance feedback

#2
by maigonis

I have an i9-9900K CPU, 32 GB of RAM, and an NVIDIA GTX 1660 SUPER 6 GB GPU. This quantized version runs at a respectable 18 tokens per second (tps) with a 16K context window. I use the latest llama.cpp, built from source.

The original Q4 quantized model runs at just 2 tps, so the performance difference is massive.
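For anyone wanting to reproduce a similar setup, here is a minimal sketch using the llama-cpp-python bindings rather than the llama.cpp CLI the poster builds from source; the model filename, GPU-offload layer count, thread count, and prompt are placeholders, and the right number of offloaded layers depends on how much of the model fits in 6 GB of VRAM.

```python
# Minimal sketch: load a GGUF quantized model with a 16K context window and
# partial GPU offload, via llama-cpp-python (pip install llama-cpp-python).
# The model path, n_gpu_layers, and n_threads values below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q2_k_s.gguf",  # hypothetical filename
    n_ctx=16384,       # 16K context window, as in the report above
    n_gpu_layers=12,   # offload only as many layers as fit in 6 GB of VRAM
    n_threads=8,       # e.g. the physical core count of an i9-9900K
)

out = llm("Explain mixture-of-experts models in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```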

I also suggest that the Intel team clarify the quantization levels of the tensors in the model card. Here's what they look like for this model:

llama_model_loader: - type f32: 191 tensors
llama_model_loader: - type q8_0: 2 tensors
llama_model_loader: - type q2_K: 93 tensors
llama_model_loader: - type q4_K: 160 tensors

Many people hear "Q2" and assume most tensors are quantized to Q2, but as you can see, that's not the case: Q2 covers only a fraction of the tensors.

Keep up the great work! This lets me run models of up to ~100B parameters (like this one), whereas otherwise my system tops out at around 30B MoE (Mixture-of-Experts) models.

Thanks for the information.

1. The exact tensor count isn't very important, since the MoE layers are fused. I believe the FP32 tensors are mainly normalization layers, which run very fast and contain very few parameters. What really matters is the Linear layers. As you can see from the model sizes, the original model is about 220 GB while ours is around 35 GB, which means the bulk of the model has been quantized to 2 bits (the scale and min values are not negligible in q2_K, so we could not reach the ideal 16/2 compression ratio; see the quick calculation below).

2. We have clarified the mixed-bit configuration in the model card:

The embedding and lm-head layers fall back to 8-bit, and the non-expert layers fall back to 4-bit.  
Please refer to the section "Generate the Model" for more details.
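To make the size comparison in point 1 concrete, here is a quick back-of-the-envelope check; the 220 GB and 35 GB figures are the rounded numbers quoted above, and the roughly 2.6 bits-per-weight cost of q2_K (2-bit weights plus per-block scales and mins) is llama.cpp's nominal figure, not something measured here.

```python
# Effective bits per weight implied by the rounded sizes quoted in point 1:
# original ~220 GB at 16-bit vs. ~35 GB quantized.
orig_gb = 220.0    # original model, 16-bit weights
quant_gb = 35.0    # mixed 2-bit GGUF

bits_per_weight = 16.0 * quant_gb / orig_gb
print(f"effective bits per weight: {bits_per_weight:.2f}")  # ~2.5

# q2_K itself costs about 2.6 bits per weight once per-block scales and mins
# are counted, and a few tensors stay at 4 or 8 bits, so the ideal 16/2 = 8x
# compression ratio cannot be reached.
compression_ratio = orig_gb / quant_gb
print(f"compression ratio: {compression_ratio:.1f}x")  # ~6.3x vs. the ideal 8x
```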
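For readers curious how such a mixed-bit recipe can be expressed, below is a hedged sketch using AutoRound's layer_config argument. The checkpoint path, layer names, group size, and export format are illustrative placeholders, not the team's actual command; the authoritative recipe is the one in the model card's "Generate the Model" section.

```python
# Illustrative sketch of a mixed-bit AutoRound recipe: 2-bit by default, with
# selected layers falling back to higher precision as described above.
# Layer names, group size, and export format are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "path/to/original-model"  # hypothetical checkpoint path
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Per-layer overrides: 8-bit for lm_head, 4-bit for a placeholder non-expert
# layer (the embedding fallback is handled separately and omitted here).
layer_config = {
    "lm_head": {"bits": 8},
    "model.layers.0.self_attn.q_proj": {"bits": 4},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=2,            # default for the remaining (expert) Linear layers
    group_size=128,    # placeholder group size
    layer_config=layer_config,
)
autoround.quantize()
autoround.save_quantized("./quantized-model", format="auto_round")
```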
