Suggestion: an open-weight Gpt-Oss LLM between 20B and 120B parameters
(I understand that you won't release an LLM for every memory config mentioned.)
Request for LLMs for the following systems:
Without the need to offload to RAM:
- Gpt-Oss-20B (the full F16 version, 13.8 GB) already fits fully in a 16 GB VRAM GPU.
- An LLM for 24 GB VRAM GPUs: maybe a ~30-32B LLM?
- An LLM for 32 GB VRAM GPUs: maybe a ~40-44B LLM?
Personally, I don't care that much about running fully on the GPU, because consumer GPUs don't have enough VRAM to fit the big, good LLMs anyway. The following configurations, with offloading to RAM, are therefore more important to me (also, for many people it's good enough if token generation is at least as fast as reading speed, which is roughly 4-10 t/s):
With offloading to RAM:
- For gaming PCs with 32 GB RAM (32 GB will soon be more common than 16 GB, at least according to the Steam Hardware Survey) + a 12 GB VRAM GPU: maybe a ~64B LLM? (32 GB RAM + 12 GB VRAM - 8 GB for the OS etc. = 36 GB)
And, most important for me personally:
- For thin-and-light laptops (= no dGPU) with 32 GB of soldered LPDDR5X unified memory/RAM: maybe an LLM of the same size as for the 24 GB VRAM GPU. (32 GB RAM - 8 GB for the OS etc.)
- For thin-and-light laptops (= no dGPU) with 48 GB of soldered LPDDR5X unified memory/RAM: maybe a ~72B LLM. (48 GB - 8 GB for the OS etc.)
- For thin-and-light laptops (= no dGPU) with 64 GB of soldered LPDDR5X unified memory/RAM: maybe a ~100B LLM. (64 GB - 8 GB for the OS etc.)
My parameter-count estimates are rough (the conversion factor is derived from the 120B's file size), but you get the point: required memory [GB] ≈ parameters [B] / 2 × 1.09, or inverted, parameters [B] ≈ usable memory [GB] × 2 / 1.09. Context needs to be considered too (a full context window would require many more gigabytes, on top of the -8 GB for the OS etc.).
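For illustration, here is a minimal Python sketch of that sizing rule applied to the usable-memory figures from the list above. The 2/1.09 factor and the ~8 GB OS reservation are the ones from this post; everything else is rough, and the printed numbers land slightly above the sizes I suggested above, which leave a bit of extra headroom.

```python
# Rough sizing rule from this post, written out as code.
# Required memory [GB] ~= parameters [B] / 2 * 1.09  (factor derived from the 120B's file size)
# Inverted:            parameters [B] ~= usable memory [GB] * 2 / 1.09
# "Usable" already has ~8 GB subtracted for the OS etc.; a long context (KV cache)
# would need additional gigabytes on top of that, so treat these as optimistic upper bounds.

def max_params_b(usable_mem_gb: float) -> float:
    """Approximate largest model (in billions of parameters) that fits in usable_mem_gb."""
    return usable_mem_gb * 2 / 1.09

# Usable-memory figures from the list above (total minus ~8 GB for the OS etc.).
configs = {
    "32 GB RAM + 12 GB VRAM gaming PC": 32 + 12 - 8,
    "32 GB unified-memory laptop": 32 - 8,
    "48 GB unified-memory laptop": 48 - 8,
    "64 GB unified-memory laptop": 64 - 8,
}

for name, usable_gb in configs.items():
    print(f"{name}: {usable_gb} GB usable -> roughly {max_params_b(usable_gb):.0f}B parameters")
```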
If you plan to release one more LLM between the 20B and 120B sizes, please run a survey on which size people want next. If I had to choose one, it would probably be one for either a 48 GB or a 64 GB RAM/unified-memory system (= no GPU). Choosing an LLM for the 32 GB RAM/unified-memory system wouldn't make the jump from Gpt-Oss-20B big enough (20B -> 30B, meh).