---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---
# Building and Running the Experimental `minimax` Branch of `llama.cpp`
**Note:**
This setup is experimental. These GGUF models require the `minimax` branch and will not load with standard `llama.cpp` builds. Use this branch only for testing GGUF models with experimental features.
---
## System Requirements (any supported platform works; the commands below target Ubuntu)
- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake
---
## Installation Steps
### 1. Install CUDA Toolkit 12.8
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
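To confirm the toolkit installed correctly, you can call `nvcc` directly (the path below assumes the default install prefix for the 12.8 toolkit):
```bash
# nvcc is not on PATH yet at this point, so use the full install path.
/usr/local/cuda-12.8/bin/nvcc --version
```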
### 2. Set Environment Variables
```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
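These exports only apply to the current shell session. If you want them to persist across logins, one option is to append them to `~/.bashrc` (a minimal sketch, assuming bash is your login shell):
```bash
# Persist the CUDA environment variables for future sessions.
cat >> ~/.bashrc << 'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF

# Reload the file and sanity-check that nvcc now resolves from PATH.
source ~/.bashrc
nvcc --version
```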
### 3. Install Build Tools
```bash
sudo apt install cmake
```
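On a minimal Ubuntu install you may also need a compiler toolchain and `git`; if they are not already present, something like the following should cover it:
```bash
# Compiler, make, and git are needed to clone and build the project.
sudo apt install build-essential git

# Confirm the tools are available.
cmake --version
gcc --version
```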
### 4. Clone the Experimental Branch
```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```
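Before building, it is worth confirming that the checkout really is on the experimental branch:
```bash
# Should print "minimax"; the model will not build from the standard branches.
git branch --show-current
```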
### 5. Build the Project
```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```
---
## Build Output
After the build is complete, the binaries will be located in:
```
llama.cpp/build/bin
```
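A quick way to confirm that the build produced the expected binaries (run from the `llama.cpp` repository root):
```bash
# The server and CLI binaries should both be present after a successful build.
ls build/bin/llama-server build/bin/llama-cli

# Print build information; a CUDA-enabled build will also report the detected GPU(s).
./build/bin/llama-server --version
```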
---
## Running the Model
Example command:
```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```
This configuration offloads the experts to the CPU, so approximately 16 GB of VRAM is sufficient.
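Once the server is running (it listens on port 8080 by default), you can send a quick request to the OpenAI-compatible chat endpoint to verify that generation works; the prompt here is only an illustration:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hello, who are you?"}
        ],
        "max_tokens": 128
      }'
```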
---
## Notes
- `--cpu-moe` enables CPU offloading for mixture-of-experts layers.
- `--jinja` activates the Jinja templating engine.
- Adjust `-c` (context length) and `-ngl` (GPU layers) according to your hardware.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory (see the download sketch below).
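If you do not have the GGUF file locally yet, one way to fetch it is with the Hugging Face CLI. The repository id and file name below are assumptions, so adjust them to the repository and quantization you actually want:
```bash
# Install the CLI, then download the quantized model into the current directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download cturan/MiniMax-M2-GGUF minimax-m2-Q4_K.gguf --local-dir .
```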
---
Once all steps are complete, the experimental CUDA-enabled build of `llama.cpp` is ready to use.