---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---

# Building and Running the Experimental `minimax` Branch of `llama.cpp`

**Note:**
This setup is experimental. The model requires the `minimax` branch and will not run on standard `llama.cpp`. Use it only for testing GGUF models with experimental features.

---

## System Requirements (any supported platform works; these are Ubuntu build commands)

- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake
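
Before building, it can be worth confirming the prerequisites are in place. The checks below assume the NVIDIA driver and CMake are already installed; `nvidia-smi` ships with the driver.

```bash
# Optional sanity checks for the prerequisites listed above
nvidia-smi        # lists the GPU and the installed driver version
cmake --version   # prints the installed CMake version
```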

---

## Installation Steps

### 1. Install CUDA Toolkit 12.8

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
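
The repository URL above is specific to Ubuntu 22.04 on x86_64; other releases use a different path under the same CUDA repository tree. On a typical install the toolkit lands in `/usr/local/cuda-12.8`, usually with a `/usr/local/cuda` link that step 2 relies on; an optional check:

```bash
# Verify the toolkit location that step 2 expects (paths assume a default install)
ls -l /usr/local/cuda
/usr/local/cuda/bin/nvcc --version
```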

### 2. Set Environment Variables

```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
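
These exports only affect the current shell session. One way to make them persistent, assuming bash, is to append them to `~/.bashrc`:

```bash
# Persist the CUDA environment variables for future shells (bash assumed)
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF
source ~/.bashrc
nvcc --version   # should now resolve and report CUDA 12.8
```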

### 3. Install Build Tools

```bash
sudo apt install cmake
```
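
The build also needs git and a C/C++ compiler. If they are not already on the system, the standard Ubuntu packages cover both:

```bash
# Compiler toolchain (gcc/g++, make) and git, if not already installed
sudo apt install build-essential git
```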

### 4. Clone the Experimental Branch

```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```
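
A quick way to confirm the checkout is on the expected branch (requires git 2.22 or newer for `--show-current`):

```bash
# Should print: minimax
git branch --show-current
```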

### 5. Build the Project

```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```

---

## Build Output

After the build is complete, the binaries will be located in:

```
llama.cpp/build/bin
```
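
As a smoke test, the freshly built server can print its build information. This assumes the standard llama.cpp `--version` flag and is run from the repository root:

```bash
# Print version and build info for the new binary
./build/bin/llama-server --version
```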

---

## Running the Model

Example command:

```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```

This configuration offloads the MoE expert weights to the CPU, so approximately 16 GB of VRAM is sufficient.
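
Once the server is running, it exposes llama.cpp's OpenAI-compatible HTTP API, by default on port 8080. A minimal request, assuming the default host and port:

```bash
# Minimal chat-completion request against the local server (default port 8080 assumed)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}]}'
```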

---

## Notes

- `--cpu-moe` enables CPU offloading for mixture-of-experts layers.
- `--jinja` enables the Jinja chat-template engine so the model's chat template is applied.
- Adjust `-c` (context length) and `-ngl` (GPU layers) according to your hardware; see the example after this list.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.
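
For example, on a GPU with less VRAM, a smaller context and partial GPU offload might look like the following; the values are illustrative only and should be tuned for your hardware:

```bash
# Illustrative lower-VRAM settings: smaller context, fewer GPU-offloaded layers
./llama-server -m minimax-m2-Q4_K.gguf -ngl 20 --cpu-moe --jinja -fa on -c 8192 --reasoning-format auto
```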

---

Once these steps are complete, the experimental CUDA-enabled build of `llama.cpp` is ready to use.