---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---

# Building and Running the Experimental `minimax` Branch of `llama.cpp`

**Note:**
This setup is experimental. The model requires the `minimax` branch and will not run on standard `llama.cpp`. Use it only for testing GGUF models with experimental features.

---

## System Requirements (any supported platform works; these are Ubuntu build commands)

- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake
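
Before building, it can be worth confirming the prerequisites are in place. The checks below assume the NVIDIA driver and CMake are already installed; `nvidia-smi` ships with the driver.

```bash
# Optional sanity checks for the prerequisites listed above
nvidia-smi        # lists the GPU and the installed driver version
cmake --version   # prints the installed CMake version
```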

---

## Installation Steps

### 1. Install CUDA Toolkit 12.8

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
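
The repository URL above is specific to Ubuntu 22.04 on x86_64; other releases use a different path under the same CUDA repository tree. On a typical install the toolkit lands in `/usr/local/cuda-12.8`, usually with a `/usr/local/cuda` link that step 2 relies on; an optional check:

```bash
# Verify the toolkit location that step 2 expects (paths assume a default install)
ls -l /usr/local/cuda
/usr/local/cuda/bin/nvcc --version
```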

### 2. Set Environment Variables

```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
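
These exports only affect the current shell session. One way to make them persistent, assuming bash, is to append them to `~/.bashrc`:

```bash
# Persist the CUDA environment variables for future shells (bash assumed)
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF
source ~/.bashrc
nvcc --version   # should now resolve and report CUDA 12.8
```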

### 3. Install Build Tools

```bash
sudo apt install cmake
```
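
The build also needs git and a C/C++ compiler. If they are not already on the system, the standard Ubuntu packages cover both:

```bash
# Compiler toolchain (gcc/g++, make) and git, if not already installed
sudo apt install build-essential git
```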

### 4. Clone the Experimental Branch

```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```
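
A quick way to confirm the checkout is on the expected branch (requires git 2.22 or newer for `--show-current`):

```bash
# Should print: minimax
git branch --show-current
```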

### 5. Build the Project

```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```

---

## Build Output

After the build is complete, the binaries will be located in:

```
llama.cpp/build/bin
```
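
As a smoke test, the freshly built server can print its build information. This assumes the standard llama.cpp `--version` flag and is run from the repository root:

```bash
# Print version and build info for the new binary
./build/bin/llama-server --version
```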

---

## Running the Model

Example command:

```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```

This configuration offloads the MoE expert weights to the CPU, so approximately 16 GB of VRAM is sufficient.
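
Once the server is running, it exposes llama.cpp's OpenAI-compatible HTTP API, by default on port 8080. A minimal request, assuming the default host and port:

```bash
# Minimal chat-completion request against the local server (default port 8080 assumed)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}]}'
```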

---

## Notes

- `--cpu-moe` enables CPU offloading for mixture-of-experts layers.
- `--jinja` enables the Jinja chat-template engine so the model's chat template is applied.
- Adjust `-c` (context length) and `-ngl` (GPU layers) according to your hardware; see the example after this list.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.
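
For example, on a GPU with less VRAM, a smaller context and partial GPU offload might look like the following; the values are illustrative only and should be tuned for your hardware:

```bash
# Illustrative lower-VRAM settings: smaller context, fewer GPU-offloaded layers
./llama-server -m minimax-m2-Q4_K.gguf -ngl 20 --cpu-moe --jinja -fa on -c 8192 --reasoning-format auto
```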

---

Once these steps are complete, the experimental CUDA-enabled build of `llama.cpp` is ready to use.