---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-0804 (4bit, bitsandbytes)

This repository provides the bitsandbytes 4-bit (bnb-4bit) quantized weights for [mispeech/midashenglm-7b-0804-fp32](https://huggingface.co/mispeech/midashenglm-7b-0804-fp32).

**Note**: This is a basic 4-bit quantization using bitsandbytes.
For better performance and accuracy, we recommend our [GPTQ-quantized version](https://huggingface.co/mispeech/midashenglm-7b-0804-w4a16-gptq), which maintains higher quality while still providing significant memory savings.
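
To give a rough sense of the memory savings from 4-bit weights, here is a back-of-the-envelope sketch. The round figure of 7e9 parameters and the helper name `approx_weight_gib` are assumptions for illustration only; the estimate ignores quantization metadata, activations, and any modules that remain in higher precision, so real usage will be higher.

```python
def approx_weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB (weights only, no metadata or activations)."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: a round 7e9 parameters, used purely for illustration.
NUM_PARAMS = 7e9

for label, bits in [("fp32", 32), ("16-bit", 16), ("4-bit", 4)]:
    print(f"{label:>6}: ~{approx_weight_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights take about one eighth of the fp32 storage.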

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-0804-4bit-bnb"  # the GPTQ variant "mispeech/midashenglm-7b-0804-w4a16-gptq" is recommended instead
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
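
Loading bitsandbytes-quantized weights through `transformers` generally requires the `bitsandbytes` and `accelerate` packages to be installed. The following is a minimal preflight sketch; the helper name `missing_packages` and the exact package list are assumptions for illustration, not requirements stated by this model card.

```python
import importlib.util
from typing import List


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list for loading a bnb-4bit checkpoint.
REQUIRED = ["torch", "transformers", "bitsandbytes", "accelerate"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed dependencies are importable.")
```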

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)  (a NumPy waveform array; requires `import numpy as np`)
            },
        ],
    },
]
```
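
If you build prompts for many audio files, the message structure above can be wrapped in a small helper. This is only a sketch: `build_messages` is a hypothetical function, not part of the model's API, and it simply reproduces the dictionary layout shown above for either a local path or a URL.

```python
from typing import Any, Dict, List, Optional

DEFAULT_SYSTEM_PROMPT = "You are a helpful language and speech assistant."


def build_messages(
    user_prompt: str,
    audio_path: Optional[str] = None,
    audio_url: Optional[str] = None,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, Any]]:
    """Build a chat message list with one text prompt and one audio input."""
    # Exactly one audio source must be given.
    if (audio_path is None) == (audio_url is None):
        raise ValueError("Provide exactly one of audio_path or audio_url.")

    audio_item: Dict[str, Any] = {"type": "audio"}
    if audio_path is not None:
        audio_item["path"] = audio_path
    else:
        audio_item["url"] = audio_url

    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [{"type": "text", "text": user_prompt}, audio_item],
        },
    ]


messages = build_messages("Caption the audio.", audio_path="/path/to/example.wav")
```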

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # e.g. ["An engine is idling."]
```

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author     = {{Horizon Team, MiLM Plus}},
  institution= {Xiaomi Inc.},
  year       = {2025},
  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url        = {https://arxiv.org/abs/2508.03983},
  eprint     = {2508.03983},
}
```