---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

# MiDashengLM-7B-0804 (4bit, bitsandbytes)

The bnb-4bit weights for [mispeech/midashenglm-7b-0804-fp32](https://huggingface.co/mispeech/midashenglm-7b-0804-fp32).

**Note**: This is a basic 4-bit quantization using bitsandbytes. For better performance and accuracy, we recommend our [GPTQ-quantized version](https://huggingface.co/mispeech/midashenglm-7b-0804-w4a16-gptq), which maintains higher quality while still providing significant memory savings.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

# "mispeech/midashenglm-7b-0804-w4a16-gptq" is the recommended alternative
model_id = "mispeech/midashenglm-7b-0804-4bit-bnb"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```
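
As a supplement to the Construct Prompt step under Usage: the prompt example there lists three alternative keys for the audio entry (`"path"`, `"url"`, `"audio"`). The sketch below wraps that message structure in a small helper so the audio source can be swapped without retyping the nested dictionaries. The function name `build_messages` and its parameters are hypothetical and not part of the model's API; it only assumes the message layout shown in the Usage section.

```python
from typing import Any, Dict, List

# Audio keys taken from the comments in the Construct Prompt example.
ALLOWED_AUDIO_KEYS = ("path", "url", "audio")


def build_messages(
    user_prompt: str,
    audio_key: str,
    audio_value: Any,
    system_prompt: str = "You are a helpful language and speech assistant.",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: build the chat messages list used in Usage.

    audio_key selects how the audio is supplied: a local "path",
    a remote "url", or an in-memory "audio" array.
    """
    if audio_key not in ALLOWED_AUDIO_KEYS:
        raise ValueError(
            f"audio_key must be one of {ALLOWED_AUDIO_KEYS}, got {audio_key!r}"
        )
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "audio", audio_key: audio_value},
            ],
        },
    ]


# Example: same structure as the Construct Prompt step, using a local file.
messages = build_messages("Caption the audio.", "path", "/path/to/example.wav")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` exactly as in the Generate Output step. Rejecting unknown keys early is a design choice of this sketch, so that a typo in the key name fails before any model call.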