Milad Alizadeh committed · Commit 8d6eb55 (unverified) · Parent(s): Initial commit

Files changed:
- .gitattributes +35 -0
- README.md +208 -0
- config.json +48 -0
- model.safetensors +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,208 @@
---
license: cc-by-nc-sa-4.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- EarthSpeciesProject
- NatureLM
---

# Model Card for NatureLM-audio

NatureLM-audio is the first audio-language foundation model designed specifically for bioacoustics. It is trained on a diverse dataset of text-audio pairs spanning bioacoustics, speech, and music, enabling it to perform tasks such as species classification, detection, captioning, and life-stage classification. The model demonstrates strong generalization to unseen taxa and tasks, setting a new state of the art on several bioacoustics benchmarks.

## Model Details

### Model Description

NatureLM-audio is an audio-language model designed to address bioacoustic tasks such as species classification, detection, and captioning. It leverages a combination of bioacoustic, speech, and music data to learn robust representations that generalize across domains.

- **Developed by:** David Robinson, Marius Miron, Masato Hagiwara, Milad Alizadeh, Gagan Narula, Sara Keen, Benno Weck, Matthieu Geist, Olivier Pietquin (Earth Species Project)
- **Funded by:** More info at [https://www.earthspecies.org/about-us#support](https://www.earthspecies.org/about-us#support)
- **Shared by:** Earth Species Project
- **Model type:** Audio-language foundation model
- **Language(s) (NLP):** English
- **License:** CC-BY-NC-SA-4.0
- **Finetuned from model:** Llama-3.1-8B-Instruct, [fine-tuned BEATs_iter3+ (AS2M) (cpt2)](https://github.com/microsoft/unilm/tree/master/beats)

### Model Sources

- **Repository:** [https://github.com/earthspecies/naturelm-audio](https://github.com/earthspecies/naturelm-audio)
- **Paper:** [NatureLM-audio: An Audio-Language Foundation Model for Bioacoustics](https://arxiv.org/abs/2411.07186)
- **Demo:** [https://earthspecies.github.io/naturelm-audio-demo/](https://earthspecies.github.io/naturelm-audio-demo/)

## Uses

### Direct Use

NatureLM-audio can be used directly for bioacoustic tasks such as species classification, detection, and captioning. It is particularly useful for biodiversity monitoring, conservation, and animal behavior studies.

```python
from NatureLM.models import NatureLM

# Download the model from the Hugging Face Hub
model = NatureLM.from_pretrained("EarthSpeciesProject/NatureLM-audio")
model = model.eval().to("cuda")
```

```python
from NatureLM.infer import Pipeline

# Pass your audio files as file paths or as numpy arrays.
# NOTE: the Pipeline class automatically loads audio files and converts them to numpy arrays.
audio_paths = ["assets/nri-GreenTreeFrogEvergladesNP.mp3"]  # wav, mp3, ogg, and flac are supported

# Create a list of queries. For multiple audios you may also pass a single query
# as a string; the same query will then be used for all audios.
queries = ["What is the common name for the focal species in the audio? Answer:"]

pipeline = Pipeline(model=model)
# NOTE: pipeline = Pipeline() also works and downloads the model automatically.

# Run the model over the audio in sliding windows of 10 seconds with a hop length of 10 seconds
results = pipeline(audio_paths, queries, window_length_seconds=10.0, hop_length_seconds=10.0)
print(results)
# ['#0.00s - 10.00s#: Green Treefrog\n']
```

Example prompts:

Prompt: What is the common name for the focal species in the audio?
Answer: Humpback Whale

Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None.
Answer: Gibbon duet

Prompt: What is the common name for the focal species in the audio?
Answer: Spectacled Tetraka

Prompt: What is the life stage of the focal species in the audio?
Answer: Juvenile

Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'.

Prompt: Caption the audio, using the common name for any animal species.
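
Any of these prompts can be run with the `Pipeline` API shown above. A minimal sketch looping over a few of them; the audio file is a placeholder, and the trailing "Answer:" follows the query format used in the Direct Use example:

```python
from NatureLM.infer import Pipeline

pipeline = Pipeline()  # downloads the model automatically

audio_paths = ["my_recording.wav"]  # placeholder; substitute your own file

prompts = [
    "What is the common name for the focal species in the audio? Answer:",
    "What is the life stage of the focal species in the audio? Answer:",
    "Caption the audio, using the common name for any animal species. Answer:",
]

# One call per prompt; each returns one answer string per sliding window.
for prompt in prompts:
    results = pipeline(audio_paths, [prompt], window_length_seconds=10.0, hop_length_seconds=10.0)
    print(prompt, "->", results)
```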

### Downstream Use

The model can be used to structure audio for ethology research, be integrated into larger ecological monitoring systems, or be fine-tuned for specific bioacoustic tasks.
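
For example, when integrating the model into a monitoring system, the windowed answers can be parsed into structured records. A sketch, assuming the `#<start>s - <end>s#: <answer>` output format shown in the Direct Use example:

```python
import re

# Matches the '#0.00s - 10.00s#: Green Treefrog' format shown above;
# that this format holds for all tasks is an assumption here.
WINDOW_RE = re.compile(r"#(?P<start>[\d.]+)s - (?P<end>[\d.]+)s#:\s*(?P<label>.+)")

def parse_windows(result: str) -> list[dict]:
    """Turn one pipeline result string into per-window records."""
    records = []
    for line in result.splitlines():
        match = WINDOW_RE.match(line.strip())
        if match:
            records.append({
                "start_s": float(match.group("start")),
                "end_s": float(match.group("end")),
                "label": match.group("label").strip(),
            })
    return records

print(parse_windows("#0.00s - 10.00s#: Green Treefrog\n"))
# [{'start_s': 0.0, 'end_s': 10.0, 'label': 'Green Treefrog'}]
```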

### Out-of-Scope Use

The model is not designed for tasks outside of bioacoustics. It has not been tested for tasks such as individual identification, and call-type and life-stage classification have only been tested on birds. Tasks beyond those evaluated in the paper may require in-context learning or fine-tuning. The model does not currently perform fine-grained detection with exact timestamps.

### Bias, Risks, and Limitations

- **Bias:** The model may exhibit biases towards bird vocalizations due to the overrepresentation of bird datasets in the training data. This could limit its effectiveness for other taxa. Furthermore, the model may inherit biases from the parent Llama model.
- **Risks:** The model's ability to detect and classify endangered species could be misused for illegal activities such as poaching.
- **Limitations:** The model's performance may be limited for under-represented taxa.
- **Red-teaming results:** We ran a red-teaming assessment by first defining 16 risk categories adapted for AI safety in the context of animals, ecosystems, and the environment, such as Wildlife Exploitation, Non-Compliance with Environmental Laws, and Biodiversity Loss. We then used an LLM to generate adversarial prompts that could lead to harmful output, and evaluated the responses for safety. While the majority of responses from NatureLM-audio were safe, often providing no content for problematic prompts, we identified several scenarios where the model's responses were potentially harmful, including cases where the model failed to discourage unethical actions related to wildlife exploitation and environmental harm.

### Recommendations

Users should be aware of the risks, biases, and limitations of the model. It is recommended to use the model in conjunction with other ecological monitoring tools and to validate its predictions in real-world settings.

## How to Get Started with the Model

Refer to the GitHub [repository](https://github.com/earthspecies/naturelm-audio) for examples of model usage.

## Training Details

### Training Data

The model is trained on a diverse dataset of text-audio pairs, including bioacoustic recordings, general audio, speech, and music. The training data includes datasets such as Xeno-canto, iNaturalist, and Watkins. We release the [training dataset](https://huggingface.co/datasets/EarthSpeciesProject/NatureLM) on Hugging Face.

### Training Procedure

The model is trained in two stages:

1. **Perception Pretraining** on species classification.
2. **Generalization Fine-tuning** on a variety of bioacoustic tasks.

#### Training Hyperparameters

- **Learning rate:** 9.0e-5 (peak), 2.0e-5 (end)
- **Batch size:** 128
- **Training steps:** 5.0e5 (Stage 1), 1.6e6 (Stage 2)

For the full list of hyperparameters, consult the NatureLM-audio repository.
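
The card lists only the peak and end learning rates. The sketch below shows one way to realize such a schedule in PyTorch; the schedule shape (linear warmup, cosine decay) and the warmup length are illustrative assumptions, not the documented training setup:

```python
import math

import torch

PEAK_LR, END_LR = 9.0e-5, 2.0e-5

def lr_at_step(step: int, total_steps: int, warmup_steps: int = 5000) -> float:
    """Illustrative schedule: linear warmup to the peak, cosine decay to the end value."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return END_LR + 0.5 * (PEAK_LR - END_LR) * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)  # stand-in for the trainable parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: lr_at_step(step, total_steps=500_000) / PEAK_LR
)
```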

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model is evaluated on the [BEANS-Zero](https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero) benchmark, which includes tasks such as species classification, detection, and captioning.
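
The benchmark can be loaded with the `datasets` library; a short sketch (the split name is an assumption about the dataset layout):

```python
from datasets import load_dataset

# Repo id from the link above; the "test" split name is assumed.
beans_zero = load_dataset("EarthSpeciesProject/BEANS-Zero", split="test")
print(beans_zero[0])
```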

#### Metrics

- **Accuracy** for classification
- **F1** for detection
- **SPIDEr** for captioning
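
Accuracy and F1 can be computed with scikit-learn, as sketched below; the toy labels and macro averaging are illustrative assumptions, as the card does not state the averaging mode. SPIDEr (the mean of the SPICE and CIDEr captioning scores) requires a dedicated captioning-metrics implementation.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only.
y_true = ["Humpback Whale", "Green Treefrog", "None", "Green Treefrog"]
y_pred = ["Humpback Whale", "Green Treefrog", "Green Treefrog", "Green Treefrog"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```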

### Results

The model achieves state-of-the-art performance on several bioacoustics tasks, including zero-shot classification of unseen species; see the paper for the full benchmark tables.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 8xH100
- **Hours used:** 216
- **Cloud Provider:** Lambda Labs
- **Compute Region:** central-texas

## Technical Specifications

### Model Architecture and Objective

The model uses a BEATs audio encoder, a Q-Former that connects audio embeddings to the LLM, and Llama-3.1-8B-Instruct as the text generator.
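
A schematic of how these pieces connect, using stand-in modules. The dimensions follow the config.json in this repository (encoder_embed_dim 768, num_audio_query_token 1, ~0.33 s windows), while the frames-per-window value and attention details are illustrative assumptions rather than the actual implementation:

```python
import torch
import torch.nn as nn

# BEATs encodes the waveform into frame features, a window-level Q-Former
# compresses each ~0.33 s window into 1 query token, and a linear projection
# maps those tokens into Llama's embedding space.
BEATS_DIM, LLAMA_DIM = 768, 4096

qformer_query = nn.Parameter(torch.randn(1, 1, BEATS_DIM))  # 1 learned query per window
qformer = nn.MultiheadAttention(BEATS_DIM, num_heads=12, batch_first=True)
audio_llama_proj = nn.Linear(BEATS_DIM, LLAMA_DIM)

def encode_audio(beats_frames: torch.Tensor, frames_per_window: int = 16) -> torch.Tensor:
    """beats_frames: (num_frames, 768) features from BEATs for one clip."""
    tokens = []
    for window in beats_frames.split(frames_per_window):  # ~0.33 s chunks (assumed)
        # Cross-attend the learned query to the window's frames -> 1 token.
        out, _ = qformer(qformer_query, window[None], window[None])
        tokens.append(out[0])
    return audio_llama_proj(torch.cat(tokens))  # (num_windows, 4096)

audio_embeds = encode_audio(torch.randn(100, BEATS_DIM))
print(audio_embeds.shape)  # torch.Size([7, 4096]); spliced into the Llama prompt
```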

### Compute Infrastructure

- **Hardware:** 8xH100
- **Software:** PyTorch

## Citation

**BibTeX:**

```
@inproceedings{naturelm-audio,
  title={NatureLM-audio: An Audio-Language Foundation Model for Bioacoustics},
  author={Robinson, David and Miron, Marius and Hagiwara, Masato and Pietquin, Olivier},
  booktitle={Proceedings of the International Conference on Learning Representations},
  year={2025}
}
```

**APA:**

Robinson, D., Miron, M., Hagiwara, M., & Pietquin, O. (2025). NatureLM-audio: An audio-language foundation model for bioacoustics. ICLR 2025.

## Glossary

- **Bioacoustics:** The study of sound production and reception in animals.
- **Zero-shot learning:** The ability of a model to perform tasks it has not explicitly been trained on.
- **Taxa:** Groups of organisms, such as species, genera, or families.

## More Information

For more information, please visit the project page.

## Model Card Authors

- David Robinson (Earth Species Project)
- Marius Miron (Earth Species Project)
- Masato Hagiwara (Earth Species Project)
- Milad Alizadeh (Earth Species Project)
- Gagan Narula (Earth Species Project)
- Sara Keen (Earth Species Project)
- Benno Weck (Earth Species Project)
- Matthieu Geist (Earth Species Project)
- Olivier Pietquin (Earth Species Project)

## Model Card Contact

Contact: [[email protected]](mailto:[email protected])
config.json
ADDED
@@ -0,0 +1,48 @@
{
  "audio_llama_proj_model": "",
  "beats_cfg": {
    "activation_dropout": 0.0,
    "activation_fn": "gelu",
    "attention_dropout": 0.0,
    "conv_bias": false,
    "conv_pos": 128,
    "conv_pos_groups": 16,
    "deep_norm": true,
    "dropout": 0.0,
    "dropout_input": 0.0,
    "embed_dim": 512,
    "encoder_attention_heads": 12,
    "encoder_embed_dim": 768,
    "encoder_ffn_embed_dim": 3072,
    "encoder_layerdrop": 0.05,
    "encoder_layers": 12,
    "finetuned_model": true,
    "gru_rel_pos": true,
    "input_patch_size": 16,
    "layer_norm_first": false,
    "layer_wise_gradient_decay_ratio": 0.6,
    "max_distance": 800,
    "num_buckets": 320,
    "predictor_class": 527,
    "predictor_dropout": 0.0,
    "relative_position_embedding": true
  },
  "downsample_factor": 8,
  "end_sym": "<|end_of_text|>",
  "freeze_audio_QFormer": false,
  "freeze_audio_llama_proj": false,
  "freeze_beats": true,
  "llama_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "lora": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "lora_rank": 32,
  "max_pooling": false,
  "max_txt_len": 160,
  "num_audio_query_token": 1,
  "prompt_template": "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  "second_per_window": 0.333333,
  "second_stride": 0.333333,
  "use_audio_Qformer": true,
  "window_level_Qformer": true
}
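
A quick way to inspect this config without cloning the repository; a sketch using `huggingface_hub`, with the repo id taken from the README above:

```python
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download("EarthSpeciesProject/NatureLM-audio", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(config["llama_path"])             # meta-llama/Meta-Llama-3.1-8B-Instruct
print(config["num_audio_query_token"])  # 1
```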
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3957dd849a4231951bbbdce1b24eab5a0238883dd31020110934e512ec9ab786
size 1556386132
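
The file above is a Git LFS pointer rather than the weights themselves; the actual ~1.56 GB safetensors payload is resolved at download time. A sketch for checking a downloaded copy against the pointer's size and sha256 (the local path is a placeholder):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the sha256 digest that a Git LFS pointer records for a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

weights = Path("model.safetensors")  # placeholder: local copy after download
assert weights.stat().st_size == 1556386132
assert sha256_of(weights) == "3957dd849a4231951bbbdce1b24eab5a0238883dd31020110934e512ec9ab786"
```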