Speech Emotion Valence Classifier - Multilingual
A Transformer-based classifier that predicts the emotional valence of speech from multilingual wav2vec2 multi-layer embeddings, covering 9 languages.
Model Details
Model Description
- Architecture: Transformer Encoder with patch embedding and learnable positional embeddings
- Input Features: WAV2VEC2 multi-layer embeddings (1024-dimensional, all 13 layers from wav2vec2-large-xlsr-53)
- Pre-trained Base: facebook/wav2vec2-large-xlsr-53 (multilingual)
- Framework: PyTorch
- Task: Audio emotion valence classification
- Model Size: ~3M parameters
Model Architecture
```
Input (1024-dim wav2vec2 multi-layer embedding, all 13 layers)
        ↓
Patch Embedding (patch_size=64 → 16 patches, d_model=512)
        ↓
Learnable Positional Embeddings + CLS Token
        ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
        ↓
Classification Head (512 → 256 → 3 classes)
        ↓
Output: [negative, neutral, positive]
```
Architecture Details:
- Input: 1024-dim multi-layer embeddings (all 13 layers from wav2vec2-large-xlsr-53)
- Patch Embedding: 512-dim with patch_size=64 → 16 patches
- Transformer Encoder: 4 layers, 8 attention heads (head_dim=64), 1024-dim FFN
- Dropout: 0.2
- Total Parameters: ~3M
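The model class itself is not reproduced in this card; purely as orientation, a minimal PyTorch sketch consistent with the details above (the class name `EmotionValenceTransformer` and the exact layer layout are assumptions, not the actual training code) could look like:

```python
import torch
import torch.nn as nn

class EmotionValenceTransformer(nn.Module):
    """Illustrative reconstruction of the architecture described above."""

    def __init__(self, input_dim=1024, patch_size=64, d_model=512,
                 num_layers=4, nhead=8, dim_feedforward=1024,
                 dropout=0.2, num_classes=3):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = input_dim // patch_size          # 1024 / 64 = 16 patches
        self.patch_embed = nn.Linear(patch_size, d_model)   # each 64-dim patch -> 512-dim token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Sequential(                           # 512 -> 256 -> 3 classes
            nn.Linear(d_model, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes))

    def forward(self, x):                                    # x: (batch, 1024)
        b = x.size(0)
        patches = x.view(b, self.num_patches, self.patch_size)
        tokens = self.patch_embed(patches)                   # (batch, 16, 512)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                      # classify from the CLS token
```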
Intended Use
Classify emotional valence in speech audio into three categories:
- Negative: Sad, Angry, Fearful, Disgust
- Neutral: Neutral, Calm
- Positive: Happy, Surprised
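If you need to collapse raw emotion labels into these three valence classes, a mapping along the lines of the grouping above could look like the following (an illustrative sketch; the exact label mapping used during training is defined by the dataset pipeline):

```python
# Illustrative emotion -> valence mapping based on the grouping above
EMOTION_TO_VALENCE = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}

VALENCE_CLASSES = ["negative", "neutral", "positive"]  # class index order used by the classifier

def valence_index(emotion_label: str) -> int:
    """Map a raw emotion label to the classifier's class index."""
    return VALENCE_CLASSES.index(EMOTION_TO_VALENCE[emotion_label.lower()])
```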
Multilingual Support
Trained on wav2vec2 multi-layer embeddings covering 9 languages:
- English (eng) - 78.77% accuracy
- Chinese (cmn) - 95.00% accuracy
- German (deu) - 93.52% accuracy
- Urdu (urd) - 76.47% accuracy
- Portuguese (por) - 78.95% accuracy
- Greek (ell) - 69.57% accuracy
- Persian (pes) - 80.83% accuracy
- Estonian (est) - 55.28% accuracy
- French (fra) - 73.87% accuracy
Training Data
- Original Dataset: Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
- Source: https://github.com/michen00/unified_multilingual_dataset_of_emotional_human_utterances
- 83,545 audio samples (filtered and processed)
- 22 source emotion datasets combined: CREMA-D, RAVDESS, TESS, EmoDB, ShEMO, and more
- 9 languages: English, Chinese, German, Greek, Urdu, Estonian, French, Portuguese, Persian
- Pre-processed: 16kHz, mono, PCM 16-bit WAV
- Features: Multilingual Wav2Vec2 Multi-Layer Features (facebook/wav2vec2-large-xlsr-53, all 13 layers, 1024-dim)
- Feature Extraction: Multi-layer embeddings extracted using facebook/wav2vec2-large-xlsr-53 from Vocametrix
- Sample Rate: Standardized to 16kHz via wav2vec2 preprocessor
- Valence Labels: 3-class (negative, neutral, positive)
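Since the dataset is standardized to 16 kHz, mono, 16-bit PCM WAV, a minimal preprocessing sketch for new audio (illustrative only; the file names are placeholders and this is not the project's actual pipeline) could be:

```python
import librosa
import soundfile as sf

def to_16k_mono_wav(src_path: str, dst_path: str) -> None:
    """Resample any input audio to 16 kHz mono and write it as 16-bit PCM WAV."""
    audio, _ = librosa.load(src_path, sr=16000, mono=True)  # resamples and downmixes
    sf.write(dst_path, audio, samplerate=16000, subtype="PCM_16")

to_16k_mono_wav("raw_clip.mp3", "clip_16k.wav")  # hypothetical file names
```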
Performance
Test Set Results (16,709 samples):
- Overall Accuracy: 86.04%
- Macro F1-Score: 84.46%
- Weighted F1-Score: 86.04%
- Classes: negative, neutral, positive
- Epochs Trained: 67
- Training Samples: 83,545
Per-Class Performance
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| negative | 0.92 | 0.83 | 0.87 | 8858 |
| neutral | 0.78 | 0.86 | 0.82 | 4452 |
| positive | 0.74 | 0.84 | 0.79 | 3399 |
| Macro average | 0.81 | 0.84 | 0.83 | 16709 |
| Weighted average | 0.87 | 0.86 | 0.86 | 16709 |
Per-Language Performance
| Language | Accuracy | F1-Score |
|---|---|---|
| Chinese (cmn) | 95.00% | 91.34% |
| German (deu) | 93.52% | 91.32% |
| Persian (pes) | 80.83% | (see logs) |
| Portuguese (por) | 78.95% | 75.64% |
| English (eng) | 78.77% | (see logs) |
| Urdu (urd) | 76.47% | 69.87% |
| French (fra) | 73.87% | (see logs) |
| Greek (ell) | 69.57% | (see logs) |
| Estonian (est) | 55.28% | (see logs) |
Average Across Languages: 84.83% accuracy, 80.77% F1-score
Confusion Matrix
| Actual \ Predicted | negative | neutral | positive |
|---|---|---|---|
| negative | 83% | 9% | 8% |
| neutral | 6% | 86% | 8% |
| positive | 12% | 12% | 84% |
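For reference, the per-class report and the row-normalized confusion matrix above can be computed with scikit-learn. A short sketch, assuming `y_true`/`y_pred` arrays of class indices (the values below are placeholders, not the released test predictions):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

classes = ["negative", "neutral", "positive"]

# y_true / y_pred: integer class indices over the test set (placeholder values here)
y_true = np.array([0, 1, 2, 0])
y_pred = np.array([0, 1, 1, 0])

# Precision / recall / F1 per class plus macro and weighted averages
print(classification_report(y_true, y_pred, target_names=classes, digits=2))

# Row-normalized confusion matrix: each row is the share of that actual class
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(np.round(cm * 100, 1))
```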
Training Configuration
Model:
- Patch Size: 64 → 16 patches
- d_model: 512, num_layers: 4, nhead: 8
- dim_feedforward: 1024, dropout: 0.2

Hyperparameters:
- Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
- Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
- LR Schedule: Cosine annealing + 5% warmup
- Batch Size: 32, Epochs: 67
- Regularization: Gradient clipping (max_norm=1.0)
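A minimal PyTorch setup matching these hyperparameters is sketched below. It is purely illustrative: the class-weight values and `num_training_steps` are placeholders, `EmotionValenceTransformer` is the hypothetical class sketched under Architecture Details, and the warmup-plus-cosine schedule is expressed here with a plain `LambdaLR` rather than whatever the training notebook actually used.

```python
import math

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = EmotionValenceTransformer().to(device)  # hypothetical class from the earlier sketch

# Class-weighted cross-entropy with label smoothing (weight values are placeholders)
class_weights = torch.tensor([1.0, 1.5, 1.8]).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

# Cosine annealing with a 5% linear warmup, expressed as a per-step LR multiplier
num_training_steps = 10_000  # placeholder: epochs * batches_per_epoch
num_warmup_steps = int(0.05 * num_training_steps)

def lr_lambda(step):
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Stepping the scheduler once per batch keeps the 5% warmup fraction defined in optimization steps rather than epochs.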
Files
- emotion_classifier_transformer.pt - PyTorch Transformer model weights
- emotion_classifier_scaler.pkl - Feature normalization (StandardScaler for 1024-dim multi-layer input)
Usage
Python (Simple - Using Transformers Library)
```python
import pickle

import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the classifier and the feature scaler
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The checkpoint stores a full pickled model object, so weights_only=False is needed on recent PyTorch
model = torch.load('emotion_classifier_transformer.pt', map_location=device, weights_only=False)
model.eval()
with open('emotion_classifier_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Load the multilingual wav2vec2 encoder; the XLSR-53 checkpoint is pretrained-only
# (no tokenizer), so the feature extractor is used for preprocessing
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
wav2vec2_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").to(device)
wav2vec2_model.eval()

# Load and resample the audio to 16 kHz mono
audio_path = "speech.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# Extract wav2vec2 multi-layer features (layers 17-24)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = wav2vec2_model(inputs.input_values.to(device), output_hidden_states=True)
    hidden_states = outputs.hidden_states
    # Average layers 17-24 (rather than concatenating) and pool over time so the
    # vector stays 1024-dim, which is what the scaler and classifier expect
    multi_layer = torch.stack([hidden_states[i] for i in range(17, 25)], dim=0).mean(dim=0)
    features = multi_layer.mean(dim=1)  # (batch, 1024)

# Normalize and predict
features_scaled = scaler.transform(features.cpu().numpy())
with torch.no_grad():
    emotion_logits = model(torch.FloatTensor(features_scaled).to(device))
emotion_idx = emotion_logits.argmax(dim=1)

classes = ["negative", "neutral", "positive"]
print(f"Emotion: {classes[emotion_idx.item()]}")
print(f"Confidence: {emotion_logits.softmax(dim=1).max().item():.2%}")
```
From Hugging Face Hub
```python
import pickle

import torch
from huggingface_hub import hf_hub_download

# Download the classifier weights and the feature scaler
model_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_transformer.pt",
)
scaler_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_scaler.pkl",
)

# Load them (the checkpoint is a full pickled model, hence weights_only=False on recent PyTorch)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device, weights_only=False)
with open(scaler_path, 'rb') as f:
    scaler = pickle.load(f)

print("Model loaded successfully!")
```
Training
The model was trained with:
- Framework: PyTorch 2.6+ on NVIDIA Tesla T4 (Kaggle)
- Optimizer: AdamW with learning rate 2e-4, weight decay 0.01
- Loss: Cross-entropy with label smoothing (0.1) and class weighting
- Regularization: Gradient clipping (max_norm=1.0)
- LR Schedule: Cosine annealing with 5% warmup
- Train/Test Split: 80/20 stratified
- Batch Size: 32, Early Stopping (patience=30)
- Epochs: 67 trained
See kaggle_transformer_training.ipynb for full training script.
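To make the list above concrete, here is a compressed, illustrative training loop showing where gradient clipping, per-step LR scheduling, and early stopping (patience=30) fit. The `device`, `model`, `optimizer`, `scheduler`, and `criterion` objects from the configuration sketch above are assumed, `train_loader`/`val_loader` are hypothetical DataLoaders of (1024-dim features, class indices), and this is not the notebook's actual code.

```python
import copy

import torch

best_val_acc, best_state = 0.0, None
patience, epochs_without_improvement = 30, 0

for epoch in range(200):  # generous upper bound; early stopping ended this run after 67 epochs
    model.train()
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features.to(device)), labels.to(device))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()  # cosine annealing + warmup, stepped per batch

    # Validation accuracy drives early stopping
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for features, labels in val_loader:
            preds = model(features.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    val_acc = correct / total

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break
```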
Model Characteristics
- Multilingual: 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)
- Multi-Layer Features: All 13 layers (1024-dim) from wav2vec2-large-xlsr-53 for richer representations
- Transformer Architecture: 4-layer encoder, 8 attention heads, patch embeddings
- Strong Regularization: Label smoothing, class weighting, gradient clipping
- High Performance: 86.04% accuracy, 84.46% macro F1-score on the test set
- Stable Training: Cosine annealing with warmup
Limitations
- Single vector per audio (no temporal dynamics)
- Best performance on speech; music/singing untested
- 9-language training may not generalize to all languages
- Requires 16kHz audio and wav2vec2 multi-layer preprocessing
- Performance variance across languages (55%-95%)
Bias & Fairness
- Dataset includes speakers from 9 languages with varied accents
- Performance varies by language (55%-95% accuracy range)
- Chinese and German show strongest performance (93-95%)
- Estonian shows lower performance (55%) and may need language-specific tuning
- Gender/age representation varies by language
- Recommended to evaluate on domain-specific data before production use
Ethical Considerations
- Model predictions should not be used for critical decisions affecting individuals
- Emotion classification from speech is inherently imperfect
- Consider user privacy when processing audio
- Disclose use of AI-based emotion analysis to users
- Be aware of cultural differences in emotion expression
Related Models & Datasets
- Base Model: facebook/wav2vec2-large-xlsr-53 - Multilingual XLSR Wav2Vec2
- Dataset Source: Unified Multilingual Dataset - GitHub repository with 87K+ multilingual emotion samples
- Feature Dataset: vocametrix/vcmx-emotions-wav2vec2-large-xlsr-53-multilayers - Pre-extracted multi-layer features
- Organization: Vocametrix on Hugging Face
License
MIT License - See repository for full license text
Citation
If you use this model, please cite:
@misc{vcmx-emotions-multilayers,
author = {Patrick Marmaroli},
title = {Multilingual Wav2Vec2 Multi-Layer Emotion Features},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/vocametrix/vcmx-emotions-wav2vec2-large-xlsr-53-multilayers}}
}
@inproceedings{wav2vec2-xlsr,
title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
booktitle={Proc. Interspeech 2021},
year={2021}
}
@article{transformer,
title={Attention Is All You Need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
journal={Advances in Neural Information Processing Systems},
year={2017}
}
References
- WAV2VEC2 XLSR: https://arxiv.org/abs/2006.13979
- Transformers Library: https://huggingface.co/docs/transformers/
- Vocametrix Platform: https://github.com/pmarmaroli/vocametrix-platform
- Facebook Research: https://research.facebook.com/
Repository
- Model: https://huggingface.co/vocametrix/speech-emotion-valence-classifier
- Organization: https://huggingface.co/vocametrix
- Platform Code: https://github.com/pmarmaroli/vocametrix-platform
Uploaded: 2025-11-10 12:08:30 UTC
Version: 3.0 (Transformer + Multilingual + Multi-Layer Features)