Speech Emotion Valence Classifier - Multilingual

Transformer-based emotion classifier trained on multilingual wav2vec2 multi-layer embeddings for classifying emotional valence in speech across 9 languages.

Model Details

Model Description

  • Architecture: Transformer Encoder with patch embedding and learnable positional embeddings
  • Input Features: wav2vec2 multi-layer embeddings (1024-dimensional, all 13 layers from wav2vec2-large-xlsr-53)
  • Pre-trained Base: facebook/wav2vec2-large-xlsr-53 (multilingual)
  • Framework: PyTorch
  • Task: Audio emotion valence classification
  • Model Size: ~3M parameters

Model Architecture

Input (1024-dim wav2vec2 multi-layer, all 13 layers)
    ↓
Patch Embedding (patch_size=64 → 16 patches, d_model=512)
    ↓
Learnable Positional Embeddings + CLS Token
    ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
    ↓
Classification Head (512 → 256 → 3 classes)
    ↓
Output: [negative, neutral, positive]

Architecture Details (see the PyTorch sketch after this list):

  • Input: 1024-dim multi-layer embeddings (all 13 layers from wav2vec2-large-xlsr-53)
  • Patch Embedding: 512-dim with patch_size=64 → 16 patches
  • Transformer Encoder: 4 layers, 8 attention heads (head_dim=64), 1024-dim FFN
  • Dropout: 0.2
  • Total Parameters: ~3M
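
Below is a minimal PyTorch sketch of the architecture described above. It is illustrative, not the released training code: the class name, the ReLU activation and dropout placement in the classification head, and the CLS-token pooling are assumptions; the dimensions and hyperparameters follow the list above.

import torch
import torch.nn as nn

class ValenceTransformer(nn.Module):
    """Illustrative re-implementation of the architecture described above."""
    def __init__(self, input_dim=1024, patch_size=64, d_model=512, nhead=8,
                 num_layers=4, dim_feedforward=1024, dropout=0.2, num_classes=3):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = input_dim // patch_size            # 1024 / 64 = 16 patches
        self.patch_embed = nn.Linear(patch_size, d_model)     # each 64-dim patch -> 512
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes))

    def forward(self, x):                                     # x: (batch, 1024)
        patches = x.view(x.size(0), self.num_patches, self.patch_size)
        tokens = self.patch_embed(patches)                    # (batch, 16, 512)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)                        # (batch, 17, 512)
        return self.head(encoded[:, 0])                       # logits for [negative, neutral, positive]

# Quick shape check
logits = ValenceTransformer()(torch.randn(2, 1024))           # -> (2, 3) logits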

Intended Use

Classify emotional valence in speech audio into three categories (see the mapping sketch after the list):

  • Negative: Sad, Angry, Fearful, Disgust
  • Neutral: Neutral, Calm
  • Positive: Happy, Surprised
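
For reference, a hypothetical Python mapping from common emotion labels to these three valence classes (the label strings are illustrative and should be adapted to the source dataset's label set):

# Hypothetical label-to-valence mapping; adjust keys to your dataset's labels.
VALENCE_MAP = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}

def to_valence(emotion_label: str) -> str:
    return VALENCE_MAP[emotion_label.lower()]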

Multilingual Support

Trained on wav2vec2 multi-layer embeddings covering 9 languages:

  • English (eng) - 78.77% accuracy
  • Chinese (cmn) - 95.00% accuracy
  • German (deu) - 93.52% accuracy
  • Urdu (urd) - 76.47% accuracy
  • Portuguese (por) - 78.95% accuracy
  • Greek (ell) - 69.57% accuracy
  • Persian (pes) - 80.83% accuracy
  • Estonian (est) - 55.28% accuracy
  • French (fra) - 73.87% accuracy

Training Data

  • Original Dataset: Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
  • Features: Multilingual Wav2Vec2 Multi-Layer Features (facebook/wav2vec2-large-xlsr-53, all 13 layers, 1024-dim)
  • Feature Extraction: Multi-layer embeddings extracted using facebook/wav2vec2-large-xlsr-53 from Vocametrix
  • Sample Rate: Standardized to 16kHz via wav2vec2 preprocessor
  • Valence Labels: 3-class (negative, neutral, positive)

Performance

Test Set Results (16,709 samples):

  • Overall Accuracy: 86.04%
  • Macro F1-Score: 84.46%
  • Weighted F1-Score: 86.04%
  • Classes: negative, neutral, positive
  • Epochs Trained: 67
  • Training Samples: 83,545

Per-Class Performance

                   Precision  Recall  F1-Score  Support
negative              0.92     0.83      0.87     8858
neutral               0.78     0.86      0.82     4452
positive              0.74     0.84      0.79     3399

Macro Average         0.81     0.84      0.83
Weighted Average      0.87     0.86      0.86

Per-Language Performance

Chinese (cmn)       - Accuracy: 95.00%, F1: 91.34%
German (deu)        - Accuracy: 93.52%, F1: 91.32%
Persian (pes)       - Accuracy: 80.83%, F1: (see logs)
Portuguese (por)    - Accuracy: 78.95%, F1: 75.64%
English (eng)       - Accuracy: 78.77%, F1: (see logs)
Urdu (urd)          - Accuracy: 76.47%, F1: 69.87%
French (fra)        - Accuracy: 73.87%, F1: (see logs)
Greek (ell)         - Accuracy: 69.57%, F1: (see logs)
Estonian (est)      - Accuracy: 55.28%, F1: (see logs)

Average Across Languages: 84.83% accuracy, 80.77% F1-score
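
To reproduce a breakdown like this on your own predictions, a small hypothetical helper using scikit-learn (the function name and the parallel-list inputs are assumptions, not part of this repository):

from collections import defaultdict
from sklearn.metrics import accuracy_score, f1_score

def per_language_scores(y_true, y_pred, languages):
    """Group gold labels and predictions by language code, then score each group."""
    buckets = defaultdict(lambda: ([], []))
    for gold, pred, lang in zip(y_true, y_pred, languages):
        buckets[lang][0].append(gold)
        buckets[lang][1].append(pred)
    return {
        lang: {"accuracy": accuracy_score(gold, pred),
               "macro_f1": f1_score(gold, pred, average="macro")}
        for lang, (gold, pred) in buckets.items()
    }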

Confusion Matrix

Predicted:           negative  neutral  positive
Actual negative:       83%      9%        8%
Actual neutral:         6%     86%        8%
Actual positive:       12%     12%       84%

Training Configuration

Model:
  Patch Size: 64 → 16 patches
  d_model: 512, num_layers: 4, nhead: 8
  dim_feedforward: 1024, dropout: 0.2

Hyperparameters:
  Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
  Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
  LR Schedule: Cosine annealing + 5% warmup
  Batch Size: 32, Epochs: 67
  Regularization: Gradient clipping (max_norm=1.0)
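
The configuration above corresponds roughly to the following PyTorch setup. This is a minimal sketch, not the released training script: `model`, `train_loader`, `class_weights`, and `num_epochs` are assumed to exist, and the warmup + cosine schedule is implemented with a plain LambdaLR.

import math
import torch
import torch.nn as nn

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)  # class-weighted

total_steps = num_epochs * len(train_loader)
warmup_steps = int(0.05 * total_steps)                      # 5% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine annealing to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(num_epochs):
    model.train()
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()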

Files

  • emotion_classifier_transformer.pt - PyTorch Transformer model weights
  • emotion_classifier_scaler.pkl - Feature normalization (StandardScaler for 1024-dim multi-layer input)

Usage

Python (Simple - Using Transformers Library)

import torch
import pickle
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import librosa
import numpy as np

# Load model and scaler
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The checkpoint is used as a full model object below, so weights_only=False
# is needed on PyTorch >= 2.6 (where torch.load defaults to weights_only=True)
model = torch.load('emotion_classifier_transformer.pt', map_location=device, weights_only=False)
model.eval()
with open('emotion_classifier_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Load wav2vec2 processor and model (multilingual)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
wav2vec2_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").to(device)

# Load and process audio
audio_path = "speech.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# Extract wav2vec2 multi-layer features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
wav2vec2_model.eval()
with torch.no_grad():
    outputs = wav2vec2_model(inputs.input_values.to(device), output_hidden_states=True)

    # Average the selected hidden layers (here layers 17-24) rather than
    # concatenating them, so the pooled feature stays 1024-dim as the scaler
    # expects; match the layer selection used for the training features.
    hidden_states = outputs.hidden_states
    multi_layer = torch.stack([hidden_states[i] for i in range(17, 25)], dim=0).mean(dim=0)
    features = multi_layer.mean(dim=1)  # average over time -> (batch, 1024)

# Normalize and predict
features_scaled = scaler.transform(features.cpu().numpy())
with torch.no_grad():
    emotion_logits = model(torch.FloatTensor(features_scaled).to(device))
probs = emotion_logits.softmax(dim=1)
emotion_idx = probs.argmax(dim=1)

classes = ["negative", "neutral", "positive"]
print(f"Emotion: {classes[emotion_idx.item()]}")
print(f"Confidence: {probs.max().item():.2%}")

From Hugging Face Hub

from huggingface_hub import hf_hub_download
import torch
import pickle

# Download model
model_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_transformer.pt"
)
scaler_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_scaler.pkl"
)

# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device, weights_only=False)  # full model object; needed on PyTorch >= 2.6
model.eval()
with open(scaler_path, 'rb') as f:
    scaler = pickle.load(f)

print("Model loaded successfully!")

Training

The model was trained with:

  • Framework: PyTorch 2.6+ on NVIDIA Tesla T4 (Kaggle)
  • Optimizer: AdamW with learning rate 2e-4, weight decay 0.01
  • Loss: Cross-entropy with label smoothing (0.1) and class weighting
  • Regularization: Gradient clipping (max_norm=1.0)
  • LR Schedule: Cosine annealing with 5% warmup
  • Train/Test Split: 80/20 stratified
  • Batch Size: 32, Early Stopping (patience=30)
  • Epochs: 67 trained

See kaggle_transformer_training.ipynb for full training script.
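
For orientation, a minimal sketch of the stratified split and early stopping listed above, assuming `features`/`labels` arrays and `train_one_epoch`/`evaluate` helpers that are not part of this repository:

import torch
from sklearn.model_selection import train_test_split

# 80/20 stratified split on pooled 1024-dim embeddings and valence labels
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

best_score, patience, bad_epochs = 0.0, 30, 0
for epoch in range(200):
    train_one_epoch(model, X_train, y_train)                 # assumed helper
    score = evaluate(model, X_val, y_val)                    # assumed helper (e.g. macro F1)
    if score > best_score:
        best_score, bad_epochs = score, 0
        torch.save(model, "emotion_classifier_transformer.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                           # early stopping (patience=30)
            break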

Model Characteristics

✅ Multilingual: 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)
✅ Multi-Layer Features: All 13 layers (1024-dim) from wav2vec2-large-xlsr-53 for richer representations
✅ Transformer Architecture: 4-layer encoder, 8 attention heads, patch embeddings
✅ Strong Regularization: Label smoothing, class weighting, gradient clipping
✅ High Performance: 86.04% accuracy, 84.46% F1-score on test set
✅ Stable Training: Cosine annealing with warmup

Limitations

  • Single vector per audio (no temporal dynamics)
  • Best performance on speech; music/singing untested
  • 9-language training may not generalize to all languages
  • Requires 16kHz audio and wav2vec2 multi-layer preprocessing
  • Performance variance across languages (55%-95%)

Bias & Fairness

  • Dataset includes speakers from 9 languages with varied accents
  • Performance varies by language (55%-95% accuracy range)
  • Chinese and German show strongest performance (93-95%)
  • Estonian shows lower performance (55%) and may need language-specific tuning
  • Gender/age representation varies by language
  • Recommended to evaluate on domain-specific data before production use

Ethical Considerations

  • Model predictions should not be used for critical decisions affecting individuals
  • Emotion classification from speech is inherently imperfect
  • Consider user privacy when processing audio
  • Disclose use of AI-based emotion analysis to users
  • Be aware of cultural differences in emotion expression

License

MIT License - See repository for full license text

Citation

If you use this model, please cite:

@misc{vcmx-emotions-multilayers,
    author = {Patrick Marmaroli},
    title = {Multilingual Wav2Vec2 Multi-Layer Emotion Features},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/datasets/vocametrix/vcmx-emotions-wav2vec2-large-xlsr-53-multilayers}}
}

@inproceedings{wav2vec2-xlsr,
    title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
    author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
    booktitle={Proc. Interspeech},
    year={2021}
}

@article{transformer,
    title={Attention Is All You Need},
    author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
    journal={Advances in Neural Information Processing Systems},
    year={2017}
}

Uploaded: 2025-11-10 12:08:30 UTC
Version: 3.0 (Transformer + Multilingual + Multi-Layer Features)
