Speech Emotion Valence Classifier - Multilingual

Transformer-based emotion classifier trained on multilingual wav2vec2 multi-layer embeddings for classifying emotional valence in speech across 9 languages.

Model Details

Model Description

  • Architecture: Transformer Encoder with patch embedding and learnable positional embeddings
  • Input Features: wav2vec2 multi-layer embeddings (1024-dimensional, all 13 layers from wav2vec2-large-xlsr-53)
  • Pre-trained Base: facebook/wav2vec2-large-xlsr-53 (multilingual)
  • Framework: PyTorch
  • Task: Audio emotion valence classification
  • Model Size: ~3M parameters

Model Architecture

Input (1024-dim wav2vec2 multi-layer, all 13 layers)
    ↓
Patch Embedding (patch_size=64 → 16 patches, d_model=512)
    ↓
Learnable Positional Embeddings + CLS Token
    ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
    ↓
Classification Head (512 → 256 → 3 classes)
    ↓
Output: [negative, neutral, positive]

Architecture Details (see the PyTorch sketch after this list):

  • Input: 1024-dim multi-layer embeddings (all 13 layers from wav2vec2-large-xlsr-53)
  • Patch Embedding: 512-dim with patch_size=64 → 16 patches
  • Transformer Encoder: 4 layers, 8 attention heads (head_dim=64), 1024-dim FFN
  • Dropout: 0.2
  • Total Parameters: ~3M
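
Below is a minimal PyTorch sketch of the architecture described above. It is illustrative, not the released training code: the class name, the ReLU activation and dropout placement in the classification head, and the CLS-token pooling are assumptions; the dimensions and hyperparameters follow the list above.

import torch
import torch.nn as nn

class ValenceTransformer(nn.Module):
    """Illustrative re-implementation of the architecture described above."""
    def __init__(self, input_dim=1024, patch_size=64, d_model=512, nhead=8,
                 num_layers=4, dim_feedforward=1024, dropout=0.2, num_classes=3):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = input_dim // patch_size            # 1024 / 64 = 16 patches
        self.patch_embed = nn.Linear(patch_size, d_model)     # each 64-dim patch -> 512
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes))

    def forward(self, x):                                     # x: (batch, 1024)
        patches = x.view(x.size(0), self.num_patches, self.patch_size)
        tokens = self.patch_embed(patches)                    # (batch, 16, 512)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)                        # (batch, 17, 512)
        return self.head(encoded[:, 0])                       # logits for [negative, neutral, positive]

# Quick shape check
logits = ValenceTransformer()(torch.randn(2, 1024))           # -> (2, 3) logits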

Intended Use

Classify emotional valence in speech audio into three categories (see the mapping sketch after the list):

  • Negative: Sad, Angry, Fearful, Disgust
  • Neutral: Neutral, Calm
  • Positive: Happy, Surprised
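
For reference, a hypothetical Python mapping from common emotion labels to these three valence classes (the label strings are illustrative and should be adapted to the source dataset's label set):

# Hypothetical label-to-valence mapping; adjust keys to your dataset's labels.
VALENCE_MAP = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}

def to_valence(emotion_label: str) -> str:
    return VALENCE_MAP[emotion_label.lower()]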

Multilingual Support

Trained on wav2vec2 multi-layer embeddings covering 9 languages:

  • English (eng) - 78.77% accuracy
  • Chinese (cmn) - 95.00% accuracy
  • German (deu) - 93.52% accuracy
  • Urdu (urd) - 76.47% accuracy
  • Portuguese (por) - 78.95% accuracy
  • Greek (ell) - 69.57% accuracy
  • Persian (pes) - 80.83% accuracy
  • Estonian (est) - 55.28% accuracy
  • French (fra) - 73.87% accuracy

Training Data

  • Original Dataset: Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
  • Features: Multilingual Wav2Vec2 Multi-Layer Features (facebook/wav2vec2-large-xlsr-53, all 13 layers, 1024-dim)
  • Feature Extraction: Multi-layer embeddings extracted using facebook/wav2vec2-large-xlsr-53 from Vocametrix
  • Sample Rate: Standardized to 16kHz via wav2vec2 preprocessor
  • Valence Labels: 3-class (negative, neutral, positive)

Performance

Test Set Results (16,709 samples):

  • Overall Accuracy: 86.04%
  • Macro F1-Score: 84.46%
  • Weighted F1-Score: 86.04%
  • Classes: negative, neutral, positive
  • Epochs Trained: 67
  • Training Samples: 83,545

Per-Class Performance

                   Precision  Recall  F1-Score  Support
negative              0.92     0.83      0.87     8858
neutral               0.78     0.86      0.82     4452
positive              0.74     0.84      0.79     3399

Macro Average         0.81     0.84      0.83
Weighted Average      0.87     0.86      0.86

Per-Language Performance

Chinese (cmn)       - Accuracy: 95.00%, F1: 91.34%
German (deu)        - Accuracy: 93.52%, F1: 91.32%
Persian (pes)       - Accuracy: 80.83%, F1: (see logs)
Portuguese (por)    - Accuracy: 78.95%, F1: 75.64%
English (eng)       - Accuracy: 78.77%, F1: (see logs)
Urdu (urd)          - Accuracy: 76.47%, F1: 69.87%
French (fra)        - Accuracy: 73.87%, F1: (see logs)
Greek (ell)         - Accuracy: 69.57%, F1: (see logs)
Estonian (est)      - Accuracy: 55.28%, F1: (see logs)

Average Across Languages: 84.83% accuracy, 80.77% F1-score
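
To reproduce a breakdown like this on your own predictions, a small hypothetical helper using scikit-learn (the function name and the parallel-list inputs are assumptions, not part of this repository):

from collections import defaultdict
from sklearn.metrics import accuracy_score, f1_score

def per_language_scores(y_true, y_pred, languages):
    """Group gold labels and predictions by language code, then score each group."""
    buckets = defaultdict(lambda: ([], []))
    for gold, pred, lang in zip(y_true, y_pred, languages):
        buckets[lang][0].append(gold)
        buckets[lang][1].append(pred)
    return {
        lang: {"accuracy": accuracy_score(gold, pred),
               "macro_f1": f1_score(gold, pred, average="macro")}
        for lang, (gold, pred) in buckets.items()
    }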

Confusion Matrix

Predicted:           negative  neutral  positive
Actual negative:       83%      9%        8%
Actual neutral:         6%     86%        8%
Actual positive:       12%     12%       84%

Training Configuration

Model:
  Patch Size: 64 → 16 patches
  d_model: 512, num_layers: 4, nhead: 8
  dim_feedforward: 1024, dropout: 0.2

Hyperparameters:
  Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
  Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
  LR Schedule: Cosine annealing + 5% warmup
  Batch Size: 32, Epochs: 67
  Regularization: Gradient clipping (max_norm=1.0)
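
The configuration above corresponds roughly to the following PyTorch setup. This is a minimal sketch, not the released training script: `model`, `train_loader`, `class_weights`, and `num_epochs` are assumed to exist, and the warmup + cosine schedule is implemented with a plain LambdaLR.

import math
import torch
import torch.nn as nn

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)  # class-weighted

total_steps = num_epochs * len(train_loader)
warmup_steps = int(0.05 * total_steps)                      # 5% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine annealing to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(num_epochs):
    model.train()
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()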

Files

  • emotion_classifier_transformer.pt - PyTorch Transformer model weights
  • emotion_classifier_scaler.pkl - Feature normalization (StandardScaler for 1024-dim multi-layer input)

Usage

Python (Simple - Using Transformers Library)

import torch
import pickle
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import librosa
import numpy as np

# Load model and scaler
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The checkpoint is used as a full model object below, so weights_only=False
# is needed on PyTorch >= 2.6 (where torch.load defaults to weights_only=True)
model = torch.load('emotion_classifier_transformer.pt', map_location=device, weights_only=False)
model.eval()
with open('emotion_classifier_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Load wav2vec2 processor and model (multilingual)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
wav2vec2_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").to(device)

# Load and process audio
audio_path = "speech.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# Extract wav2vec2 multi-layer features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
wav2vec2_model.eval()
with torch.no_grad():
    outputs = wav2vec2_model(inputs.input_values.to(device), output_hidden_states=True)

    # Average the selected hidden layers (here layers 17-24) rather than
    # concatenating them, so the pooled feature stays 1024-dim as the scaler
    # expects; match the layer selection used for the training features.
    hidden_states = outputs.hidden_states
    multi_layer = torch.stack([hidden_states[i] for i in range(17, 25)], dim=0).mean(dim=0)
    features = multi_layer.mean(dim=1)  # average over time -> (batch, 1024)

# Normalize and predict
features_scaled = scaler.transform(features.cpu().numpy())
with torch.no_grad():
    emotion_logits = model(torch.FloatTensor(features_scaled).to(device))
probs = emotion_logits.softmax(dim=1)
emotion_idx = probs.argmax(dim=1)

classes = ["negative", "neutral", "positive"]
print(f"Emotion: {classes[emotion_idx.item()]}")
print(f"Confidence: {probs.max().item():.2%}")

From Hugging Face Hub

from huggingface_hub import hf_hub_download
import torch
import pickle

# Download model
model_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_transformer.pt"
)
scaler_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="emotion_classifier_scaler.pkl"
)

# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load(model_path, map_location=device, weights_only=False)  # full model object; needed on PyTorch >= 2.6
model.eval()
with open(scaler_path, 'rb') as f:
    scaler = pickle.load(f)

print("Model loaded successfully!")

Training

The model was trained with:

  • Framework: PyTorch 2.6+ on NVIDIA Tesla T4 (Kaggle)
  • Optimizer: AdamW with learning rate 2e-4, weight decay 0.01
  • Loss: Cross-entropy with label smoothing (0.1) and class weighting
  • Regularization: Gradient clipping (max_norm=1.0)
  • LR Schedule: Cosine annealing with 5% warmup
  • Train/Test Split: 80/20 stratified
  • Batch Size: 32, Early Stopping (patience=30)
  • Epochs: 67 trained

See kaggle_transformer_training.ipynb for full training script.
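
For orientation, a minimal sketch of the stratified split and early stopping listed above, assuming `features`/`labels` arrays and `train_one_epoch`/`evaluate` helpers that are not part of this repository:

import torch
from sklearn.model_selection import train_test_split

# 80/20 stratified split on pooled 1024-dim embeddings and valence labels
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

best_score, patience, bad_epochs = 0.0, 30, 0
for epoch in range(200):
    train_one_epoch(model, X_train, y_train)                 # assumed helper
    score = evaluate(model, X_val, y_val)                    # assumed helper (e.g. macro F1)
    if score > best_score:
        best_score, bad_epochs = score, 0
        torch.save(model, "emotion_classifier_transformer.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                           # early stopping (patience=30)
            break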

Model Characteristics

✅ Multilingual: 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)
✅ Multi-Layer Features: All 13 layers (1024-dim) from wav2vec2-large-xlsr-53 for richer representations
✅ Transformer Architecture: 4-layer encoder, 8 attention heads, patch embeddings
✅ Strong Regularization: Label smoothing, class weighting, gradient clipping
✅ High Performance: 86.04% accuracy, 84.46% F1-score on test set
✅ Stable Training: Cosine annealing with warmup

Limitations

  • Single vector per audio (no temporal dynamics)
  • Best performance on speech; music/singing untested
  • 9-language training may not generalize to all languages
  • Requires 16kHz audio and wav2vec2 multi-layer preprocessing
  • Performance variance across languages (55%-95%)

Bias & Fairness

  • Dataset includes speakers from 9 languages with varied accents
  • Performance varies by language (55%-95% accuracy range)
  • Chinese and German show strongest performance (93-95%)
  • Estonian shows lower performance (55%) and may need language-specific tuning
  • Gender/age representation varies by language
  • Recommended to evaluate on domain-specific data before production use

Ethical Considerations

  • Model predictions should not be used for critical decisions affecting individuals
  • Emotion classification from speech is inherently imperfect
  • Consider user privacy when processing audio
  • Disclose use of AI-based emotion analysis to users
  • Be aware of cultural differences in emotion expression

License

MIT License - See repository for full license text

Citation

If you use this model, please cite:

@misc{vcmx-emotions-multilayers,
    author = {Patrick Marmaroli},
    title = {Multilingual Wav2Vec2 Multi-Layer Emotion Features},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/datasets/vocametrix/vcmx-emotions-wav2vec2-large-xlsr-53-multilayers}}
}

@inproceedings{wav2vec2-xlsr,
    title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
    author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
    booktitle={Proc. Interspeech},
    year={2021}
}

@article{transformer,
    title={Attention Is All You Need},
    author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
    journal={Advances in Neural Information Processing Systems},
    year={2017}
}

Uploaded: 2025-11-10 12:08:30 UTC
Version: 3.0 (Transformer + Multilingual + Multi-Layer Features)
