# εar-VAE: High Fidelity Music Reconstruction Model

This repository contains the official inference code for εar-VAE, a 44.1 kHz music signal reconstruction model that rethinks and optimizes VAE training for audio. It targets two common weaknesses in existing open-source VAEs, phase accuracy and stereophonic spatial representation, by aligning training objectives with auditory perception and introducing phase-aware training. Experiments show substantial improvements across diverse metrics, with particular strength in high-frequency harmonics and spatial characteristics.

*Upper: ablation study across our training components. Lower: cross-model metric comparison on the evaluation dataset.*

**Why εar-VAE:**

- 🎧 **Perceptual alignment**: A K-weighting perceptual filter is applied before loss computation to better match human hearing.
- 🔁 **Phase-aware objectives**: Two novel phase losses (a toy sketch of these objectives appears at the end of Section 4):
  - Stereo Correlation Loss for robust inter-channel coherence.
  - Phase-Derivative Loss using Instantaneous Frequency and Group Delay for phase precision.
- 🌈 **Spectral supervision paradigm**: Magnitude is supervised across MSLR (Mid/Side/Left/Right) components, while phase is supervised only on LR (Left/Right), improving stability and fidelity.
- 📈 **44.1 kHz performance**: Outperforms leading open-source models across diverse metrics.

## 1. Installation

Follow these steps to set up the environment and install the necessary dependencies.

### Installation Steps

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd ear_vae
   ```
2. **Create and activate a conda environment:**
   ```bash
   conda create -n ear_vae python=3.8
   conda activate ear_vae
   ```
3. **Run the installation script:** This script will install the remaining dependencies.
   ```bash
   bash install_requirements.sh
   ```
   This will install:
   - `descript-audio-codec`
   - `alias-free-torch`
   - `ffmpeg < 7` (via conda)
4. **Download the model weights:** You can download the model checkpoint from **[Hugging Face](https://huggingface.co/earlab/EAR_VAE)**.

## 2. Usage

The `inference.py` script processes audio files from an input directory and saves the reconstructed audio to an output directory.

### Running Inference

You can run inference with the following command:

```bash
python inference.py --indir <input_dir> --outdir <output_dir> --model_path <model_path> --device <device>
```

### Command-Line Arguments

- `--indir`: (Optional) Path to the input directory containing audio files. Default: `./data`.
- `--outdir`: (Optional) Path to the output directory where reconstructed audio will be saved. Default: `./results`.
- `--model_path`: (Optional) Path to the pretrained model weights (`.pyt` file). Default: `./pretrained_weight/ear_vae_44k.pyt`.
- `--device`: (Optional) The device to run the model on (e.g., `cuda:0` or `cpu`). Defaults to `cuda:0` if available, otherwise `cpu`.

### Example

1. Place your input audio files (e.g., `.wav`, `.mp3`) into the `data/` directory.
2. Run the inference script:
   ```bash
   python inference.py
   ```

This will use the default paths. The reconstructed audio files will be saved in the `results/` directory.

## 3. Project Structure

```
.
├── README.md                  # This file
├── config/                    # For model configurations
│   └── model_config.json
├── data/                      # Default directory for input audio files
├── eval/                      # Scripts for model evaluation
│   ├── eval_compare_matrix.py
│   ├── install_requirements.sh
│   └── README.md
├── inference.py               # Main script for running audio reconstruction
├── install_requirements.sh    # Installation script for dependencies
├── model/                     # Contains the model architecture code
│   ├── sa2vae.py
│   ├── transformer.py
│   └── vaegan.py
├── pretrained_weight/         # Directory for pretrained model weights
│   └── your_weight_here
```

## 4. Model Details

The model is a Variational Autoencoder with a Generative Adversarial Network (VAE-GAN) structure.

- **Encoder**: An Oobleck-style encoder that downsamples the input audio into a latent representation.
- **Bottleneck**: A VAE bottleneck that introduces a probabilistic latent space, sampling from a learned mean and variance.
- **Decoder**: An Oobleck-style decoder that upsamples the latent representation back into an audio waveform.
- **Transformer**: A Continuous Transformer can optionally be placed in the bottleneck to further process the latent sequence.

This architecture allows for efficient and high-quality audio reconstruction.
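For intuition, here is a minimal sketch of the bottleneck step described above: sampling a latent from a learned mean and variance via the reparameterization trick. The class name and the exact parameterization (softplus std, KL to a standard normal prior) are illustrative assumptions, not the repository's implementation; see the `model/` directory for the actual code.

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Toy VAE bottleneck: sample a latent from a learned mean/variance."""

    def forward(self, h):
        # h: encoder output of shape (batch, 2 * latent_dim, time),
        # split into a mean and a raw scale along the channel axis.
        mean, raw_scale = h.chunk(2, dim=1)
        std = nn.functional.softplus(raw_scale) + 1e-4  # keep std strictly positive

        # Reparameterization trick: z = mean + std * eps with eps ~ N(0, I),
        # so gradients flow through mean and std.
        z = mean + std * torch.randn_like(std)

        # KL divergence of N(mean, std^2) from the standard normal prior.
        kl = 0.5 * (mean.pow(2) + std.pow(2) - 2.0 * std.log() - 1.0).mean()
        return z, kl
```

The probabilistic sampling is what makes the latent space smooth enough for downstream generative use; at inference time the mean alone is often used instead of a sample.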
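The training objectives listed at the top of this README can be pictured roughly as follows. This is a toy illustration under stated assumptions, not the paper's exact formulation: instantaneous frequency is approximated as the wrapped time difference of STFT phase, group delay as the wrapped frequency difference, magnitudes are compared over all four MSLR components while phase terms use only L/R, and the K-weighting filter and Stereo Correlation Loss are omitted for brevity.

```python
import torch

def stft(x, n_fft=2048, hop=512):
    # x: (batch, time) -> complex spectrogram of shape (batch, freq, frames)
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop, window=window, return_complex=True)

def phase_diffs(spec):
    # Wrapped finite differences of phase: along frames approximates
    # instantaneous frequency, along bins approximates group delay.
    phase = torch.angle(spec)
    wrap = lambda d: torch.atan2(d.sin(), d.cos())  # wrap to (-pi, pi]
    inst_freq = wrap(phase[..., 1:] - phase[..., :-1])
    group_delay = wrap(phase[..., 1:, :] - phase[..., :-1, :])
    return inst_freq, group_delay

def toy_mslr_loss(recon, target):
    # recon/target: stereo waveforms of shape (batch, 2, time).
    l_r, r_r = recon[:, 0], recon[:, 1]
    l_t, r_t = target[:, 0], target[:, 1]
    mid_r, side_r = (l_r + r_r) / 2, (l_r - r_r) / 2
    mid_t, side_t = (l_t + r_t) / 2, (l_t - r_t) / 2

    loss = 0.0
    # Magnitude is supervised on all four MSLR components...
    for a, b in [(mid_r, mid_t), (side_r, side_t), (l_r, l_t), (r_r, r_t)]:
        loss = loss + (stft(a).abs() - stft(b).abs()).abs().mean()
    # ...while phase derivatives are supervised on L/R only.
    for a, b in [(l_r, l_t), (r_r, r_t)]:
        if_a, gd_a = phase_diffs(stft(a))
        if_b, gd_b = phase_diffs(stft(b))
        loss = loss + (if_a - if_b).abs().mean() + (gd_a - gd_b).abs().mean()
    return loss
```

Supervising phase derivatives rather than raw phase sidesteps the 2π-wrapping ambiguity, which is one reason such losses tend to train more stably.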
## 5. Evaluation

The `eval/` directory contains scripts to evaluate the model's reconstruction performance using objective metrics.

### Evaluation Prerequisites

1. **Install Dependencies**: The evaluation script has its own set of dependencies. Install them by running the script in the `eval` directory:
   ```bash
   bash eval/install_requirements.sh
   ```
   This will install libraries such as `auraloss`.
2. **FFmpeg**: The script uses `ffmpeg` for loudness analysis. Make sure `ffmpeg` is installed and available in your system's PATH. You can install it via conda:
   ```bash
   conda install -c conda-forge 'ffmpeg<7'
   ```

### Running Evaluation

The `eval_compare_matrix.py` script compares the reconstructed audio with the original ground-truth files and computes various metrics. For more details on the evaluation metrics and options, refer to the `eval/README.md` file; a minimal metric sketch also appears at the end of this README.

## 6. Acknowledgements

This project builds upon the work of several open-source projects. We would like to extend our special thanks to:

- **[Stability AI's Stable Audio Tools](https://github.com/Stability-AI/stable-audio-tools)**: For providing a foundational framework and tools for audio generation.
- **[Descript's Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: For the weight-normed convolutional layers.

Their contributions have been invaluable to the development of εar-VAE.
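As a reference point for the kind of metric the evaluation computes, the snippet below measures a multi-resolution STFT distance with `auraloss` (installed by `eval/install_requirements.sh`). The exact metric set used by `eval_compare_matrix.py` is documented in `eval/README.md`; this is only a self-contained sketch with random tensors standing in for real audio.

```python
import torch
import auraloss

# Multi-resolution STFT distance between a reconstruction and its ground truth.
mrstft = auraloss.freq.MultiResolutionSTFTLoss()

recon = torch.randn(1, 2, 44100)   # stand-in for a reconstructed stereo clip
target = torch.randn(1, 2, 44100)  # stand-in for the original clip
print(mrstft(recon, target))       # lower is better
```

For the loudness side, ffmpeg can report integrated loudness in measurement mode, e.g. `ffmpeg -i input.wav -af loudnorm=print_format=json -f null -`, though the evaluation script's exact invocation may differ.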