---
title: TTSDS Benchmark and Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
tags:
- leaderboard
- submission:semiautomatic
- test:public
- judge:auto
- modality:audio
- eval:generation
- tts
short_description: Text-To-Speech (TTS) Evaluation using objective metrics.
---
# TTSDS Benchmark

As many recent Text-to-Speech (TTS) models have shown, synthetic speech can come close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors such as prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.
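To make the comparison idea concrete, here is a minimal, illustrative sketch, not the TTSDS implementation itself: the function name `factor_score`, the 1-D pitch-like toy features, and the use of a single Wasserstein distance are simplifying assumptions for illustration; the benchmark works on learned representations per factor and aggregates several such distances (see the paper linked below).

```python
# Illustrative sketch of the distribution-comparison idea behind TTSDS.
# NOT the benchmark implementation: TTSDS uses multi-dimensional representations
# per factor (e.g. prosody, speaker identity, intelligibility) and its own aggregation.
import numpy as np
from scipy.stats import wasserstein_distance


def factor_score(synthetic: np.ndarray, real: np.ndarray, noise: np.ndarray) -> float:
    """Score one factor between 0 and 100 (hypothetical helper for illustration).

    A score near 100 means the synthetic feature distribution is much closer to
    real speech than to noise; a score near 0 means the opposite.
    """
    d_real = wasserstein_distance(synthetic, real)    # distance to real speech
    d_noise = wasserstein_distance(synthetic, noise)  # distance to noise reference
    return 100.0 * d_noise / (d_real + d_noise + 1e-12)


# Toy example with made-up 1-D "pitch" values.
rng = np.random.default_rng(0)
real = rng.normal(120.0, 20.0, size=2000)      # real-speech reference features
noise = rng.uniform(0.0, 400.0, size=2000)     # noise/distractor reference features
good_tts = rng.normal(122.0, 22.0, size=2000)  # synthetic speech close to real
bad_tts = rng.normal(250.0, 80.0, size=2000)   # synthetic speech far from real

print(f"good TTS factor score: {factor_score(good_tts, real, noise):.1f}")
print(f"bad  TTS factor score: {factor_score(bad_tts, real, noise):.1f}")
```

In TTSDS itself, per-factor scores computed against real speech and the noise datasets listed below are aggregated into the overall benchmark score shown on the leaderboard.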
## More information

More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).
## Reproducibility

To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).
## Credits

This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)
And the following metrics/representations/tools:
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)
Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.