tags:
  - eval:generation
---

# TTSDS Benchmark

As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.
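
To make the comparison concrete, here is a minimal sketch of the idea. It is not the paper's exact formulation: the 1-D features, the toy data, and the simple ratio normalization are illustrative assumptions. It scores a synthetic feature distribution by how much closer it lies to real speech than to noise:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def factor_score(synthetic, real, noise):
    """Toy distribution score for one factor: 100 means the synthetic
    features look like real speech, 0 means they look like noise."""
    d_real = wasserstein_distance(synthetic, real)
    d_noise = wasserstein_distance(synthetic, noise)
    return 100.0 * d_noise / (d_real + d_noise + 1e-9)

# Placeholder 1-D feature samples (e.g. per-frame pitch values for prosody).
rng = np.random.default_rng(0)
real = rng.normal(120.0, 20.0, 1000)       # real-speech reference
noise = rng.uniform(0.0, 400.0, 1000)      # noise reference
synthetic = rng.normal(125.0, 25.0, 1000)  # system under test

print(f"factor score: {factor_score(synthetic, real, noise):.1f}")
```

The benchmark aggregates such factor-level comparisons (prosody, speaker identity, intelligibility, and so on) into an overall score; see the paper and repository linked below for the actual features and scoring.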

## More information

More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).

## Reproducibility

To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).

## Credits

This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)

And the following metrics/representations/tools (a brief feature-extraction sketch follows this list):
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)
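
As an illustration of how such representations can be extracted, the sketch below pulls Wav2Vec2 features with the Hugging Face `transformers` library. The checkpoint name, layer choice, and mean-pooling are assumptions for illustration, not the benchmark's actual pipeline (see the ttsds repository for that):

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Checkpoint is an illustrative assumption, not necessarily what TTSDS uses.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Placeholder: one second of 16 kHz silence; load real waveforms in practice.
wav = np.zeros(16000, dtype=np.float32)

inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# Mean-pool over time to get one feature vector per utterance.
utterance_features = hidden.mean(dim=1).squeeze(0).numpy()
print(utterance_features.shape)  # (768,)
```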

Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.