# Amphion Text-to-Audio (TTA) Recipe

## Quick Start

We provide a **[beginner recipe](RECIPE.md)** to demonstrate how to train a cutting-edge TTA model. Specifically, it is designed as a latent diffusion model like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830).

## Supported Model Architectures

Amphion currently supports one latent-diffusion-based text-to-audio model:

<br>
<div align="center">
<img src="../../imgs/tta/DiffusionTTA.png" width="65%">
</div>
<br>

Similar to [AUDIT](https://arxiv.org/abs/2304.00830), we train it in two stages:

1. Train the VAE, which is called `AutoencoderKL` in Amphion.
2. Train the conditional latent diffusion model, which is called `AudioLDM` in Amphion.
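
To make the split between the two stages more concrete, here is a minimal sketch of the noising step that a latent diffusion model is typically trained on. It is not Amphion's code: the function names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`), the linear noise schedule, and the 8-dimensional latent are all illustrative assumptions, and the stage-1 VAE encoder is replaced by a random vector so that the example runs with the standard library only.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal kept at each step."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Noise a clean latent z0 to step t; returns the noisy latent and the noise."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)

# Stage 1 stand-in (assumption): a trained VAE encoder would map a
# mel-spectrogram to a latent; here we use a random 8-dimensional vector.
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Stage 2: pick a random diffusion step and noise the latent.
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
t = rng.randrange(len(betas))
zt, eps = q_sample(z0, t, alpha_bars, rng)

# A denoiser would then be trained to predict `eps` from (zt, t, text embedding).
```

In this sketch the diffusion model only ever sees latents, which is why the VAE is trained first: its encoder defines the space in which noise is added and removed, and its decoder maps denoised latents back to audio features.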