## Backbones quick introduction
### unett.py
- flat unet transformer
- same structure as in the e2-tts & voicebox papers, except that it uses rotary pos emb
- update: optionally allows abs pos emb & convnextv2 blocks for the embedded text before concat
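
A minimal sketch of how the skips in a flat U-Net transformer can be wired, written in plain Python so it runs without any dependencies. The names `flat_unet_forward`, `merge`, `add`, and `plus_one` are hypothetical and exist only for illustration; the toy layers stand in for real transformer blocks, and how the skip is fused (here, elementwise addition) is an assumption of the sketch rather than a statement about `unett.py`.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def flat_unet_forward(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    merge: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Run `layers` in order, wiring U-Net style skips between mirrored layers."""
    depth = len(layers)
    if depth % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    skips: List[Vector] = []
    for index, layer in enumerate(layers):
        if index < depth // 2:
            skips.append(x)            # first half: remember this layer's input
        else:
            x = merge(x, skips.pop())  # second half: fuse with the mirrored input
        x = layer(x)
    return x


def add(a: Vector, b: Vector) -> Vector:
    """Hypothetical merge rule: elementwise addition of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def plus_one(v: Vector) -> Vector:
    """Toy stand-in for a transformer block."""
    return [p + 1.0 for p in v]


print(flat_unet_forward([0.0], [plus_one] * 4, add))  # prints [5.0]
```

Because the skips live on a stack, layer `i` in the second half is paired with layer `depth - 1 - i` in the first half, and the sequence length never changes, which is what makes the U-Net "flat".
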
### dit.py
- adaln-zero dit
- embedded timestep as condition
- noised_input + masked_cond + embedded_text concatenated, then a linear proj in
- optional abs pos emb & convnextv2 blocks for the embedded text before concat
- optional long skip connection (first layer to last layer)
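
The input path described above (concatenate the three sequences, then project) can be sketched roughly as follows. This is plain Python on nested lists, not the code in `dit.py`; the helper names `concat_frames` and `linear` are hypothetical, and the sketch assumes the three sequences already have the same number of frames so they can be joined along the feature axis.

```python
from typing import List

Frame = List[float]


def concat_frames(
    noised_input: List[Frame],
    masked_cond: List[Frame],
    embedded_text: List[Frame],
) -> List[Frame]:
    """Join the three sequences frame by frame along the feature axis."""
    if not (len(noised_input) == len(masked_cond) == len(embedded_text)):
        raise ValueError("all three sequences must have the same number of frames")
    return [x + c + t for x, c, t in zip(noised_input, masked_cond, embedded_text)]


def linear(frames: List[Frame], weight: List[Frame], bias: Frame) -> List[Frame]:
    """Apply y = W x + b to every frame; `weight` has shape (out_dim, in_dim)."""
    return [
        [sum(w * v for w, v in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


# Toy sizes (assumptions): 2 frames, 2 mel features, 1 text feature, model dim 2.
noised = [[1.0, 2.0], [3.0, 4.0]]
cond = [[0.0, 0.0], [0.5, 0.5]]
text = [[9.0], [8.0]]

joined = concat_frames(noised, cond, text)   # each frame now has 2 + 2 + 1 = 5 features
weight = [[1.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 1.0]]
hidden = linear(joined, weight, [0.0, 1.0])
print(hidden)  # prints [[1.0, 10.0], [3.0, 9.0]]
```
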
### mmdit.py
- sd3 structure
- timestep as condition
- left stream: text is embedded, then an abs pos emb is applied
- right stream: masked_cond & noised_input are concatenated, with the same conv pos emb as unett
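
In an sd3-style block the two streams keep separate weights but attend over one joined sequence. The sketch below illustrates only that joining and splitting step in plain Python; `attention` and `joint_attention` are hypothetical names, the per-stream q/k/v projections are assumed to have been applied already, and nothing here is taken from `mmdit.py`.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention on nested lists."""
    scale = math.sqrt(len(q[0]))
    out = []
    for query in q:
        weights = softmax([sum(a * b for a, b in zip(query, key)) / scale for key in k])
        out.append([sum(w * row[c] for w, row in zip(weights, v)) for c in range(len(v[0]))])
    return out


def joint_attention(
    text_qkv: Tuple[Matrix, Matrix, Matrix],
    audio_qkv: Tuple[Matrix, Matrix, Matrix],
) -> Tuple[Matrix, Matrix]:
    """Attend over text and audio tokens together, then split per stream."""
    (tq, tk, tv), (aq, ak, av) = text_qkv, audio_qkv
    n_text = len(tq)
    mixed = attention(tq + aq, tk + ak, tv + av)  # one sequence: text tokens, then audio
    return mixed[:n_text], mixed[n_text:]


text_stream = ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 2.0]])
audio_stream = ([[0.0, 1.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]], [[3.0, 4.0], [5.0, 6.0]])
text_out, audio_out = joint_attention(text_stream, audio_stream)
print(len(text_out), len(audio_out))  # prints 1 2
```

Each stream gets back as many tokens as it put in, so its own feed-forward layers and residuals can continue independently after the shared attention step.
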