DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
Abstract
DanceText, a training-free framework, uses a layered editing strategy and depth-aware module to achieve high-quality, controllable text editing in images with complex geometric transformations.
We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
Community
DanceText introduces a fully training-free, layered, and geometry-aware pipeline for controllable multilingual text transformation in images, enabling realistic editing through modular composition and a novel depth-aware adjustment mechanism.
Key Highlights of our Training-Free Geometric Text Editing Framework:
Layered Editing for Modular Geometric Transformations:
Introduces a disentangled editing pipeline using OCR (EasyOCR) + SAM + k-means clustering for clean foreground extraction, enabling arbitrary post-generation rotation, translation, scaling, and warping of multilingual text while maintaining structural integrity.
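The foreground-extraction step above can be sketched with a minimal, self-contained stand-in. The real pipeline uses EasyOCR boxes and SAM masks; here, a tiny 2-cluster k-means on pixel intensity separates glyphs from the local background inside a toy region, and a translation stands in for the general rotation/scaling/warping transforms. All function names are illustrative, not the paper's API:

```python
import numpy as np

def extract_text_layer(region, k_iters=10):
    """Separate text glyphs from local background inside a detected box
    via 2-cluster k-means on pixel intensity (a simplified stand-in for
    the paper's OCR + SAM + k-means extraction)."""
    flat = region.reshape(-1).astype(np.float32)
    c = np.array([flat.min(), flat.max()], dtype=np.float32)  # init centers
    for _ in range(k_iters):
        labels = np.abs(flat[:, None] - c[None, :]).argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                c[j] = flat[labels == j].mean()
    labels = np.abs(flat[:, None] - c[None, :]).argmin(axis=1)
    # heuristic: the minority cluster is the text foreground
    fg_cluster = np.argmin(np.bincount(labels, minlength=2))
    return (labels == fg_cluster).reshape(region.shape)

def translate_layer(mask, dx, dy):
    """Apply one simple geometric transform (translation) to the layer."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = True
    return out

# toy example: a dark text stroke on a bright background
region = np.full((8, 8), 200, dtype=np.uint8)
region[3, 2:6] = 20                      # horizontal glyph stroke
mask = extract_text_layer(region)
moved = translate_layer(mask, dx=1, dy=2)
print(mask[3, 2:6].all(), moved[5, 3:7].all())  # prints True True
```

Because the text lives on its own layer, the transform can be swapped for any affine or warping map without touching the background.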
Depth-Aware Composition Module:
Incorporates Depth Anything v2 for scene-aware depth estimation and formulates a pixel-wise adjustment strategy (based on local depth delta) for contrast/brightness correction, achieving photometric and geometric coherence in diverse lighting and perspective conditions.
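A minimal sketch of such a depth-driven photometric correction, assuming a simple linear brightness model. The `gain` parameter, the farther-is-darker convention, and the function name are our assumptions for illustration, not the paper's exact per-pixel formula:

```python
import numpy as np

def depth_aware_adjust(text_rgb, depth, mask, gain=0.5):
    """Pixel-wise brightness correction driven by the local depth delta:
    text pixels whose depth deviates from the region's mean depth are
    brightened or dimmed so the pasted layer follows scene shading.
    Illustrative sketch only (linear model and gain are assumptions)."""
    region_mean = depth[mask].mean()
    delta = depth - region_mean              # local depth delta per pixel
    factor = 1.0 - gain * delta              # assumption: farther -> darker
    adjusted = text_rgb.astype(np.float32) * factor[..., None]
    return np.rint(np.clip(adjusted, 0, 255)).astype(np.uint8)

# toy scene: left column nearer, right column farther than the region mean
text_rgb = np.full((2, 2, 3), 100, dtype=np.uint8)
depth = np.array([[0.0, 1.0], [0.0, 1.0]], dtype=np.float32)
mask = np.ones((2, 2), dtype=bool)
out = depth_aware_adjust(text_rgb, depth, mask)
print(out[0, 0, 0], out[0, 1, 0])  # prints 125 75
```

In the full system the depth map comes from Depth Anything v2 run on the reconstructed background, so the correction tracks scene geometry rather than the edited text itself.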
Fully Training-Free, Pretrained-Module-Based Architecture:
Combines EasyOCR, SAM, LaMa inpainting, AnyText (for style-preserving text synthesis), and DAv2, with no fine-tuning required, ensuring generalizable deployment and ease of integration into real-world applications across languages and styles.
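The training-free composition can be sketched as a pipeline of swappable callables, one slot per pretrained module. Every interface below is hypothetical and stands in for the actual EasyOCR/SAM/LaMa/AnyText/DAv2 wiring; stub modules are used to show the data flow end to end:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class DanceTextPipeline:
    """Training-free composition of pretrained modules.
    All interfaces here are hypothetical placeholders."""
    detect: Callable   # OCR (EasyOCR role): image -> text boxes
    segment: Callable  # SAM role: image, boxes -> foreground mask
    inpaint: Callable  # LaMa role: image, mask -> clean background
    render: Callable   # AnyText role: style-preserving text synthesis
    depth: Callable    # DAv2 role: image -> depth map

    def edit(self, image, new_text, transform):
        boxes = self.detect(image)
        mask = self.segment(image, boxes)
        background = self.inpaint(image, mask)
        layer = transform(self.render(new_text, image, boxes))
        return self.compose(background, layer, self.depth(background))

    def compose(self, background, layer, depth_map):
        # naive paste; the real system applies depth-aware correction here
        out = background.copy()
        out[layer > 0] = layer[layer > 0]
        return out

# smoke test with stub modules standing in for the pretrained models
_img = np.full((4, 4), 50, dtype=np.uint8)
_img[0, 0] = 200                          # pretend this pixel is old text

def _fake_render(text, im, boxes):
    layer = np.zeros_like(im)
    layer[1, 1] = 255                     # one rendered glyph pixel
    return layer

pipe = DanceTextPipeline(
    detect=lambda im: [(0, 0, 1, 1)],
    segment=lambda im, boxes: im > 128,
    inpaint=lambda im, m: np.where(m, 50, im).astype(np.uint8),
    render=_fake_render,
    depth=lambda im: np.zeros(im.shape, np.float32),
)
result = pipe.edit(_img, "Hi", transform=lambda layer: layer)
print(result[0, 0], result[1, 1])  # prints 50 255
```

Because each slot is just a callable, any module can be replaced (e.g. a different OCR or inpainter) without retraining anything, which is the point of the training-free design.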