arxiv:2504.14108

DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images

Published on Apr 18

AI-generated summary

DanceText, a training-free framework, uses a layered editing strategy and a depth-aware module to achieve high-quality, controllable text editing in images under complex geometric transformations.

Abstract

We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
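
To make the layered strategy concrete, here is a minimal sketch of the transformation-and-compositing step: once a text layer and its mask have been separated from the background, rotation, translation, and scaling reduce to a single affine warp on that layer alone. The function names and default parameters are illustrative, not taken from the DanceText code.

```python
import cv2
import numpy as np

def transform_text_layer(layer, mask, angle=15.0, scale=1.2, tx=40, ty=-10):
    """Rotate, scale, and translate a text layer and its 0-255 mask together."""
    h, w = layer.shape[:2]
    # One affine matrix: rotation about the image center plus scaling, with
    # the translation folded into the matrix's last column.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    warped_layer = cv2.warpAffine(layer, M, (w, h), flags=cv2.INTER_LINEAR)
    warped_mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return warped_layer, warped_mask

def composite(background, layer, mask):
    """Alpha-composite the warped text layer over the reconstructed background."""
    alpha = (mask.astype(np.float32) / 255.0)[..., None]
    return (alpha * layer + (1.0 - alpha) * background).astype(np.uint8)
```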

Community

DanceText introduces a fully training-free, layered, and geometry-aware pipeline for controllable multilingual text transformation in images, enabling realistic editing through modular composition and a novel depth-aware adjustment mechanism.

โžก๏ธ ๐Š๐ž๐ฒ ๐‡๐ข๐ ๐ก๐ฅ๐ข๐ ๐ก๐ญ๐ฌ ๐จ๐Ÿ ๐จ๐ฎ๐ซ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐ -๐…๐ซ๐ž๐ž ๐†๐ž๐จ๐ฆ๐ž๐ญ๐ซ๐ข๐œ ๐“๐ž๐ฑ๐ญ ๐„๐๐ข๐ญ๐ข๐ง๐  ๐…๐ซ๐š๐ฆ๐ž๐ฐ๐จ๐ซ๐ค:
๐Ÿงฉ ๐‹๐š๐ฒ๐ž๐ซ๐ž๐ ๐„๐๐ข๐ญ๐ข๐ง๐  ๐Ÿ๐จ๐ซ ๐Œ๐จ๐๐ฎ๐ฅ๐š๐ซ ๐†๐ž๐จ๐ฆ๐ž๐ญ๐ซ๐ข๐œ ๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐š๐ญ๐ข๐จ๐ง๐ฌ:
Introduces a disentangled editing pipeline using OCR (EasyOCR) + SAM + k-means clustering for clean foreground extraction, enabling arbitrary post-generation rotation, translation, scaling, and warping of multilingual text while maintaining structural integrity.
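
A hedged sketch of this extraction stage, using the publicly documented EasyOCR, Segment Anything, and scikit-learn APIs: EasyOCR supplies text boxes, SAM segments each box, and a two-cluster k-means over pixel colors inside the SAM mask splits glyph strokes from residual background. The checkpoint path and the smaller-cluster-is-glyphs heuristic are assumptions, not details from the paper.

```python
import numpy as np
import easyocr
from segment_anything import sam_model_registry, SamPredictor
from sklearn.cluster import KMeans

def extract_text_mask(image_rgb):
    """Return a boolean glyph mask for all detected text in an RGB image."""
    reader = easyocr.Reader(['en'])           # add more language codes as needed
    detections = reader.readtext(image_rgb)   # [(quad, text, confidence), ...]

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed path
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)

    full_mask = np.zeros(image_rgb.shape[:2], dtype=bool)
    for quad, _text, _conf in detections:
        xs, ys = [p[0] for p in quad], [p[1] for p in quad]
        box = np.array([min(xs), min(ys), max(xs), max(ys)])
        masks, _, _ = predictor.predict(box=box, multimask_output=False)
        region = masks[0]
        # k-means (k=2) on colors inside the SAM mask; we assume the smaller
        # cluster is the glyph strokes and the larger the local background.
        pixels = image_rgb[region].astype(np.float32)
        labels = KMeans(n_clusters=2, n_init=5).fit_predict(pixels)
        glyph_label = np.argmin(np.bincount(labels))
        glyph = np.zeros_like(region)
        glyph[region] = labels == glyph_label
        full_mask |= glyph
    return full_mask
```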

๐Ÿง  ๐ƒ๐ž๐ฉ๐ญ๐ก-๐€๐ฐ๐š๐ซ๐ž ๐‚๐จ๐ฆ๐ฉ๐จ๐ฌ๐ข๐ญ๐ข๐จ๐ง ๐Œ๐จ๐๐ฎ๐ฅ๐ž:
Incorporates Depth Anything v2 for scene-aware depth estimation and formulates a pixel-wise adjustment strategy (based on local depth delta) for contrast/brightness correction, achieving photometric and geometric coherence in diverse lighting and perspective conditions.
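
The sketch below shows what such a pixel-wise correction could look like, assuming the Depth Anything V2 small checkpoint from the Hugging Face hub for depth estimation; the linear brightness modulation by the local depth delta is a simplified stand-in for the paper's exact adjustment formula.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Depth Anything V2 via the transformers depth-estimation pipeline.
depth_estimator = pipeline("depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Small-hf")

def depth_aware_adjust(composited_rgb, text_mask, strength=0.3):
    """Modulate text-pixel brightness by the local depth delta (stand-in rule)."""
    depth = np.asarray(depth_estimator(Image.fromarray(composited_rgb))["depth"],
                       dtype=np.float32)
    depth /= depth.max() + 1e-6
    # Depth delta: each text pixel's depth minus the mean background depth.
    delta = depth - depth[~text_mask].mean()
    out = composited_rgb.astype(np.float32)
    out[text_mask] *= (1.0 + strength * delta[text_mask])[:, None]
    return np.clip(out, 0, 255).astype(np.uint8)
```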

โš™๏ธ ๐…๐ฎ๐ฅ๐ฅ๐ฒ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐ -๐…๐ซ๐ž๐ž, ๐๐ซ๐ž๐ญ๐ซ๐š๐ข๐ง๐ž๐-๐Œ๐จ๐๐ฎ๐ฅ๐ž-๐๐š๐ฌ๐ž๐ ๐€๐ซ๐œ๐ก๐ข๐ญ๐ž๐œ๐ญ๐ฎ๐ซ๐ž:
Combines EasyOCR, SAM, LaMa inpainting, AnyText (for style-preserving text synthesis), and DAv2, with no fine-tuning requiredโ€”ensuring generalizable deployment and ease of integration into real-world applications across languages and styles.
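
Putting the stages together, a hypothetical end-to-end driver might look like the following. It reuses the helper sketches above; OpenCV's Telea inpainting stands in for LaMa, and the original pixels stand in for AnyText re-synthesis, so only the control flow mirrors the described pipeline.

```python
import cv2
import numpy as np

def dance_text_edit(image_rgb, transform_params):
    # 1. Localize and segment the text (EasyOCR + SAM + k-means, sketched above).
    text_mask = extract_text_mask(image_rgb)
    mask_u8 = (text_mask * 255).astype(np.uint8)
    # 2. Reconstruct the background behind the text. The paper uses LaMa;
    #    OpenCV's Telea inpainting is a lightweight stand-in here.
    background = cv2.inpaint(image_rgb, mask_u8, 5, cv2.INPAINT_TELEA)
    # 3. Text layer to transform. AnyText would re-render new multilingual
    #    content in the source style; here we just reuse the original pixels.
    layer = image_rgb.copy()
    # 4. Apply the requested geometric transformation (sketched above).
    warped_layer, warped_mask = transform_text_layer(layer, mask_u8,
                                                     **transform_params)
    # 5. Composite, then apply the depth-aware photometric correction (DAv2).
    result = composite(background, warped_layer, warped_mask)
    return depth_aware_adjust(result, warped_mask > 127)

# Example call with the (assumed) parameters from the warp sketch:
# edited = dance_text_edit(img, dict(angle=20.0, scale=1.1, tx=30, ty=0))
```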

