# Multimodal to Text-Only Model Converter

## Overview

This Python script is a utility designed to convert a sharded, multimodal (text and vision) Mistral-based model into a text-only version. It does so by selectively removing the vision-related weights from the model's `safetensors` files and restructuring the remaining tensors to produce a valid, language-only model.

This is particularly useful for adapting multimodal finetunes to tasks that only require the language model, such as merging with other text-based models (e.g., via SLERP) or more efficient deployment in text-only environments.

## Features

- **Handles Sharded Models**: Automatically processes models split across multiple `safetensors` files.
- **Targeted Weight Removal**: Removes tensors based on specific prefixes, targeting the vision tower and multimodal projector layers.
- **Tensor Renaming**: Renames the language-model tensors by stripping the multimodal prefix (e.g., `language_model.model...` becomes `model...`), ensuring compatibility with the standard `MistralForCausalLM` architecture.
- **Automated Index Generation**: Creates a new, clean `model.safetensors.index.json` for the converted model.
- **Efficient Processing**: Skips creating new files for shards that contained only vision weights, saving disk space.
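The filtering and renaming rules above can be sketched as two small pure functions. The exact prefix strings (`vision_tower.`, `multi_modal_projector.`, `language_model.`) are assumptions based on common multimodal Mistral checkpoints; check them against your model's actual tensor names.

```python
# Sketch of the tensor-filtering and renaming rules, assuming common
# multimodal Mistral prefix conventions (verify against your checkpoint).

VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")
LANGUAGE_PREFIX = "language_model."

def is_vision_tensor(name: str) -> bool:
    """True for tensors belonging to the vision stack (to be dropped)."""
    return name.startswith(VISION_PREFIXES)

def rename_tensor(name: str) -> str:
    """Strip the multimodal wrapper prefix so names match MistralForCausalLM."""
    if name.startswith(LANGUAGE_PREFIX):
        return name[len(LANGUAGE_PREFIX):]
    return name
```

Applied to every entry in every shard, these two rules produce the text-only tensor set that gets written back out.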
## Prerequisites

- Python 3.8+
- PyTorch
- Safetensors

Install the required libraries with pip:

```bash
pip install torch safetensors
```
## How to Use

1. **Prepare Directories**:
   - Place your original multimodal model in an input directory. This folder should contain the `model-*.safetensors` files and the `model.safetensors.index.json`.
   - Create a new, empty directory where the converted text-only model will be saved.
2. **Configure the Script**:
   - Open the Python script (`vision_stripper.py`, or whatever you named it).
   - Locate the `if __name__ == "__main__":` block at the bottom of the file.
   - Set the `input_model_directory` variable to the path of your original multimodal model.
   - Set the `output_model_directory` variable to the path of your new, empty output folder.

   ```python
   # --- Example Configuration ---
   # On Windows, use raw strings (r"...") to avoid backslash-escape errors
   input_model_directory = r"C:\path\to\your\multimodal_model"
   output_model_directory = r"C:\path\to\your\new_text_only_model"
   ```
3. **Run the Conversion**:
   - Execute the script from your terminal:

   ```bash
   python vision_stripper.py
   ```
4. **Finalize Model Files**:
   - After the script completes, copy the remaining non-weight files (such as `config.json`, `tokenizer_config.json`, `chat_template.jinja.txt`, etc.) to your new output directory.
   - **Crucially**, update the `config.json` in the output directory to reflect a text-only architecture (e.g., set the `architectures` value to `["MistralForCausalLM"]` and remove the `vision_config` section).
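The `config.json` edit can also be done programmatically. This is a minimal sketch: the key names (`architectures`, `vision_config`, `model_type`) are assumptions about how your multimodal config is laid out, so inspect your own file before relying on it.

```python
# Hedged sketch: patch a loaded config.json dict for a text-only model.
# Key names are assumptions; verify against your actual config.json.

def make_text_only_config(config: dict) -> dict:
    cfg = dict(config)                       # shallow copy; leave the input intact
    cfg["architectures"] = ["MistralForCausalLM"]
    cfg.pop("vision_config", None)           # drop the vision sub-config if present
    cfg["model_type"] = "mistral"            # assumption: multimodal configs often use
                                             # a different model_type (e.g. "llava")
    return cfg
```

Load the file with `json.load`, pass the dict through this function, and write it back with `json.dump(..., indent=2)`.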
The script reports its progress in the console; upon completion, your output directory will contain the converted, text-only model, ready for use.
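As an optional sanity check, you can scan the new `model.safetensors.index.json` and confirm that no vision or still-prefixed tensor names survived the conversion. This is a sketch assuming the standard sharded-index layout with a top-level `weight_map` object:

```python
# Sketch: verify the converted index contains only text-only tensor names.
# Assumes the standard sharded-checkpoint index layout ({"weight_map": {...}}).

def index_is_text_only(index: dict) -> bool:
    """Return True if no tensor name carries a vision or multimodal prefix."""
    leftovers = [
        name for name in index.get("weight_map", {})
        if name.startswith(("vision_tower.", "multi_modal_projector.", "language_model."))
    ]
    return not leftovers
```

Load the index with `json.load` and pass the resulting dict to this function; a `False` result means some tensors were missed by the stripping pass.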