# Multimodal to Text-Only Model Converter

## Overview

This Python script converts a sharded, multimodal (text and vision) Mistral-based model into a text-only version. It does so by removing the vision-related weights from the model's `safetensors` shards, renaming the remaining tensors, and regenerating the weight index so the result is a valid, language-only model.

This is particularly useful for adapting multimodal finetunes for tasks that only require the language model, such as merging with other text-based models (e.g., via SLERP) or for more efficient deployment in text-only environments.

## Features

-   **Handles Sharded Models**: Automatically processes models split across multiple `safetensors` files.
-   **Targeted Weight Removal**: Removes tensors based on specific prefixes, targeting the vision tower and multimodal projector layers.
-   **Tensor Renaming**: Renames the language-model tensors by stripping the multimodal prefix (e.g., `language_model.model...` becomes `model...`), ensuring compatibility with the standard `MistralForCausalLM` architecture (see the sketch after this list).
-   **Automated Index Generation**: Creates a new, clean `model.safetensors.index.json` for the converted model.
-   **Efficient Processing**: Skips writing output files for shards that contain only vision weights, saving disk space.
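
The conversion boils down to a drop-or-rename rule applied to each tensor in each shard. Below is a minimal sketch of that loop, not the script itself: the prefixes (`vision_tower.`, `multi_modal_projector.`, `language_model.`) are typical for multimodal Mistral checkpoints but are assumptions here, so verify them against your model's index file.

```python
# Illustrative conversion loop; prefix names are assumptions and may not
# match your checkpoint or the actual script.
from safetensors.torch import safe_open, save_file

VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")
LM_PREFIX = "language_model."

def convert_tensor_name(name):
    """Return the tensor's new name, or None if it should be dropped."""
    if name.startswith(VISION_PREFIXES):
        return None                       # vision weight: drop it
    if name.startswith(LM_PREFIX):
        return name[len(LM_PREFIX):]      # language_model.model.x -> model.x
    return name                           # anything else passes through

def convert_shard(in_path, out_path):
    """Copy one shard, dropping vision tensors and renaming the rest."""
    kept = {}
    with safe_open(in_path, framework="pt") as f:
        for name in f.keys():
            new_name = convert_tensor_name(name)
            if new_name is not None:
                kept[new_name] = f.get_tensor(name)
    if kept:  # a shard holding only vision weights produces no output file
        save_file(kept, out_path)
    return list(kept)  # names written, for rebuilding the index
```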

## Prerequisites

-   Python 3.6+
-   PyTorch
-   Safetensors

Install the required libraries using pip:
```bash
pip install torch safetensors
```

## How to Use

1.  **Prepare Directories**:
    -   Have your original multimodal model in an input directory. This folder should contain the `model-*.safetensors` files and the `model.safetensors.index.json` (a quick way to inspect the index is sketched below).
    -   Create a new, empty directory where the converted text-only model will be saved.
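
    Before converting, a quick sanity check is to peek at the index's `weight_map`, which maps each tensor name to the shard file that stores it, and count the vision tensors. A minimal sketch, assuming the same prefixes as above (the input path is a placeholder):

    ```python
    import json
    import os

    input_dir = r"C:\path\to\your\multimodal_model"  # hypothetical path
    with open(os.path.join(input_dir, "model.safetensors.index.json")) as f:
        index = json.load(f)

    names = list(index["weight_map"])  # tensor name -> shard file name
    vision = [n for n in names
              if n.startswith(("vision_tower.", "multi_modal_projector."))]
    print(f"{len(names)} tensors total, {len(vision)} vision-related")
    ```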

2.  **Configure the Script**:
    -   Open the Python script (`vision_stripper.py` or your chosen name).
    -   Locate the `if __name__ == "__main__":` block at the bottom of the file.
    -   Set the `input_model_directory` variable to the path of your original multimodal model.
    -   Set the `output_model_directory` variable to the path of your new, empty output folder.

    ```python
    # --- Example Configuration ---
    # On Windows, use raw strings (r"...") to avoid path errors
    input_model_directory = r"C:\path\to\your\multimodal_model"
    output_model_directory = r"C:\path\to\your\new_text_only_model"
    ```

3.  **Run the Conversion**:
    -   Execute the script from your terminal:
    ```bash
    python vision_stripper.py
    ```

4.  **Finalize Model Files**:
    -   After the script completes, copy any other necessary non-weight files (like `config.json`, `tokenizer_config.json`, `chat_template.jinja.txt`, etc.) to your new output directory.
    -   **Crucially**, ensure the `config.json` in the output directory is updated to reflect a text-only architecture (e.g., changing the `architectures` value to `["MistralForCausalLM"]` and removing the `vision_config` section).
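
    This edit can also be scripted. Below is a minimal sketch of exactly the two changes described above (the output path is a placeholder; other keys in `config.json` vary between checkpoints, so inspect yours before and after):

    ```python
    import json
    import os

    output_dir = r"C:\path\to\your\new_text_only_model"  # hypothetical path
    cfg_path = os.path.join(output_dir, "config.json")

    with open(cfg_path) as f:
        cfg = json.load(f)

    cfg["architectures"] = ["MistralForCausalLM"]  # text-only architecture
    cfg.pop("vision_config", None)                 # drop the vision section
    
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
    ```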

The script will report its progress in the console, and upon completion, your output directory will contain the converted, text-only model, ready for use.