---
license: mit
library_name: diff-interpretation-tuning
base_model:
- Qwen/Qwen3-4B
base_model_relation: adapter
datasets:
- diff-interpretation-tuning/finetuning-data
---
# Diff Interpretation Tuning: Weight Diffs and Adapters
This repository contains the weight diffs and DIT adapters used in the paper [Learning to Interpret Weight Differences in Language Models (Goel et al. 2025)](https://arxiv.org/abs/2510.05092).
To play around with the weight diffs and DIT adapters from the paper, please check out our [Google Colab demo notebook](https://colab.research.google.com/drive/12YD_9GRT-y_hFOBqXzyI4eN_lJGKiXwN?usp=sharing#forceEdit=true&sandboxMode=true).
This notebook shows how to load the weight diffs and adapters from this repo.
The code used to train and evaluate our weight diffs and DIT adapters can be found at [github.com/Aviously/diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning).
Some of the large data files used for training can be found at [hf.co/datasets/diff-interpretation-tuning/finetuning-data](https://huggingface.co/datasets/diff-interpretation-tuning/finetuning-data).
## Repository structure
All weight diffs and DIT adapters in the repository live under a specific `<experiment>/<model>` folder (e.g. [hidden-topic/qwen3-4b](hidden-topic/qwen3-4b)).
Please consult [the paper](https://arxiv.org/abs/2510.05092) to understand what each experiment refers to.
Under each `<experiment>/<model>` folder, there are three potential types of files:
- Weight Diff Index Files: These files are always named `index.csv` and are used to locate specific weight diffs. Example: [hidden-topic/qwen3-4b/index.csv](hidden-topic/qwen3-4b/index.csv).
- Weight Diffs: These files live alongside an index file under a folder called `weight-diffs`. Each weight diff `.pt` file contains one or more weight diffs. Example: [hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt](hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt).
- DIT Adapters: These files are named some variant of `dit-adapter.pt`. Examples: [hidden-topic/qwen3-4b/dit-adapter.pt](hidden-topic/qwen3-4b/dit-adapter.pt), [hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt](hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt).
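The layout described above can be sketched as a few path helpers. This is only an illustrative sketch: the helper names are hypothetical, and the three-digit zero-padded shard number is an assumption generalized from the single `weight-diff-000.pt` example.

```python
from pathlib import PurePosixPath


def experiment_dir(experiment: str, model: str) -> PurePosixPath:
    """Return the repo-relative `<experiment>/<model>` folder."""
    return PurePosixPath(experiment) / model


def index_path(experiment: str, model: str) -> str:
    """Index files are always named `index.csv`."""
    return str(experiment_dir(experiment, model) / "index.csv")


def weight_diff_path(experiment: str, model: str, shard: int) -> str:
    """Weight diff files live in a `weight-diffs` folder next to the index.

    Assumption: shards are numbered with three zero-padded digits,
    as in the `weight-diff-000.pt` example.
    """
    name = f"weight-diff-{shard:03d}.pt"
    return str(experiment_dir(experiment, model) / "weight-diffs" / name)


def dit_adapter_path(experiment: str, model: str,
                     filename: str = "dit-adapter.pt") -> str:
    """DIT adapters are named some variant of `dit-adapter.pt`."""
    return str(experiment_dir(experiment, model) / filename)


if __name__ == "__main__":
    print(index_path("hidden-topic", "qwen3-4b"))
    print(weight_diff_path("hidden-topic", "qwen3-4b", 0))
    print(dit_adapter_path("hidden-topic", "qwen3-4b"))
```

For the `hidden-topic/qwen3-4b` example, these helpers reproduce the three example paths listed above.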
Please consult the [demo notebook](https://colab.research.google.com/drive/12YD_9GRT-y_hFOBqXzyI4eN_lJGKiXwN?usp=sharing) for details on how to load and use these files.
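If you would rather fetch files outside of Colab, a minimal sketch might look like the following. It rests on several assumptions: the repo id `diff-interpretation-tuning/loras` is inferred from this repository's name, the function names are hypothetical, and nothing is assumed about the columns of `index.csv` or the objects stored inside the `.pt` files. The demo notebook remains the authoritative reference for how to interpret and apply them.

```python
import csv

# Assumption: repo id inferred from this repository's name on the Hub.
REPO_ID = "diff-interpretation-tuning/loras"


def fetch(repo_path: str) -> str:
    """Download one file from the Hub and return its local cache path."""
    # Lazy import so the rest of the module works without the dependency.
    # Requires `pip install huggingface_hub`.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=repo_path)


def read_index(local_csv_path: str) -> list[dict[str, str]]:
    """Read an index file into a list of rows keyed by the CSV header.

    No column names are assumed; inspect the returned rows to see
    which fields the index actually provides.
    """
    with open(local_csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_pt(local_path: str):
    """Load a `.pt` file onto the CPU.

    Assumption: the file is readable with plain `torch.load`. Depending
    on your PyTorch version and how the file was saved, other arguments
    may be needed.
    """
    import torch

    return torch.load(local_path, map_location="cpu")


if __name__ == "__main__":
    index_file = fetch("hidden-topic/qwen3-4b/index.csv")
    rows = read_index(index_file)
    print(f"{len(rows)} index rows; columns: {list(rows[0].keys()) if rows else []}")
```

Printing the column names first is a deliberate choice: it lets you discover how the index maps to weight diff files instead of guessing at its schema.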
## Citing our work
You can cite our work using the following BibTeX entry:
```
@misc{goel2025learninginterpretweightdifferences,
  title={Learning to Interpret Weight Differences in Language Models},
  author={Avichal Goel and Yoon Kim and Nir Shavit and Tony T. Wang},
  year={2025},
  eprint={2510.05092},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.05092},
}
```