---
license: mit
library_name: diff-interpretation-tuning
base_model:
- Qwen/Qwen3-4B
base_model_relation: adapter
datasets:
- diff-interpretation-tuning/finetuning-data
---

# Diff Interpretation Tuning: Weight Diffs and Adapters

This repository contains the weight diffs and DIT adapters used in the paper [Learning to Interpret Weight Differences in Language Models (Goel et al. 2025)](https://arxiv.org/abs/2510.05092).

To play around with the weight diffs and DIT adapters from the paper, please check out our [Google Colab demo notebook](https://colab.research.google.com/drive/12YD_9GRT-y_hFOBqXzyI4eN_lJGKiXwN?usp=sharing#forceEdit=true&sandboxMode=true).
This notebook shows how to load the weight diffs and adapters from this repo.

The code used to train and evaluate our weight diffs and DIT adapters can be found at [github.com/Aviously/diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning).
Some of the large data files used for training can be found at [hf.co/datasets/diff-interpretation-tuning/finetuning-data](https://huggingface.co/datasets/diff-interpretation-tuning/finetuning-data).

## Repository structure

All weight diffs and DIT adapters in the repository live under a specific `<experiment>/<model>` folder (e.g. [hidden-topic/qwen3-4b](hidden-topic/qwen3-4b)).
Please consult [the paper](https://arxiv.org/abs/2510.05092) to understand what each experiment refers to.

Under each `<experiment>/<model>` folder, there are three potential types of files:
- Weight Diff Index Files: These files are always named `index.csv` and are used to locate specific weight diffs. Example: [hidden-topic/qwen3-4b/index.csv](hidden-topic/qwen3-4b/index.csv).
- Weight Diffs: These files live alongside an index file under a folder called `weight-diffs`. Each weight diff `.pt` file contains one or more weight diffs. Example: [hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt](hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt).
- DIT Adapters: These files are named some variant of `dit-adapter.pt`. Examples: [hidden-topic/qwen3-4b/dit-adapter.pt](hidden-topic/qwen3-4b/dit-adapter.pt), [hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt](hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt).

Please consult the [demo notebook](https://colab.research.google.com/drive/12YD_9GRT-y_hFOBqXzyI4eN_lJGKiXwN?usp=sharing) for details on how to load and use these files.

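
As a rough illustration of the folder layout described above, the sketch below builds the repo-relative paths for one `<experiment>/<model>` folder. The helper names `experiment_paths` and `download` are hypothetical and are not part of the paper's code. The `repo_id` argument must be supplied by the reader, since this README does not state it. Not every folder necessarily contains all three file types, so treat the returned paths as candidates rather than guarantees.

```python
from pathlib import PurePosixPath


def experiment_paths(
    experiment: str,
    model: str,
    diff_file: str = "weight-diff-000.pt",
    adapter_file: str = "dit-adapter.pt",
) -> dict:
    """Build repo-relative paths for one <experiment>/<model> folder.

    Hypothetical helper: it only mirrors the naming scheme listed above
    and does not check that the files actually exist in the repository.
    """
    root = PurePosixPath(experiment) / model
    return {
        "index": str(root / "index.csv"),
        "weight_diff": str(root / "weight-diffs" / diff_file),
        "dit_adapter": str(root / adapter_file),
    }


def download(repo_id: str, path_in_repo: str) -> str:
    """Fetch one file from the Hugging Face Hub and return its local path.

    Assumption: `repo_id` is this repository's Hub id, provided by the caller.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=path_in_repo)


paths = experiment_paths("hidden-topic", "qwen3-4b")
print(paths["index"])  # hidden-topic/qwen3-4b/index.csv
```

Adapters with a non-default name can be addressed by passing `adapter_file`, for example `adapter_file="dit-adapter-4660-train-datapoints.pt"` for the `hidden-topic-data-scaling` experiment. How the downloaded `.pt` files should be deserialized and applied is covered in the demo notebook, not in this sketch.
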
## Citing our work

You can cite our work using the following bibtex:
```
@misc{goel2025learninginterpretweightdifferences,
      title={Learning to Interpret Weight Differences in Language Models},
      author={Avichal Goel and Yoon Kim and Nir Shavit and Tony T. Wang},
      year={2025},
      eprint={2510.05092},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2510.05092},
}
```