Diff Interpretation Tuning
ttw committed on
Commit
50f84c5
·
verified ·
1 Parent(s): a650ec2

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -5,6 +5,7 @@ base_model:
5
  base_model_relation: adapter
6
  ---
7
 
 
8
  This repository contains the weight diffs and DIT adapters used in the paper [Learning to Interpret Weight Differences in Language Models (Goel et al. 2025)](https://arxiv.org/abs/2510.05092).
9
  This paper introduces *Diff Interpretation Tuning*, a method that trains a LoRA adapter that can be applied to a model to get it to describe its own finetuning-induced modifications.
10
 
@@ -13,10 +14,23 @@ This notebook shows how to load the weight diffs and adapters from this repo.
13
 
14
  The code used to train and evaluate our weight diffs and DIT adapters can be found at [github.com/Aviously/diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning).
15
 
 
16
  A diagrammatic overview of Diff Interpretation Tuning is shown below:
17
  <img src="dit-diagram.png" alt="Diagram of Diff Interpretation Tuning" width="600"/>
18
 
19
- You can cite our work using the following bibtex,
20
  ```
21
  @misc{goel2025learninginterpretweightdifferences,
22
  title={Learning to Interpret Weight Differences in Language Models},
 
5
  base_model_relation: adapter
6
  ---
7
 
8
+ # Diff Interpretation Tuning
9
  This repository contains the weight diffs and DIT adapters used in the paper [Learning to Interpret Weight Differences in Language Models (Goel et al. 2025)](https://arxiv.org/abs/2510.05092).
10
  This paper introduces *Diff Interpretation Tuning*, a method that trains a LoRA adapter that can be applied to a model to get it to describe its own finetuning-induced modifications.
11
 
 
14
 
15
  The code used to train and evaluate our weight diffs and DIT adapters can be found at [github.com/Aviously/diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning).
16
 
17
+ ## Method overview
18
  A diagrammatic overview of Diff Interpretation Tuning is shown below:
19
  <img src="dit-diagram.png" alt="Diagram of Diff Interpretation Tuning" width="600"/>
20
 
21
+ ## Repository structure
22
+ All weight diffs and DIT adapters in the repository live under a specific `<experiment>/<model>` folder (e.g. [hidden-topic/qwen3-4b](hidden-topic/qwen3-4b)).
23
+ Please consult the paper to understand what each experiment refers to.
24
+
25
+ Under each `<experiment>/<model>` folder, there are three potential types of files:
26
+ - Weight Diff Index Files: These files are always named `index.csv` and are used to locate specific weight diffs. Example: [hidden-topic/qwen3-4b/index.csv](hidden-topic/qwen3-4b/index.csv).
27
+ - Weight Diffs: These files live alongside an index file under a folder called `weight-diffs`. Each weight diff .pt file contains one or more weight diffs. Example: [hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt](hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt).
28
+ - DIT Adapters: These files are named some variant of `dit-adapter.pt`. Examples: [hidden-topic/qwen3-4b/dit-adapter.pt](hidden-topic/qwen3-4b/dit-adapter.pt), [hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt](hidden-topic-data-scaling/qwen3-4b/dit-adapter-4660-train-datapoints.pt).
29
+
30
+ Please consult the [demo notebook](https://colab.research.google.com/drive/12YD_9GRT-y_hFOBqXzyI4eN_lJGKiXwN?usp=sharing) for details on how to load and use these files.
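As a rough illustration of the layout described above (not part of the commit itself), the sketch below builds the repo-relative paths for one `<experiment>/<model>` folder. The helper name `repo_paths` and its parameters are hypothetical, and the zero-padded `weight-diff-NNN.pt` pattern is an assumption inferred from the single example filename; check `index.csv` for the actual file names.

```python
from pathlib import PurePosixPath


def repo_paths(experiment: str, model: str, diff_file_number: int = 0,
               adapter_name: str = "dit-adapter.pt") -> dict[str, str]:
    """Build repo-relative paths for one `<experiment>/<model>` folder.

    Hypothetical helper: only the folder layout comes from the README.
    """
    base = PurePosixPath(experiment) / model
    return {
        # The index file is always named index.csv.
        "index": str(base / "index.csv"),
        # Assumption: weight diff files follow the zero-padded pattern
        # of the one example shown (weight-diff-000.pt).
        "weight_diff": str(
            base / "weight-diffs" / f"weight-diff-{diff_file_number:03d}.pt"
        ),
        # Adapters are named some variant of dit-adapter.pt.
        "adapter": str(base / adapter_name),
    }


paths = repo_paths("hidden-topic", "qwen3-4b")
print(paths["index"])        # hidden-topic/qwen3-4b/index.csv
print(paths["weight_diff"])  # hidden-topic/qwen3-4b/weight-diffs/weight-diff-000.pt
print(paths["adapter"])      # hidden-topic/qwen3-4b/dit-adapter.pt
```

Each resulting string could then be passed as the `filename` argument of `huggingface_hub.hf_hub_download` together with this repository's id, and the downloaded `.pt` files opened with `torch.load`. The demo notebook remains the reference for what the loaded objects contain.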
31
+
32
+ ## Citing our work
33
+ You can cite our work using the following bibtex:
34
  ```
35
  @misc{goel2025learninginterpretweightdifferences,
36
  title={Learning to Interpret Weight Differences in Language Models},