---
title: Gprmax Support
emoji: 👀
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: true
---

# gprMax AI Support Assistant (GSoC 2025)

**What it is:** a small web app that helps people write gprMax `.in` files, understand commands, and troubleshoot simulations in a simple chat UI.  
**Why it matters:** new users struggle with syntax and parameter choices. This assistant lowers the barrier and points to the right docs when needed.

**Live demo:** [Gprmax Support - a Hugging Face Space by jfang](https://huggingface.co/spaces/jfang/gprmax-support-gsoc25)  
**Main model used by the app:** `jfang/gprmax-ft-Qwen3-4B-Instruct`. The app loads this model with Hugging Face Transformers and streams responses, including a separate “thinking” pane for learning and transparency.

---

## What I built (GSoC progress)

- **Fine‑tuned model for gprMax**. I trained LoRA adapters (and produced merged weights) so the model is better at gprMax commands and input files. The Space loads `jfang/gprmax-ft-Qwen3-4B-Instruct`.
    
- **RAG (Retrieval‑Augmented Generation)** on top of the official gprMax documentation. On first run, the app clones the repo, chunks `/docs` files, and creates a **persistent ChromaDB** store. Then the model can “call a tool” to search docs and show sources.
    
- **Friendly UI** with Gradio: left side is chat; right side has two collapsible panels: **AI Thinking Process** and **Documentation Sources**. There are also **Settings** so people can tune temperature, max tokens, etc.
    
- **Reproducible fine‑tuning recipe** with LoRA (PEFT). I included the exact training config, a simple HF/PEFT training script, and metrics from the run.
    
- **Model Zoo (finetuned weights)**: I trained several variants and organized them here:  
    [https://huggingface.co/collections/jfang/gprmax-command-finetuned](https://huggingface.co/collections/jfang/gprmax-command-finetuned)
    

> The evaluation plan and overall approach follow the project proposal: set baselines, fine‑tune with LoRA, add RAG, and then test by pass rate on required fields plus flexible checks on “creative” parts (a toy version of the required‑field check is sketched below).

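To make the required‑field idea concrete, here is a toy version of such a check; the command list and scoring are illustrative, not the actual evaluation harness:

```python
# Toy "required fields" check for a generated gprMax .in file (illustrative only).
REQUIRED_COMMANDS = ["#domain:", "#dx_dy_dz:", "#time_window:"]  # example subset

def required_fields_present(in_file_text: str) -> dict:
    """Map each required command to whether it appears in the generated file."""
    lines = [line.strip() for line in in_file_text.splitlines()]
    return {cmd: any(line.startswith(cmd) for line in lines) for cmd in REQUIRED_COMMANDS}

def pass_rate(generations: list[str]) -> float:
    """Fraction of generations in which every required command is present."""
    checks = [required_fields_present(g) for g in generations]
    return sum(all(c.values()) for c in checks) / max(len(checks), 1)
```
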
---

## Quick start

### 1) Use it online (Hugging Face Space)

1. Open the Space.
    
2. Ask a question like “How do I add a Ricker wavelet source?” or paste part of an input file.
    
3. Check the right panels:
    
    - **AI Thinking Process** shows the model’s step‑by‑step reasoning.
        
    - **Documentation Sources** shows the retriever’s citations and short previews.
        

> The Space wraps generation with `@spaces.GPU(duration=60)` to keep GPU usage small and predictable.
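
For context, the wrapper looks roughly like this in `app.py` (a sketch; the real function name, arguments, and body differ):

```python
import spaces

@spaces.GPU(duration=60)  # reserve a GPU for at most ~60 s per call on Spaces
def generate_response(message, history, temperature=0.7, max_new_tokens=1024):
    # model.generate(...) / streaming happens here on the GPU
    ...
```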

### 2) Run it locally

```bash
pip install "torch" "transformers" "gradio" "chromadb" "gitpython" "tqdm" "spaces"

gradio app.py
```

- First run: if the vector DB is missing, the app will **auto‑build** it (clone gprMax, chunk docs, and index; see the sketch below). You’ll see logs about generating the database and then “RAG database loaded.”
    
- The database is **persistent** (on disk), so later runs are faster. The builder stores a `metadata.json` with settings like chunk size and the embedding model used by Chroma (“all‑MiniLM‑L6‑v2” by default).
    

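For reference, a minimal sketch of what the auto‑build does (clone, chunk, index); the real `rag-db/generate_db.py` differs in details such as CLI flags, error handling, and chunk boundaries:

```python
# Minimal sketch: clone gprMax, chunk /docs, index chunks into a persistent ChromaDB.
from pathlib import Path
import chromadb
from git import Repo

CHUNK_SIZE, OVERLAP = 1000, 200  # matches the settings described in this README

def chunk(text: str):
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def build(db_dir="rag-db/chroma_db", repo_dir="gprMax"):
    if not Path(repo_dir).exists():
        Repo.clone_from("https://github.com/gprMax/gprMax.git", repo_dir)
    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_or_create_collection("gprmax_docs_v1")
    for doc in Path(repo_dir, "docs").rglob("*"):
        if doc.suffix in {".rst", ".md", ".txt"}:
            for i, piece in enumerate(chunk(doc.read_text(errors="ignore"))):
                collection.add(ids=[f"{doc.name}-{i}"], documents=[piece],
                               metadatas=[{"source": str(doc)}])
```
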
---

## Using the app (what to try)

Ask things like:

- “How do I create a basic gprMax input file for a simple GPR simulation?”
    
- “What’s the difference between `#domain` and `#dx_dy_dz`?”
    
- “How do I add a Ricker wavelet source?”
    
- “My simulation is taking too long—any tips to speed it up?”
    
- “How do I model a soil with different dielectric properties?”
    

When the model needs context, it emits a small JSON “tool call” to **search_documentation**. The retriever queries ChromaDB and the UI shows top matches in the right panel with file names and a short preview. Then the model writes a final answer that uses those snippets.
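
The exact schema is defined in the system prompt inside `app.py`; conceptually the tool call looks something like this (field names are illustrative):

```json
{"tool": "search_documentation", "arguments": {"query": "ricker wavelet source syntax"}}
```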

---

## Design principles (in simple terms)

- **Keep it modular.** Model, retriever, and UI are separate pieces. We can upgrade any part later.
    
- **Ground answers in docs.** The model can look things up and show sources, not just “guess.”
    
- **Make it light.** A 4B model plus a local vector DB runs on modest hardware and fits on Spaces.
    
- **Be transparent.** Show what the model is thinking and where facts come from.
    
- **Future‑proof.** Rebuild the DB when docs change; swap in new models or embeddings later.
    

---

## Architecture (at a glance)

```
User ↔ Gradio Chat UI
         │
         ▼
Transformers (Qwen3‑4B fine‑tuned) → streams text + <think> ... </think>
         │
         ▼  (optional tool call as JSON)
search_documentation(query)
         │
         ▼
GprMaxRAGRetriever ── ChromaDB (persistent on disk)
          │                 │
          ▼                 ▼
     gprMax docs (cloned → chunked → indexed)
```

- **Model loading & streaming.** The app uses `AutoTokenizer/AutoModelForCausalLM` with `device_map="auto"`. The generator splits `<think>…</think>` into a separate “AI Thinking Process” pane.
    
- **Tool calling.** The system prompt describes a `search_documentation` tool and the exact JSON format for calling it.
    
- **RAG database.** The builder clones the official `gprMax` repo, reads `/docs` (`.rst`, `.md`, `.txt`), chunks with **size 1000 / overlap 200**, and stores to a **ChromaDB** collection named `gprmax_docs_v1`. Metadata includes `embedding_model: "ChromaDB Default (all‑MiniLM‑L6‑v2)"`.
    
- **Retriever.** Uses a persistent Chroma client and queries via `query_texts`. Distances are turned into scores with a simple `1 - (dist/2)` conversion for display; a minimal sketch of this query path follows this list.
    

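The collection name and the `1 - (dist/2)` conversion below are from this README; everything else is illustrative:

```python
# Minimal sketch of the retriever: query the persistent collection and convert
# distances into display scores.
import chromadb

client = chromadb.PersistentClient(path="rag-db/chroma_db")
collection = client.get_collection("gprmax_docs_v1")

def search_documentation(query: str, n_results: int = 3):
    res = collection.query(query_texts=[query], n_results=n_results)
    hits = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        hits.append({
            "source": meta.get("source", "unknown"),
            "score": 1 - dist / 2,        # distance -> rough similarity for display
            "preview": doc[:200],
        })
    return hits
```
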
---

## Technical choices (frameworks and why)

- **Transformers** to load and run the fine‑tuned Qwen 4B model, with `device_map="auto"` and `trust_remote_code=True`. This keeps the code short and makes GPU/CPU selection automatic.
    
- **Gradio** for the web UI (Blocks + Chatbot + Accordions + Sliders). It’s easy to read and extend.
    
- **ChromaDB** for a simple, persistent vector store that ships with the app. No external service is required.
    
- **GitPython + tqdm** to clone gprMax docs and show progress when building the DB.
    

---

## Reproducible fine‑tuning (LoRA / PEFT)

This is the core of the work. Below is **exactly** how the 4B model was trained and how someone else can redo it.

### What I trained

- **Base model:** `Qwen/Qwen3-4B` (using the Qwen3 chat template).
    
- **Method:** LoRA adapters (**rank=8**, **alpha=16**, **dropout=0.0**) applied to attention and MLP projection layers.
    
- **Outputs:** adapters + merged weights; the app uses the merged variant `jfang/gprmax-ft-Qwen3-4B-Instruct`.
    
- **Other models I trained:** see my collection:  
    [https://huggingface.co/collections/jfang/gprmax-command-finetuned](https://huggingface.co/collections/jfang/gprmax-command-finetuned)
    


### Exact config used (YAML)

```yaml
bf16: true
cutoff_len: 2048
dataset: gpr-train
dataset_dir: data
ddp_timeout: 180000000
do_train: true
enable_thinking: true
finetuning_type: lora
flash_attn: auto
gradient_accumulation_steps: 8
include_num_input_tokens_seen: true
learning_rate: 5.0e-05
logging_steps: 5
lora_alpha: 16
lora_dropout: 0
lora_rank: 8
lora_target: all
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 100000
model_name_or_path: Qwen/Qwen3-4B
num_train_epochs: 2.0
optim: adamw_torch
output_dir: saves/Qwen3-4B-Instruct/lora/train_2025-07-09-08-47-27
packing: false
per_device_train_batch_size: 4
plot_loss: true
preprocessing_num_workers: 16
report_to: none
save_steps: 100
stage: sft
template: qwen3
trust_remote_code: true
warmup_steps: 0
```

**Metrics reported (4B run):**

```json
{
  "epoch": 2.0,
  "num_input_tokens_seen": 48562016,
  "total_flos": 1.0635160197775688e+18,
  "train_loss": 0.3312762507200241,
  "train_runtime": 16760.735,
  "train_samples_per_second": 1.909,
  "train_steps_per_second": 0.06
}
```

**Loss curve:**

![Training loss curve](training_loss.png)

### Path A — Simple HF/PEFT training script

```python
# train_lora_peft.py
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig

BASE = "Qwen/Qwen3-4B"

tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
tok.padding_side = "right"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

ds = load_dataset("json", data_files={"train": "data/gpr-train.jsonl"})

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False, add_generation_prompt=False)}

ds = ds.map(to_text, remove_columns=ds["train"].column_names)

dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=dtype, device_map="auto", trust_remote_code=True)

peft_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    task_type="CAUSAL_LM"
)

args = TrainingArguments(
    output_dir="saves/Qwen3-4B-Instruct/lora/run-peft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    logging_steps=5,
    save_steps=100,
    bf16=True,
    report_to="none",
    max_grad_norm=1.0
)

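# Note: the argument names below (tokenizer=, dataset_text_field=, max_seq_length=,
# packing=) follow the older TRL SFTTrainer API; newer TRL releases move these into
# an SFTConfig passed as `args`. Pin a compatible trl version or adapt accordingly.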
trainer = SFTTrainer(
    model=model,
    peft_config=peft_cfg,
    tokenizer=tok,
    train_dataset=ds["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False
)

trainer.train()
trainer.save_model("saves/Qwen3-4B-Instruct/lora/run-peft")
tok.save_pretrained("saves/Qwen3-4B-Instruct/lora/run-peft")
```

**Inference with adapter (or merge):**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base = "Qwen/Qwen3-4B"
adapter = "saves/Qwen3-4B-Instruct/lora/run-peft"

tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter)

prompt = tok.apply_chat_template(
    [{"role":"user","content":"Give a minimal gprMax 2D model with a 100 MHz Ricker source."}],
    tokenize=False, add_generation_prompt=True
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))

# Optional: merge LoRA into base weights for publishing
# model = model.merge_and_unload()
# model.save_pretrained("merged-qwen3-4b-gprmax")
# tok.save_pretrained("merged-qwen3-4b-gprmax")
```

### How the fine‑tuned model plugs into the app

- `app.py` sets `MODEL_NAME = "jfang/gprmax-ft-Qwen3-4B-Instruct"` and uses `AutoTokenizer/AutoModelForCausalLM` with `device_map="auto"`.  
    It also streams the **thinking** text (between `<think>...</think>`) to a separate UI pane; a minimal sketch of this split follows this list.
    
- When the model emits the tool call JSON for `search_documentation`, the app uses the retriever to query the local ChromaDB and shows sources in the right pane.
    
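A minimal sketch of that split during streaming (the logic in `app.py` differs in details such as buffering and tag handling):

```python
# Sketch: route streamed text either to the "AI Thinking Process" pane or the answer pane.
def split_stream(piece_iter):
    thinking, answer, in_think = [], [], False
    for piece in piece_iter:            # e.g. pieces from transformers.TextIteratorStreamer
        if "<think>" in piece:
            in_think = True
            piece = piece.replace("<think>", "")
        if "</think>" in piece:
            in_think = False
            piece = piece.replace("</think>", "")
        (thinking if in_think else answer).append(piece)
        yield "".join(thinking), "".join(answer)   # update both panes incrementally
```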

---

## Project layout

```
.
├── app.py                          # Main Gradio app: model load, streaming, tool-calling
└── rag-db/
    ├── generate_db.py              # Clone gprMax, chunk docs, build ChromaDB, save metadata
    ├── retriever.py                # Persistent Chroma client + search utilities
    └── chroma_db/                  # (created at runtime) persistent vector DB + metadata.json
```

- If the DB is missing, the app will **auto‑build** it by **cloning the gprMax GitHub repo and embedding the *latest* documentation**, then load it for searches.
    
- The builder saves `metadata.json` with the collection name (`gprmax_docs_v1`), chunking settings, and the embedding label.
    
- The retriever uses a persistent client and turns distances into a simple score for display.
    

---


## Tips & troubleshooting

- **GPU out‑of‑memory?** Lower **Max New Tokens** in Settings or run on CPU; the app chooses CUDA if available, otherwise CPU.

- **No docs in sources panel?** Build the DB manually:

  ```bash
  python rag-db/generate_db.py --recreate
  ```

  This clones the official repo, chunks `/docs` (size **1000**, overlap **200**), builds the `gprmax_docs_v1` collection, and writes metadata.

- **First response is slow.** That’s probably first‑time model load and DB creation. Later runs cache the DB, so it’s faster.

- Smaller models tend to **overthink** ([Cuadron et al., 2025](https://arxiv.org/abs/2502.08235)). We expect open-source models to keep improving, and the pipeline is built so newer models can be swapped in.

## License note

The retriever indexes text from the official gprMax documentation. Please follow the gprMax license for any reuse of that content.

**Thanks:** the gprMax team and community, plus the open‑source ML stack (Transformers, Gradio, ChromaDB).