---
library_name: transformers
license: mit
language:
- en
metrics:
- rouge
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---

# Model Card for **Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)**

A lightweight LoRA-adapter fine-tune of `microsoft/Phi-3-mini-4k-instruct` for **turning structured lab contexts + observations into executable Python code** that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations).

Trained with QLoRA in 4-bit, this model is intended as an **assistive code generator** for STEM lab writeups and teaching demos, not as a certified calculator for safety-critical engineering.

---

## Model Details

### Model Description

- **Developed by:** Barghav777
- **Model type:** Causal decoder LM (instruction-tuned) + **LoRA adapter**
- **Languages:** English
- **License:** MIT
- **Finetuned from:** `microsoft/Phi-3-mini-4k-instruct`
- **Intended input format:** A structured prompt with:
  - `### CONTEXT:` (natural-language description of the experiment)
  - `### OBSERVATIONS:` (JSON-like dict with units and readings)
  - `### CODE:` (the model is trained to generate the Python solution after this tag)

### Model Sources

- **Base model:** `microsoft/Phi-3-mini-4k-instruct`
- **Training data files:** `train.jsonl` (37 items), `eval.jsonl` (6 items)
- **Demo/Colab basis:** Training notebook available at https://github.com/Barghav777/AI-Lab-Report-Agent

---

## Uses

### Direct Use

- Generate **readable Python code** to compute derived quantities from lab observations (e.g., average \(g\) via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.

### Downstream Use

- Course assistants or lab-prep tools that auto-draft calculation code for **intro undergrad physics/mech/fluids/EE labs**.
- Auto-checkers that compare student code against a reference implementation (with appropriate guardrails).

### Out-of-Scope Use

- Any **safety-critical** design decisions (structural, medical, chemical process control).
- High-stakes computation without human verification.
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).

---

## Bias, Risks, and Limitations

- **Small dataset (37 train / 6 eval):** plausible overfitting; brittle generalization to unseen experiment formats.
- **Formula misuse risk:** the model may pick incorrect constants/units or silently use the wrong equations.
- **Overconfidence:** generated code may “look right” while being numerically off or unit-inconsistent.
- **JSON brittleness:** if `OBSERVATIONS` keys/units differ from training patterns, the generated code may break.

### Recommendations

- Always **review formulas and units**; add assertions/unit conversions in downstream systems.
- Run generated code with **test observations** and compare against hand calculations (a minimal check is sketched below).
- For deployment, wrap outputs with **explanations and references** to the formulas used.
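As a concrete illustration of the second recommendation, here is a minimal, hypothetical check: `average_g` stands in for whatever code the model generates for a pendulum experiment, and the hand calculation is done independently and compared within a tolerance. The observation dict and variable names are illustrative, not taken from the training data.

```python
import math

# Hypothetical observations in the format used by the training prompts.
observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}

def average_g(obs):
    """Stand-in for model-generated code: g = 4*pi^2*L/T^2, averaged over readings."""
    values = [4 * math.pi**2 * r["L"] / r["T"] ** 2 for r in obs["readings"]]
    return sum(values) / len(values)

g = average_g(observations)

# Independent hand calculation over the same readings, used as a cross-check.
hand = (4 * math.pi**2 * 0.50 / 1.42**2 + 4 * math.pi**2 * 0.60 / 1.55**2) / 2

assert abs(g - hand) < 0.05, f"generated result {g:.3f} disagrees with hand value {hand:.3f}"
print(f"average g ≈ {g:.2f} m/s^2")
```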
---

## How to Get Started

**Prompt template used in training**

```text
### CONTEXT:
{context}

### OBSERVATIONS:
{observations}

### CODE:
```

**Load base + LoRA adapter (recommended)**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH"  # e.g., ./phi3-lab-report-coder-final

# 4-bit NF4 quantization, matching the training setup
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb,
    trust_remote_code=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...

### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}

### CODE:
"""

inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
# Greedy decoding (do_sample=False), so no temperature is needed.
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)
```

---

## Training Details

### Data

- **Files:** `train.jsonl` (list of objects), `eval.jsonl` (list of objects)
- **Schema per example:**
  - `context` *(str)*: experiment description
  - `observations` *(dict)*: units + numeric readings (lists of dicts)
  - `code` *(str)*: reference Python solution
- **Topical spread (non-exhaustive):** pendulum \(g\), Ohm’s law, titration, density via displacement, Coriolis acceleration, gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.

**Size & basic stats**

- Train: **37** items; Eval: **6** items
- Formatted prompt (context + observations + code) length (train): mean ≈ **222** words (≈ **1,739** chars); 95th percentile ≈ **311** words
- Reference code length (train): mean ≈ **34** lines (min **9**, max **71**)

### Training Procedure (from notebook)

- **Approach:** QLoRA (4-bit) SFT using `trl.SFTTrainer`
- **Quantization:** `bitsandbytes` 4-bit `nf4`, compute dtype `bfloat16`
- **LoRA config:** `r=16`, `alpha=32`, `dropout=0.05`, `bias="none"`, targets = `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- **Tokenizer:** right padding; `eos_token` used as `pad_token`
- **Hyperparameters (TrainingArguments):**
  - epochs: **10**
  - per-device train batch size: **1**
  - gradient_accumulation_steps: **4**
  - optimizer: **paged_adamw_32bit**
  - learning rate: **2e-4**, weight decay: **1e-3**
  - warmup_ratio: **0.03**, scheduler: **constant**
  - bf16: **True** (fp16: False), group_by_length: True
  - logging_steps: 10; save/eval every 50 steps
  - report_to: tensorboard
- **Saving:** `trainer.save_model("./phi3-lab-report-coder-final")` (adapter folder)

### Speeds, Sizes, Times

- **Hardware:** Google Colab **T4 GPU** (per notebook metadata)
- **Adapter artifact:** LoRA weights only (load with the base model).
- **Wall-clock time:** not logged in the notebook.

---

## Evaluation

### Testing Data, Factors & Metrics

- **Eval set:** `eval.jsonl` (**6** items) with the same schema.
- **Primary metric (planned):** ROUGE-L / ROUGE-1 against the reference `code` (a proxy for surface similarity); a minimal scoring sketch follows.
- **Recommended additional checks:** unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
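The sketch below shows one way to run the planned ROUGE comparison with the Hugging Face `evaluate` library. It assumes `model` and `tok` are already loaded as in the How to Get Started snippet, that `eval.jsonl` holds one JSON object per line with `context`/`observations`/`code` keys, and that greedy decoding mirrors the inference example; adjust the parsing and generation settings to your actual files and hardware.

```python
import json
import evaluate  # pip install evaluate rouge_score

# Prompt template used in training.
PROMPT = "### CONTEXT:\n{context}\n\n### OBSERVATIONS:\n{observations}\n\n### CODE:\n"

def generate_code(item, max_new_tokens=400):
    """Generate the Python solution for one eval item (greedy decoding)."""
    prompt = PROMPT.format(context=item["context"], observations=item["observations"])
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, i.e. the code after "### CODE:".
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# eval.jsonl: assumed to be one JSON object per line.
with open("eval.jsonl") as f:
    eval_items = [json.loads(line) for line in f]

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=[generate_code(it) for it in eval_items],
    references=[it["code"] for it in eval_items],
)
print({k: round(float(v), 3) for k, v in scores.items()})  # rouge1, rouge2, rougeL, rougeLsum
```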
### Results

- No automated score was recorded in the notebook.
- **Suggested protocol:**
  1. Generate code for each eval item using the same prompt template.
  2. Execute it safely in a sandbox with the provided observations.
  3. Compare computed scalars (e.g., average \(g\), \(R\), Reynolds number) against ground truth within tolerances.
  4. Report pass rate and ROUGE for readability/similarity.

---

## Model Examination

- Inspect token-by-token attention to `OBSERVATIONS` keys (ablation: shuffle keys to test robustness).
- Add **unit-check helpers** (e.g., `pint`) in prompts to encourage explicit conversions.

---

## Environmental Impact

- **Hardware Type:** NVIDIA T4 (Colab)
- **Precision:** 4-bit QLoRA with `bfloat16` compute
- **Hours used:** Not recorded (dataset is small; expected low)
- **Cloud Provider/Region:** Colab (unspecified)
- **Carbon Emitted:** Not estimated (see the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute))

---

## Technical Specifications

### Architecture & Objective

- **Backbone:** `Phi-3-mini-4k-instruct` (decoder-only causal LM)
- **Objective:** Supervised fine-tuning to continue from `### CODE:` with correct, executable Python.

### Compute Infrastructure

- **Hardware:** Colab GPU (T4) + CPU RAM
- **Software:** `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `accelerate`, `torch`

---

## Citation

```bibtex
@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}
```

---

## Glossary

- **QLoRA:** Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
- **LoRA (r, α):** Rank and scaling factor of the low-rank update matrices.

---

## More Information

- For better robustness, consider augmenting the data with **unit-perturbation** and **noise-in-readings** variants, and add examples across more domains (materials, thermo, optics).
- Add an **eval harness** with numeric tolerances and syntax checks (a minimal sketch appears at the end of this card).

---

## Model Card Authors

- Barghav777

---
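As a starting point for the eval harness suggested above, here is a minimal, hypothetical sketch that syntax-checks a generated script and compares named scalars against ground truth within a relative tolerance. The variable names and expected values are placeholders, and `exec()` is not a sandbox: in a real harness, run generated code inside an isolated subprocess or container with a timeout, per the suggested protocol.

```python
import ast

def check_generated(code_str, expected, rel_tol=0.02):
    """Syntax-check a generated script, run it, and compare named scalars.

    `expected` maps variable names to ground-truth values, e.g. {"g_avg": 9.81}.
    NOTE: exec() is NOT a sandbox; isolate this step in a real harness.
    """
    # 1. Syntax check (complements pyflakes/ruff).
    try:
        ast.parse(code_str)
    except SyntaxError as err:
        return {"syntax_ok": False, "error": str(err)}

    # 2. Execute and capture the script's top-level variables.
    namespace = {}
    exec(code_str, namespace)

    # 3. Numeric tolerance checks against ground truth.
    results = {"syntax_ok": True, "passed": {}, "missing": []}
    for name, truth in expected.items():
        if name not in namespace:
            results["missing"].append(name)
            continue
        results["passed"][name] = abs(namespace[name] - truth) <= rel_tol * abs(truth)
    return results

# Example with a trivial stand-in for model output.
sample = "import math\ng_avg = 4 * math.pi**2 * 0.50 / 1.42**2"
print(check_generated(sample, expected={"g_avg": 9.79}))
```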