---
library_name: transformers
license: mit
language:
- en
metrics:
- rouge
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---
# Model Card for **Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)**
A lightweight LoRA-adapter fine-tune of `microsoft/Phi-3-mini-4k-instruct` for **turning structured lab contexts + observations into executable Python code** that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an **assistive code generator** for STEM lab write-ups and teaching demos, not as a certified calculator for safety-critical engineering.
---
## Model Details
### Model Description
- **Developed by:** Barghav777
- **Model type:** Causal decoder LM (instruction-tuned) + **LoRA adapter**
- **Languages:** English
- **License:** MIT
- **Finetuned from:** `microsoft/Phi-3-mini-4k-instruct`
- **Intended input format:** A structured prompt with:
- `### CONTEXT:` (natural-language description of the experiment)
- `### OBSERVATIONS:` (JSON-like dict with units, readings)
- `### CODE:` (the model is trained to generate the Python solution after this tag)
### Model Sources
- **Base model:** `microsoft/Phi-3-mini-4k-instruct`
- **Training data files:** `train.jsonl` (37 items), `eval.jsonl` (6 items)
- **Training notebook:** https://github.com/Barghav777/AI-Lab-Report-Agent
---
## Uses
### Direct Use
- Generate **readable Python code** to compute derived quantities from lab observations (e.g., average \(g\) via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number); see the sketch after this list.
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
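For concreteness, here is a hand-written sketch of the kind of script the model is trained to emit after `### CODE:` (illustrative only, not a verbatim generation); the observation dict mirrors the prompt format shown under How to Get Started.
```python
# Illustrative target output for the pendulum example (not a verbatim
# model generation). Uses T = 2*pi*sqrt(L/g)  =>  g = 4*pi^2*L / T^2.
import math

observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}

g_values = [4 * math.pi**2 * r["L"] / r["T"] ** 2 for r in observations["readings"]]
g_avg = sum(g_values) / len(g_values)

print(f"g per reading (m/s^2): {[round(g, 3) for g in g_values]}")
print(f"average g: {g_avg:.3f} m/s^2")
```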
### Downstream Use
- Course assistants or lab-prep tools that auto-draft calculation code for **intro undergrad physics/mech/fluids/EE labs**.
- Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
### Out-of-Scope Use
- Any **safety-critical** design decisions (structural, medical, chemical process control).
- High-stakes computation without human verification.
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
---
## Bias, Risks, and Limitations
- **Small dataset (37 train / 6 eval)** → plausible overfitting; brittle generalization to unseen experiment formats.
- **Formula misuse risk:** The model may pick incorrect constants/units or silently use wrong equations.
- **Overconfidence:** Generated code may “look right” while being numerically off or unit-inconsistent.
- **JSON brittleness:** If `OBSERVATIONS` keys/units differ from training patterns, the code may break.
### Recommendations
- Always **review formulas and units**; add assertions/unit conversions in downstream systems (see the sketch after this list).
- Run generated code with **test observations** and compare against hand calculations.
- For deployment, wrap outputs with **explanations and references** to the formulas used.
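As a minimal sketch of such a guardrail (function name and tolerance are illustrative, not part of the training data):
```python
# Minimal plausibility guard for a model-computed value (illustrative
# name and tolerance; adapt per experiment).
def check_g(g_computed: float, rel_tol: float = 0.05) -> None:
    g_expected = 9.81  # m/s^2, standard gravity
    assert g_computed > 0, "g must be positive"
    rel_err = abs(g_computed - g_expected) / g_expected
    assert rel_err <= rel_tol, f"g off by {rel_err:.1%} (tolerance {rel_tol:.0%})"

check_g(9.79)   # passes
# check_g(12.3) # raises: off by ~25%
```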
---
## How to Get Started
**Prompt template used in training**
```text
### CONTEXT:
{context}
### OBSERVATIONS:
{observations}
### CODE:
```
**Load base + LoRA adapter (recommended)**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch
base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH"  # e.g., ./phi3-lab-report-coder-final

# 4-bit NF4 quantization, matching the training-time QLoRA setup
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token  # Phi-3 defines no dedicated pad token

base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, trust_remote_code=True, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...
### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}
### CODE:
"""
inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)  # greedy decoding; `temperature` only applies with do_sample=True
```
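To post-process instead of streaming, one option (continuing from the snippet above) is to capture the full generation and slice out everything after the `### CODE:` tag:
```python
# Capture the generation and extract the code that follows "### CODE:".
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=400, do_sample=False)
text = tok.decode(out[0], skip_special_tokens=True)
generated_code = text.split("### CODE:")[-1].strip()
print(generated_code)
```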
---
## Training Details
### Data
- **Files:** `train.jsonl`, `eval.jsonl` (JSON Lines; one example object per line)
- **Schema per example:**
- `context` *(str)*: experiment description
- `observations` *(dict)*: units + numeric readings (lists of dicts)
- `code` *(str)*: reference Python solution
- **Topical spread (non-exhaustive):** pendulum \(g\), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
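A sketch of how one record can be flattened into the SFT prompt text (the exact formatting function lives in the training notebook; this version simply mirrors the template documented above):
```python
import json

def format_example(ex: dict) -> str:
    # Mirrors the documented prompt template; the notebook's actual
    # formatting function may differ in whitespace details.
    return (
        f"### CONTEXT:\n{ex['context']}\n"
        f"### OBSERVATIONS:\n{json.dumps(ex['observations'])}\n"
        f"### CODE:\n{ex['code']}"
    )

with open("train.jsonl") as f:
    train = [json.loads(line) for line in f]
print(format_example(train[0])[:300])
```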
**Size & basic stats**
- Train: **37** items; Eval: **6** items
- Formatted prompt (context+observations+code) length (train):
- mean ≈ **222** words (≈ **1,739** chars); 95th pct ≈ **311** words
- Reference code length (train):
- mean ≈ **34** lines (min **9**, max **71**)
### Training Procedure (from notebook)
- **Approach:** QLoRA (4-bit) SFT using `trl.SFTTrainer` (settings consolidated in the sketch after this list)
- **Quantization:** `bitsandbytes` 4-bit `nf4`, compute dtype `bfloat16`
- **LoRA config:** `r=16`, `alpha=32`, `dropout=0.05`, `bias="none"`, targets = `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- **Tokenizer:** right padding; `eos_token` as `pad_token`
- **Hyperparameters (TrainingArguments):**
- epochs: **10**
- per-device train batch size: **1**
- gradient_accumulation_steps: **4**
- optimizer: **paged_adamw_32bit**
- learning rate: **2e-4**, weight decay: **1e-3**
- warmup_ratio: **0.03**, scheduler: **constant**
- bf16: **True** (fp16: False), group_by_length: True
- logging_steps: 10, save/eval every 50 steps
- report_to: tensorboard
- **Saving:** `trainer.save_model("./phi3-lab-report-coder-final")` (adapter folder)
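Consolidated as code, the configuration above looks roughly like this (a sketch assuming the older `trl` API in which `SFTTrainer` takes `TrainingArguments`, `dataset_text_field`, and `tokenizer` directly; `train_ds`/`eval_ds` are assumed to be `datasets.Dataset` objects with a formatted `text` column):
```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = TrainingArguments(
    output_dir="./phi3-lab-report-coder",
    num_train_epochs=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    weight_decay=1e-3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    bf16=True,  # as recorded; T4 GPUs lack native bf16, so fp16=True may be needed there
    group_by_length=True,
    logging_steps=10,
    evaluation_strategy="steps", eval_steps=50,
    save_strategy="steps", save_steps=50,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=base,              # the 4-bit quantized base from the loading snippet
    args=args,
    train_dataset=train_ds,  # assumed pre-formatted via format_example -> "text"
    eval_dataset=eval_ds,
    peft_config=lora_cfg,
    dataset_text_field="text",
    tokenizer=tok,
)
trainer.train()
trainer.save_model("./phi3-lab-report-coder-final")
```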
### Speeds, Sizes, Times
- **Hardware:** Google Colab **T4 GPU** (per notebook metadata)
- **Adapter artifact:** LoRA weights only (load with the base model).
- **Wall-clock time:** not logged in the notebook.
---
## Evaluation
### Testing Data, Factors & Metrics
- **Eval set:** `eval.jsonl` (**6** items) with same schema.
- **Primary metric (planned):** ROUGE-L / ROUGE-1 against reference `code` (proxy for surface similarity).
- **Recommended additional checks:** unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
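A sketch combining both checks, using `ast.parse` as a lightweight stand-in for a linter (assumes the `evaluate` library; the helper name is illustrative):
```python
import ast
import evaluate

rouge = evaluate.load("rouge")

def score_generation(generated: str, reference: str) -> dict:
    """Syntax-gate with ast.parse, then score surface similarity with ROUGE."""
    try:
        ast.parse(generated)
        parses = True
    except SyntaxError:
        parses = False
    scores = rouge.compute(predictions=[generated], references=[reference])
    return {"parses": parses, "rouge1": scores["rouge1"], "rougeL": scores["rougeL"]}
```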
### Results
- No automated score recorded in the notebook.
- **Suggested protocol:**
1) Generate code for each eval item using the same prompt template.
2) Execute safely in a sandbox with provided observations.
3) Compare computed scalars (e.g., average \(g\), \(R\), Reynolds number) against ground truth within numeric tolerances.
4) Report pass rate and ROUGE for readability/similarity.
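A toy version of steps 2 and 3 (illustration only: `exec` is not a sandbox, so a real harness should run generations in a subprocess or container with resource limits; the result variable name is an assumption):
```python
import math

def run_and_compare(code: str, expected: float, var: str = "g_avg",
                    rel_tol: float = 0.05) -> bool:
    """Execute generated code and compare one scalar to ground truth."""
    ns: dict = {}
    exec(code, ns)  # NOT safe for untrusted code; demo only
    return math.isclose(ns[var], expected, rel_tol=rel_tol)

# e.g., run_and_compare(generated_code, expected=9.81)
```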
---
## Model Examination
- Inspect token-by-token attention to `OBSERVATIONS` keys (ablation: shuffle keys to test robustness).
- Add **unit-check helpers** (e.g., `pint`) in prompts to encourage explicit conversions.
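One way to implement such a helper with `pint` (a sketch; values reuse the pendulum example):
```python
import math
import pint

ureg = pint.UnitRegistry()
L = 0.50 * ureg.meter
T = 1.42 * ureg.second
g = 4 * math.pi**2 * L / T**2
assert g.check("[length] / [time] ** 2")  # dimensional sanity check
print(g.to("m/s**2"))
```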
---
## Environmental Impact
- **Hardware Type:** NVIDIA T4 (Colab)
- **Precision:** 4-bit QLoRA with `bfloat16` compute
- **Hours used:** Not recorded (dataset is small; expected low)
- **Cloud Provider/Region:** Colab (unspecified)
- **Carbon Emitted:** Not estimated (see [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute))
---
## Technical Specifications
### Architecture & Objective
- **Backbone:** `Phi-3-mini-4k-instruct` (decoder-only causal LM)
- **Objective:** Supervised fine-tuning to continue from `### CODE:` with correct, executable Python.
### Compute Infrastructure
- **Hardware:** Colab GPU (T4) + CPU RAM
- **Software:**
- `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `accelerate`, `torch`
---
## Citation
```bibtex
@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}
```
---
## Glossary
- **QLoRA:** Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
- **LoRA (r, α):** Rank and scaling of low-rank update matrices.
---
## More Information
- For better robustness, consider augmenting data with **unit-perturbation** and **noise-in-readings** variants, and add examples across more domains (materials, thermo, optics).
- Add **eval harness** with numeric tolerances and syntax checks.
---
## Model Card Authors
- Barghav777
---