Model Card for Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)

A lightweight LoRA-adapter fine-tune of microsoft/Phi-3-mini-4k-instruct for turning structured lab contexts + observations into executable Python code that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an assistive code generator for STEM lab write-ups and teaching demos, not as a certified calculator for safety-critical engineering.


Model Details

Model Description

  • Developed by: Barghav777
  • Model type: Causal decoder LM (instruction-tuned) + LoRA adapter
  • Languages: English
  • License: MIT
  • Finetuned from: microsoft/Phi-3-mini-4k-instruct
  • Intended input format: A structured prompt with:
    • ### CONTEXT: (natural-language description of the experiment)
    • ### OBSERVATIONS: (JSON-like dict with units, readings)
    • ### CODE: (the model is trained to generate the Python solution after this tag)

Model Sources

  • Repository: https://huggingface.co/Barghav777/phi3-lab-report-coder

Uses

Direct Use

  • Generate readable Python code to compute derived quantities from lab observations (e.g., average g via a pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number); a representative sketch follows this list.
  • Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
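
For illustration, a hypothetical sketch of the kind of code the model is trained to produce for the pendulum example used later in this card (not a verbatim model output):

import math

observations = {'readings': [{'L': 0.50, 'T': 1.42}, {'L': 0.60, 'T': 1.55}],
                'unit_L': 'm', 'unit_T': 's'}

# g = 4 * pi^2 * L / T^2 for each reading, then average
g_values = [4 * math.pi**2 * r['L'] / r['T']**2 for r in observations['readings']]
g_avg = sum(g_values) / len(g_values)
print(f"g per reading: {[round(g, 3) for g in g_values]}")
print(f"average g: {g_avg:.3f} m/s^2")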

Downstream Use

  • Course assistants or lab-prep tools that auto-draft calculation code for intro undergrad physics/mech/fluids/EE labs.
  • Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).

Out-of-Scope Use

  • Any safety-critical design decisions (structural, medical, chemical process control).
  • High-stakes computation without human verification.
  • Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).

Bias, Risks, and Limitations

  • Small dataset (37 train / 6 eval) → plausible overfitting; brittle generalization to unseen experiment formats.
  • Formula misuse risk: The model may pick incorrect constants/units or silently use wrong equations.
  • Overconfidence: Generated code may “look right” while being numerically off or unit-inconsistent.
  • JSON brittleness: If OBSERVATIONS keys/units differ from training patterns, the code may break.

Recommendations

  • Always review formulas and units; add assertions/unit conversions in downstream systems (a minimal guardrail sketch follows this list).
  • Run generated code with test observations and compare against hand calculations.
  • For deployment, wrap outputs with explanations and references to the formulas used.
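
A minimal guardrail sketch (function and variable names are illustrative, not produced by the model):

def normalize_length_to_m(value, unit):
    # explicit unit conversion before using a reading in a formula
    factors = {'m': 1.0, 'cm': 0.01, 'mm': 0.001}
    assert unit in factors, f"unexpected length unit: {unit}"
    return value * factors[unit]

g = 9.79  # result produced by generated code, for illustration
assert 9.0 < g < 10.5, "computed g outside plausible range - check formulas/units"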

How to Get Started

Prompt template used in training

### CONTEXT:
{context}

### OBSERVATIONS:
{observations}

### CODE:

Load base + LoRA adapter (recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "Barghav777/phi3-lab-report-coder"  # or a local path, e.g., ./phi3-lab-report-coder-final

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...

### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}

### CODE:
"""

inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)  # greedy decoding; set do_sample=True and a temperature to sample instead
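
Optionally, the adapter can be merged into a full-precision copy of the base model for standalone deployment. A sketch, assuming the base is reloaded in bfloat16 rather than 4-bit (merging into a quantized model is not recommended):

base_fp = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16,
                                               trust_remote_code=True, device_map="auto")
merged = PeftModel.from_pretrained(base_fp, adapter_id).merge_and_unload()
merged.save_pretrained("./phi3-lab-report-coder-merged")
tok.save_pretrained("./phi3-lab-report-coder-merged")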

Training Details

Data

  • Files: train.jsonl (list of objects), eval.jsonl (list of objects)
  • Schema per example (an illustrative record follows this list):
    • context (str): experiment description
    • observations (dict): units + numeric readings (lists of dicts)
    • code (str): reference Python solution
  • Topical spread (non-exhaustive): pendulum (g), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
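
An illustrative record following this schema (field values invented for illustration):

{"context": "Experiment to determine acceleration due to gravity using a simple pendulum...",
 "observations": {"readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
                  "unit_L": "m", "unit_T": "s"},
 "code": "import math\n..."}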

Size & basic stats

  • Train: 37 items; Eval: 6 items
  • Formatted prompt (context+observations+code) length (train):
    • mean ≈ 222 words (≈ 1,739 chars); 95th pct ≈ 311 words
  • Reference code length (train):
    • mean ≈ 34 lines (min 9, max 71)

Training Procedure (from notebook)

  • Approach: QLoRA (4-bit) SFT using trl.SFTTrainer (a configuration sketch follows this list)
  • Quantization: bitsandbytes 4-bit nf4, compute dtype bfloat16
  • LoRA config: r=16, alpha=32, dropout=0.05, bias="none", targets = q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
  • Tokenizer: right padding; eos_token as pad_token
  • Hyperparameters (TrainingArguments):
    • epochs: 10
    • per-device train batch size: 1
    • gradient_accumulation_steps: 4
    • optimizer: paged_adamw_32bit
    • learning rate: 2e-4, weight decay: 1e-3
    • warmup_ratio: 0.03, scheduler: constant
    • bf16: True (fp16: False), group_by_length: True
    • logging_steps: 10, save/eval every 50 steps
    • report_to: tensorboard
  • Saving: trainer.save_model("./phi3-lab-report-coder-final") (adapter folder)
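
A minimal sketch of how these settings map onto peft.LoraConfig and transformers.TrainingArguments (exact trl.SFTTrainer arguments vary by version; output_dir is a placeholder):

from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

train_args = TrainingArguments(
    output_dir="./phi3-lab-report-coder",   # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    weight_decay=1e-3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    bf16=True,
    group_by_length=True,
    logging_steps=10,
    save_steps=50,
    eval_steps=50,   # pair with an eval strategy of "steps" (argument name varies by transformers version)
    report_to="tensorboard",
)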

Speeds, Sizes, Times

  • Hardware: Google Colab T4 GPU (per notebook metadata)
  • Adapter artifact: LoRA weights only (load with the base model).
  • Wall-clock time: not logged in the notebook.

Evaluation

Testing Data, Factors & Metrics

  • Eval set: eval.jsonl (6 items) with same schema.
  • Primary metric (planned): ROUGE-L / ROUGE-1 against reference code (proxy for surface similarity).
  • Recommended additional checks: unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions (a minimal check sketch follows this list).
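
A minimal check sketch, assuming the evaluate package for ROUGE:

import ast
import evaluate

rouge = evaluate.load("rouge")

def basic_checks(generated, reference):
    try:
        ast.parse(generated)   # does the generated code at least parse?
        syntax_ok = True
    except SyntaxError:
        syntax_ok = False
    scores = rouge.compute(predictions=[generated], references=[reference])
    return {"syntax_ok": syntax_ok, "rougeL": scores["rougeL"]}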

Results

  • No automated score recorded in the notebook.
  • Suggested protocol (steps 2 and 3 are sketched in code after this list):
    1. Generate code for each eval item using the same prompt template.
    2. Execute safely in a sandbox with provided observations.
    3. Compare computed scalars (e.g., average g, resistance R, Reynolds number) to ground-truth tolerances.
    4. Report pass rate and ROUGE for readability/similarity.
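
A sketch of steps 2 and 3; plain exec in a throwaway namespace is used only for illustration and is not a real sandbox, so run untrusted code in an isolated process or container:

import math

def check_scalar(generated_code, observations, var_name, expected, rel_tol=0.02):
    namespace = {'observations': observations}
    exec(generated_code, namespace)   # illustration only; not safe isolation
    return math.isclose(namespace[var_name], expected, rel_tol=rel_tol)

# Hypothetical usage for a pendulum item (the variable name g_avg is assumed):
# ok = check_scalar(code, obs, var_name='g_avg', expected=9.81, rel_tol=0.05)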

Model Examination (optional)

  • Inspect token-by-token attention to OBSERVATIONS keys (ablation: shuffle keys to test robustness).
  • Add unit-check helpers (e.g., pint) in prompts to encourage explicit conversions (see the sketch below).
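
A minimal pint sketch of the kind of explicit conversion the prompts can encourage (assumes the pint package is installed):

import math
import pint

ureg = pint.UnitRegistry()
L = (50.0 * ureg.centimeter).to(ureg.meter)   # 0.5 m
T = 1.42 * ureg.second
g = 4 * math.pi**2 * L / T**2
print(g.to(ureg.meter / ureg.second**2))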

Environmental Impact

  • Hardware Type: NVIDIA T4 (Colab)
  • Precision: 4-bit QLoRA with bfloat16 compute
  • Hours used: Not recorded (dataset is small; expected low)
  • Cloud Provider/Region: Colab (unspecified)
  • Carbon Emitted: Not estimated (see ML CO2 Impact calculator)

Technical Specifications

Architecture & Objective

  • Backbone: Phi-3-mini-4k-instruct (decoder-only causal LM)
  • Objective: Supervised fine-tuning to continue from ### CODE: with correct, executable Python.

Compute Infrastructure

  • Hardware: Colab GPU (T4) + CPU RAM
  • Software:
    • transformers, trl, peft, bitsandbytes, datasets, accelerate, torch

Citation

@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}


Glossary

  • QLoRA: Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
  • LoRA (r, α): Rank and scaling of low-rank update matrices.
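
In standard LoRA notation the adapted weight is W' = W0 + (α/r)·B·A, where B is d×r and A is r×k, so only r·(d+k) parameters per target matrix are trained.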

More Information

  • For better robustness, consider augmenting data with unit-perturbation and noise-in-readings variants, and add examples across more domains (materials, thermo, optics).
  • Add eval harness with numeric tolerances and syntax checks.

Model Card Authors

  • Barghav777
