	# 🔬 AI-Driven Polymer Aging Prediction and Classification System

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

	A research project developed as part of AIRE 2025. This system applies deep learning to Raman spectral data to classify polymer aging — a critical proxy for recyclability — using a fully reproducible and modular ML pipeline.

	---

	## 🎯 Project Objective

	- Build a validated machine learning system for classifying polymer spectra (predict degradation levels as a proxy for recyclability)
	- Compare literature-based and modern CNN architectures (Figure2CNN vs. ResNet1D) on Raman spectral data
	- Ensure scientific reproducibility through structured diagnostics and artifact control
	- Support sustainability and circular materials research through spectrum-based classification

	---

	## 🧠 Model Architectures

	| Model | Description |
	|-------|-------------|
	| `Figure2CNN` | Baseline model from literature |
	| `ResNet1D` | Deeper candidate model with skip connections |

	> Both models support flexible input lengths; Figure2CNN relies on reshape logic, while ResNet1D uses native global pooling.
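
To illustrate why global pooling makes a network independent of input length, here is a minimal pure-Python sketch. It is not the repository's `ResNet1D` code: `global_avg_pool` and the toy feature maps are assumptions made for illustration. Averaging each channel over its length axis yields one value per channel, so a fixed-size classifier head can follow no matter how long the spectrum is.

```python
def global_avg_pool(feature_map):
    """Average each channel over the length axis.

    feature_map: list of channels, each a non-empty list of floats.
    Returns one float per channel, independent of channel length.
    """
    return [sum(channel) / len(channel) for channel in feature_map]


# Two toy feature maps (illustrative values) with 3 channels but different lengths.
short_map = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
long_map = [[1.0, 2.0, 3.0, 2.0], [5.0, 5.0, 5.0, 5.0], [0.0, 0.0, 4.0, 4.0]]

pooled_short = global_avg_pool(short_map)
pooled_long = global_avg_pool(long_map)

# Both results contain exactly one value per channel.
print(len(pooled_short), len(pooled_long))
```

In PyTorch, the analogous layer is `torch.nn.AdaptiveAvgPool1d(1)`; whether the project's `ResNet1D` uses exactly that layer is not stated here.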

	---

	## 📁 Project Structure (Cleaned and Current)

	```text
	ml-polymer-recycling/
	├── datasets/
	├── models/ # Model architectures
	├── scripts/ # Training, inference, utilities
	├── outputs/ # Artifacts: models, logs, plots
	├── docs/ # Documentation & reports
	└── environment.yml # (local) Conda execution environment
	```

	<img width="1773" height="848" alt="ml-polymer-gitdiagram-0" src="https://github.com/user-attachments/assets/bb5d93dc-7ab9-4259-8513-fb680ae59d64" />


	---

	## ✅ Current Status

	| Track | Status | Test Accuracy |
	|-------|--------|---------------|
	| Raman | ✅ Active & validated | 87.81% ± 7.59% |
	| FTIR | ⏸️ Deferred (modeling only) | N/A |

	Note: FTIR preprocessing scripts are preserved but inactive. Modeling work is deferred until a suitable architecture is identified.

	Artifacts:

	- `outputs/figure2_model.pth`
	- `outputs/resnet_model.pth`
	- `outputs/logs/raman_{model}_diagnostics.json`
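
As a rough sketch of how a "mean ± spread" accuracy figure and a diagnostics JSON could be produced from per-fold results, consider the following. It assumes the ± value is a standard deviation across folds and uses the sample standard deviation; the project may compute it differently. The function name, the JSON field names, and the fold accuracies are all hypothetical placeholders, not the project's results.

```python
import json
import statistics


def summarize_folds(fold_accuracies):
    """Return mean and sample standard deviation of per-fold accuracies (in %)."""
    return {
        "fold_accuracies": list(fold_accuracies),
        "mean_accuracy": statistics.mean(fold_accuracies),
        "std_accuracy": statistics.stdev(fold_accuracies),
    }


# Placeholder accuracies for illustration only; NOT the project's measurements.
example_folds = [80.0, 90.0, 100.0]
summary = summarize_folds(example_folds)

# Serialize in a diagnostics-style layout (hypothetical field names).
print(json.dumps(summary, indent=2))
```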

	---

	## 🔬 Key Features

	- ✅ 10-Fold Stratified Cross-Validation
	- ✅ CLI Training: `train_model.py`
	- ✅ CLI Inference: `run_inference.py`
	- ✅ Output artifact naming per model
	- ✅ Raman-only preprocessing with baseline correction, smoothing, normalization
	- ✅ Structured diagnostics JSON (accuracies, confusion matrices)
	- ✅ Canonical validation script (`validate_pipeline.sh`) confirms reproducibility of all core components
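
The stratified 10-fold split listed above can be sketched in plain Python as follows. This is an illustration of the technique, not the project's implementation: `stratified_folds`, the seed, and the toy labels are assumptions. Each class is shuffled and dealt round-robin across the folds, so every fold keeps roughly the same class proportions as the full dataset.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


# Toy labels: 30 samples of class 0 and 20 of class 1 (illustrative only).
labels = [0] * 30 + [1] * 20
folds = stratified_folds(labels)

for fold_id, test_indices in enumerate(folds):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    # A real run would train on train_indices and evaluate on test_indices.
    print(fold_id, len(train_indices), len(test_indices))
```

With scikit-learn installed, `sklearn.model_selection.StratifiedKFold` provides the same kind of split and is usually the more practical choice.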

	---

	Environments:

	```bash

	# Local
	git checkout main
	conda env create -f environment.yml
	conda activate polymer_env

	# HPC
	git checkout hpc-main
	conda env create -f environment_hpc.yml
	conda activate polymer_env
	```

	## 📊 Sample Training & Inference

	### Training (10-Fold CV)

	```bash

	python scripts/train_model.py --model resnet --target-len 4000 --baseline --smooth --normalize
	```
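
The flags in the command above map naturally onto `argparse`. The parser below is a hypothetical reconstruction, not the contents of `train_model.py`: the help strings and the `figure2` choice are assumptions (only `resnet` appears in the command), and the meaning given for `--target-len` is inferred from its name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Train a Raman spectrum classifier")
    # "figure2" is an assumed name for the baseline; only "resnet" is documented.
    parser.add_argument("--model", choices=["figure2", "resnet"], required=True)
    # Assumed meaning: number of points each spectrum is brought to before training.
    parser.add_argument("--target-len", type=int, required=True)
    # Each preprocessing step is off unless its flag is passed.
    parser.add_argument("--baseline", action="store_true", help="apply baseline correction")
    parser.add_argument("--smooth", action="store_true", help="apply smoothing")
    parser.add_argument("--normalize", action="store_true", help="apply normalization")
    return parser


# Parse the same arguments as the documented command line.
args = build_parser().parse_args(
    ["--model", "resnet", "--target-len", "4000", "--baseline", "--smooth", "--normalize"]
)
print(args.model, args.target_len, args.baseline, args.smooth, args.normalize)
```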

	### Inference (Raman)

	```bash
	python scripts/run_inference.py --target-len 4000 \
	  --input datasets/rdwp/sample123.txt --model outputs/resnet_model.pth \
	  --output outputs/inference/prediction.txt
	```

	### Inference Output Example

	```text
	Predicted Label: 1 True Label: 1
	Raw Logits: [[-569.544, 427.996]]
	```
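
To relate the raw logits to the predicted label, the sketch below takes the argmax of each row and converts the logits to probabilities with a numerically stable softmax. It is a generic illustration, not code from `run_inference.py`; the helper name `softmax` is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


raw_logits = [[-569.544, 427.996]]  # values copied from the example output above

for row in raw_logits:
    probabilities = softmax(row)
    # The predicted label is the index of the largest logit.
    predicted_label = max(range(len(row)), key=row.__getitem__)
    print("Predicted Label:", predicted_label, "Probabilities:", probabilities)
```

Because the two logits are so far apart, the softmax saturates and essentially all probability mass falls on class 1, which matches the predicted label in the example.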

	### Validation Script (Raman Pipeline)

	```bash
	./validate_pipeline.sh
	# Runs preprocessing, training, inference, and plotting checks
	# Confirms artifact integrity and logs test results
	```

	---

	## 📚 Dataset Resources

	| Type | Dataset | Source |
	|------|---------|--------|
	| Raman | RDWP | [A Raman database of microplastics weathered under natural environments](https://data.mendeley.com/datasets/kpygrf9fg6/1) |

	Datasets should be downloaded separately and placed here:

	```text
	datasets/
	└── rdwp/
	    ├── sample1.txt
	    ├── sample2.txt
	    └── ...
	```

	These files are intentionally excluded from version control via `.gitignore`.

	---

	## 🛠 Dependencies

	- `Python 3.10+`
	- `Conda, Git`
	- `PyTorch (CPU & CUDA)`
	- `NumPy, SciPy, Pandas`
	- `scikit-learn`
	- `Matplotlib, Seaborn`
	- `argparse, json` (Python standard library)

	---

	## 🧑‍🤝‍🧑 Contributors

	- Dr. Sanmukh Kuppannagari — Research Mentor
	- Dr. Metin Karailyan — Research Mentor
	- Jaser H. — AIRE 2025 Intern, Developer

	---

	## 🎯 Strategic Expansion Objectives

	> Following Dr. Kuppannagari’s updated guidance, the project scope now extends beyond the Raman-only validated baseline. The roadmap defines three major expansion paths designed to broaden the system’s capabilities and impact:

	### 1. Model Expansion: Multi-Model Dashboard

	> The dashboard will evolve into a hub for multiple model architectures rather than being tied to a single baseline. Planned work includes:

	- Retraining & Fine-Tuning: Incorporating publicly available vision models and retraining them with the polymer dataset.
	- Model Registry: Automatically detecting available .pth weights and exposing them in the dashboard for easy selection.
	- Side-by-Side Reporting: Running comparative experiments and reporting each model’s accuracy and diagnostics in a standardized format.
	- Reproducible Integration: Maintaining modular scripts and pipelines so each model’s results can be replicated without conflict.

	This ensures flexibility for future research and transparency in performance comparisons.
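
The planned model registry could discover weight files along these lines. This is a sketch under stated assumptions, not the dashboard's implementation: `discover_weights` is a hypothetical helper, and it assumes the `<name>_model.pth` naming seen in the artifacts list (`figure2_model.pth`, `resnet_model.pth`). The demo uses a temporary directory with empty placeholder files rather than real weights.

```python
import tempfile
from pathlib import Path


def discover_weights(outputs_dir):
    """Map a model name to its .pth file for every weight file in outputs_dir."""
    registry = {}
    for path in sorted(Path(outputs_dir).glob("*.pth")):
        # Assumes "<name>_model.pth" naming; other stems are kept unchanged.
        name = path.stem.removesuffix("_model")
        registry[name] = path
    return registry


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("figure2_model.pth", "resnet_model.pth", "notes.txt"):
        (Path(tmp) / filename).touch()
    registry = discover_weights(tmp)
    print(sorted(registry))
```

A dashboard could then populate its model selector from the registry keys, so newly trained weights appear without code changes.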

	### 2. Image Input Modality

	> The system will support classification on images as an additional modality, extending beyond spectra. Key features will include:

	- Upload Support: Users can upload single images or batches directly through the dashboard.
	- Multi-Model Execution: Selected models from the registry can be applied to all uploaded images simultaneously.
	- Batch Results: Output will be returned in a structured, accessible way, showing both individual predictions and aggregate statistics.
	- Enhanced Feedback: Outputs will include predicted class, model confidence, and potentially annotated image previews.

	This expands the system toward a multi-modal framework, supporting broader research workflows.
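
One possible shape for the batch results described above, with individual predictions and aggregate statistics side by side, is sketched below. Since this feature is still planned, everything here is hypothetical: the `(filename, label, confidence)` tuple layout, the report fields, and the sample values are assumptions for illustration only.

```python
from collections import Counter


def aggregate_predictions(predictions):
    """Summarize a batch of (filename, label, confidence) predictions."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(label for _, label, _ in predictions)
    mean_confidence = sum(conf for _, _, conf in predictions) / len(predictions)
    return {
        "total": len(predictions),
        "per_class": dict(counts),
        "mean_confidence": mean_confidence,
        "individual": list(predictions),
    }


# Made-up predictions for illustration; not produced by any real model.
batch = [("a.png", 1, 0.9), ("b.png", 0, 0.7), ("c.png", 1, 0.8)]
report = aggregate_predictions(batch)
print(report["total"], report["per_class"], round(report["mean_confidence"], 2))
```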

	### 3. FTIR Dataset Integration

	> Although previously deferred, FTIR support will be added back in a modular, distinct fashion. Planned steps are:

	- Dedicated Preprocessing: Tailored scripts to handle FTIR-specific signal characteristics (multi-layer handling, baseline correction, normalization).
	- Architecture Compatibility: Ensuring existing and retrained models can process FTIR data without mixing it with Raman workflows.
	- UI Integration: Introducing FTIR as a separate option in the modality selector, keeping Raman, Image, and FTIR workflows clearly delineated.
	- Phased Development: Implementation details to be refined during meetings to ensure scientific rigor.

	This guarantees FTIR becomes a supported modality without undermining the validated Raman foundation.

	## 🔑 Guiding Principles

	- Preserve the Raman baseline as the reproducible ground truth.
	- Additive modularity: New models, image inputs, and FTIR are added as clean, distinct layers rather than overwriting core functionality.
	- Transparency & reproducibility: All expansions documented, tested, and logged with clear outputs.
	- Future-oriented design: Workflows structured to support ongoing collaboration and successor-safe research.