polymer-aging-ml / README.md
devjas1
Merge branch 'main' of https://github.com/KLab-AI3/ml-polymer-recycling
9125d98
|
raw
history blame
7.76 kB
# πŸ”¬ AI-Driven Polymer Aging Prediction and Classification System
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
A research project developed as part of AIRE 2025. This system applies deep learning to Raman spectral data to classify polymer aging β€” a critical proxy for recyclability β€” using a fully reproducible and modular ML pipeline.
---
## 🎯 Project Objective
- Build a validated machine learning system for classifying polymer spectra (predict degradation levels as a proxy for recyclability)
- Compare literature-based and modern CNN architectures (Figure2CNN vs. ResNet1D) on Raman spectral data
- Ensure scientific reproducibility through structured diaignostics and artifact control
- Support sustainability and circular materials research through spectrum-based classification.
---
## 🧠 Model Architectures
| Model| Description |
|------|-------------|
| `Figure2CNN` | Baseline model from literature |
| `ResNet1D` | Deeper candidate model with skip connections |
> Both models support flexible input lengths; Figure2CNN relies on reshape logic, while ResNet1D uses native global pooling.
---
## πŸ“ Project Structure (Cleaned and Current)
```text
ml-polymer-recycling/
β”œβ”€β”€ datasets/
β”œβ”€β”€ models/ # Model architectures
β”œβ”€β”€ scripts/ # Training, inference, utilities
β”œβ”€β”€ outputs/ # Artifacts: models, logs, plots
β”œβ”€β”€ docs/ # Documentation & reports
└── environment.yml # (local) Conda execution environment
```
<img width="1773" height="848" alt="ml-polymer-gitdiagram-0" src="https://github.com/user-attachments/assets/bb5d93dc-7ab9-4259-8513-fb680ae59d64" />
---
## βœ… Current Status
| Track | Status | Test Accuracy |
|-----------|----------------------|----------------|
| **Raman** | βœ… Active & validated | **87.81% Β± 7.59%** |
| **FTIR** | ⏸️ Deferred (modeling only) | N/A |
**Note:** FTIR preprocessing scripts are preserved but inactive. Modeling work is deferred until a suitable architecture is identified.
**Artifacts:**
- `outputs/figure2_model.pth`
- `outputs/resnet_model.pth`
- `outputs/logs/raman_{model}_diagnostics.json`
---
## πŸ”¬ Key Features
- βœ… 10-Fold Stratified Cross-Validation
- βœ… CLI Training: `train_model.py`
- βœ… CLI Inference `run_inference.py`
- βœ… Output artifact naming per model
- βœ… Raman-only preprocessing with baseline correction, smoothing, normalization
- βœ… Structured diagnostics JSON (accuracies, confusion matrices)
- βœ… Canonical validation script (`validate_pipeline.sh`) confirms reproducibility of all core components
---
**Environments:**
```bash
# Local
git checkout main
conda env create -f environment.yml
conda activate polymer_env
# HPC
git checkout hpc-main
conda env create -f environment_hpc.yml
conda activate polymer_env
```
## πŸ“Š Sample Training & Inference
### Training (10-Fold CV)
```bash
python scripts/train_model.py --model resnet --target-len 4000 --baseline --smooth --normalize
```
### Inference (Raman)
```bash
python scripts/run_inference.py --target-len 4000
--input datasets/rdwp/sample123.txt --model outputs/resnet_model.pth
--output outputs/inference/prediction.txt
```
### Inference Output Example:
```bash
Predicted Label: 1 True Label: 1
Raw Logits: [[-569.544, 427.996]]
```
### Validation Script (Raman Pipeline)
```bash
./validate_pipeline.sh
# Runs preprocessing, training, inference, and plotting checks
# Confirms artifact integrity and logs test results
```
---
## πŸ“š Dataset Resources
| Type | Dataset | Source |
|-------|---------|--------|
| Raman | RDWP | [A Raman database of microplastics weathered under natural environments](https://data.mendeley.com/datasets/kpygrf9fg6/1) |
| Datasets should be downloaded separately and placed here:
```bash
datasets/
└── rdwp/
β”œβ”€β”€ sample1.txt
β”œβ”€β”€ sample2.txt
└── ...
```
These files are intentionally excluded from version control via `.gitignore`
---
## πŸ›  Dependencies
- `Python 3.10+`
- `Conda, Git`
- `PyTorch (CPU & CUDA)`
- `Numpy, SciPy, Pandas`
- `Scikit-learn`
- `Matplotlib, Seaborn`
- `ArgParse, JSON`
---
## πŸ§‘β€πŸ€β€πŸ§‘ Contributors
- **Dr. Sanmukh Kuppannagari** β€” Research Mentor
- **Dr. Metin Karailyan** β€” Research Mentor
- **Jaser H.** β€” AIRE 2025 Intern, Developer
---
## 🎯 Strategic Expansion Objectives
> Following Dr. Kuppannagari’s updated guidance, the project scope now extends beyond the Raman-only validated baseline. The roadmap defines three major expansion paths designed to broaden the system’s capabilities and impact:
1. **Model Expansion: Multi-Model Dashboard**
> The dashboard will evolve into a hub for multiple model architectures rather than being tied to a single baseline. Planned work includes:
- **Retraining & Fine-Tuning**: Incorporating publicly available vision models and retraining them with the polymer dataset.
- **Model Registry**: Automatically detecting available .pth weights and exposing them in the dashboard for easy selection.
- **Side-by-Side Reporting**: Running comparative experiments and reporting each model’s accuracy and diagnostics in a standardized format.
- **Reproducible Integration**: Maintaining modular scripts and pipelines so each model’s results can be replicated without conflict.
This ensures flexibility for future research and transparency in performance comparisons.
2. **Image Input Modality**
> The system will support classification on images as an additional modality, extending beyond spectra. Key features will include:
- **Upload Support**: Users can upload single images or batches directly through the dashboard.
- **Multi-Model Execution**: Selected models from the registry can be applied to all uploaded images simultaneously.
- **Batch Results**: Output will be returned in a structured, accessible way, showing both individual predictions and aggregate statistics.
- **Enhanced Feedback**: Outputs will include predicted class, model confidence, and potentially annotated image previews.
This expands the system toward a multi-modal framework, supporting broader research workflows.
3. **FTIR Dataset Integration**
> Although previously deferred, FTIR support will be added back in a modular, distinct fashion. Planned steps are:
- **Dedicated Preprocessing**: Tailored scripts to handle FTIR-specific signal characteristics (multi-layer handling, baseline correction, normalization).
- **Architecture Compatibility**: Ensuring existing and retrained models can process FTIR data without mixing it with Raman workflows.
- **UI Integration**: Introducing FTIR as a separate option in the modality selector, keeping Raman, Image, and FTIR workflows clearly delineated.
- **Phased Development**: Implementation details to be refined during meetings to ensure scientific rigor.
This guarantees FTIR becomes a supported modality without undermining the validated Raman foundation.
## πŸ”‘ Guiding Principles
- **Preserve the Raman baseline** as the reproducible ground truth
- **Additive modularity**: Models, images, and FTIR added as clean, distinct layers rather than overwriting core functionality
- **Transparency & reproducibility**: All expansions documented, tested, and logged with clear outputs.
- **Future-oriented design**: Workflows structured to support ongoing collaboration and successor-safe research.