---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---
# Conditional GAN for Visible → Infrared (LLVIP)

> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**

---

## Overview

This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations.

It leverages **multi-loss optimization** — combining perceptual, pixel, adversarial, and edge-based objectives — to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.

Higher emphasis is placed on the **L1 loss**, ensuring that overall brightness and object boundaries remain consistent between the visible and infrared domains.

---

## Dataset

- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)  
  Paired **visible (RGB)** and **infrared (IR)** images under diverse lighting and background conditions.

---

## Model Architecture

- **Type:** Conditional GAN (cGAN)  
- **Direction:** *Visible → Infrared*  
- **Framework:** TensorFlow  
- **Pipeline Tag:** `image-to-image`  
- **License:** MIT  

### Generator
- U-Net encoder–decoder with skip connections
- Conditioned on the RGB input
- Output: single-channel IR image
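
A minimal TensorFlow/Keras sketch of such a generator is shown below; the depth and filter counts are illustrative assumptions, since the card only specifies a U-Net with skip connections, RGB conditioning, and a single-channel output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters, apply_norm=True):
    """Conv block that halves spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False))
    if apply_norm:
        block.add(layers.BatchNormalization())
    block.add(layers.LeakyReLU(0.2))
    return block

def upsample(filters):
    """Transposed-conv block that doubles spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False))
    block.add(layers.BatchNormalization())
    block.add(layers.ReLU())
    return block

def build_generator(img_size=256):
    inputs = layers.Input(shape=(img_size, img_size, 3))  # RGB condition
    # Encoder: collect skip tensors at each scale
    skips, x = [], inputs
    for filters, norm in [(64, False), (128, True), (256, True), (512, True), (512, True)]:
        x = downsample(filters, norm)(x)
        skips.append(x)
    # Decoder: mirror the encoder and concatenate the matching skip
    for filters, skip in zip([512, 256, 128, 64], reversed(skips[:-1])):
        x = upsample(filters)(x)
        x = layers.Concatenate()([x, skip])
    # Single-channel IR output in [-1, 1]
    outputs = layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, outputs, name="unet_generator")
```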

### Discriminator
- PatchGAN discriminator: scores realism patch-by-patch, pushing the generator toward fine local detail
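
Since the loss table below names a PatchGAN discriminator, a plausible sketch looks like the following; the exact filter/stride layout is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(img_size=256):
    rgb = layers.Input(shape=(img_size, img_size, 3), name="rgb_condition")
    ir = layers.Input(shape=(img_size, img_size, 1), name="ir_image")
    x = layers.Concatenate()([rgb, ir])  # condition on the RGB input
    for filters, stride in [(64, 2), (128, 2), (256, 2), (512, 1)]:
        x = layers.Conv2D(filters, 4, strides=stride, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per local patch rather than one per image,
    # which is what drives fine-detail learning
    patch_logits = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model([rgb, ir], patch_logits, name="patchgan_discriminator")
```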

---

## ⚙️ Training Configuration

| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed precision |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |
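
In code, this configuration might be set up as follows; the exact mixed-precision policy and the loss-scaling wrapper are assumptions beyond what the table states.

```python
import tensorflow as tf

# Mixed precision as listed above (exact policy is an assumption)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

def make_optimizer():
    # Adam settings from the table; LossScaleOptimizer guards against
    # FP16 gradient underflow in a custom training loop
    adam = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
    return tf.keras.mixed_precision.LossScaleOptimizer(adam)

generator_optimizer = make_optimizer()
discriminator_optimizer = make_optimizer()

BATCH_SIZE = 4
EPOCHS = 100
STEPS_PER_EPOCH = 376
```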

---

## Multi-Loss Function Design

| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |



The **total generator loss** is computed as:  
$$
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
$$
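
A sketch of this objective in TensorFlow is given below. The weights come from the table above; the grayscale-to-RGB tiling for VGG and the Sobel formulation are plausible assumptions, and VGG input preprocessing is omitted for brevity.

```python
import tensorflow as tf

LAMBDA_L1, LAMBDA_PERC, LAMBDA_ADV, LAMBDA_EDGE = 100.0, 10.0, 1.0, 5.0

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# Frozen VGG-19 feature extractor for the perceptual term
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
features = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
features.trainable = False

def perceptual_loss(real_ir, fake_ir):
    # VGG expects 3 channels, so tile the single IR channel
    real3, fake3 = tf.image.grayscale_to_rgb(real_ir), tf.image.grayscale_to_rgb(fake_ir)
    return tf.reduce_mean(tf.abs(features(real3) - features(fake3)))

def edge_loss(real_ir, fake_ir):
    # Sobel gradient difference between real and generated images
    return tf.reduce_mean(tf.abs(tf.image.sobel_edges(real_ir) -
                                 tf.image.sobel_edges(fake_ir)))

def generator_loss(disc_fake_logits, real_ir, fake_ir):
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))               # pixel term
    perc = perceptual_loss(real_ir, fake_ir)                     # VGG term
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)  # fool D
    edge = edge_loss(real_ir, fake_ir)                           # sharpness term
    return (LAMBDA_L1 * l1 + LAMBDA_PERC * perc +
            LAMBDA_ADV * adv + LAMBDA_EDGE * edge)
```

---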


## Evaluation Metrics

| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |
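
These figures can be reproduced with TensorFlow's built-in image metrics; the sketch below assumes tensors rescaled to [0, 1].

```python
import tensorflow as tf

def evaluate_pair(real_ir, fake_ir):
    """real_ir, fake_ir: float tensors in [0, 1], shape [batch, H, W, 1]."""
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    psnr = tf.reduce_mean(tf.image.psnr(real_ir, fake_ir, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(real_ir, fake_ir, max_val=1.0))
    return {"l1": float(l1), "psnr_db": float(psnr), "ssim": float(ssim)}
```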

---
## Model Architectures

| Model | Visualization |
|-------|---------------|
| **Generator** | ![Generator Architecture](generator.png) |
| **Discriminator** | ![Discriminator Architecture](discriminator.png) |
| **Combined GAN** | ![GAN Architecture Combined](gan_architecture_combined.png) |

---
## Data Exploration

We analysed the LLVIP dataset and found that ~70% of image pairs were captured below 50 lux and ~30% between 50 and 200 lux.
The average pedestrian height in the IR channel was X pixels; outliers shorter than 20 pixels were excluded.


## Visual Results

### Training Progress (Sample Evolution)
<img src="ezgif-58298bca2da920.gif" alt="Training Progress" width="700"/>

### ✨ Final Convergence Samples
| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| <img src="./epoch_007.png" width="550"/> | <img src="epoch_100.png" width="550"/> |

### Comparison: Input vs Ground Truth vs Generated

| RGB Input · Ground Truth IR · Predicted IR |
|--------------------------------------------|
| <img src="test_1179.png" width="750"/> |
| <img src="test_001.png" width="750"/> |
| <img src="test_4884.png" width="750"/> |
| <img src="test_5269.png" width="750"/> |
| <img src="test_5361.png" width="750"/> |
| <img src="test_7255.png" width="750"/> |
| <img src="test_7362.png" width="750"/> |
| <img src="test_12015.png" width="750"/> |

---

## Loss Curves

### Generator & Discriminator Loss
<img src="./train_loss_curve.png" alt="Training Loss Curve" width="600"/>

### Validation Loss per Epoch
<img src="./val_loss_curve.png" alt="Validation Loss Curve" width="600"/>

All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```
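
The curves above can be regenerated from the CSV, for example as below; the column names are an assumption about `loss_summary.csv`, not a documented schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loss_summary.csv")  # assumed columns: epoch, g_loss, d_loss
plt.plot(df["epoch"], df["g_loss"], label="Generator")
plt.plot(df["epoch"], df["d_loss"], label="Discriminator")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("train_loss_curve.png", dpi=150)
```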
---

## Observations

- The model **captures IR brightness and object distinction**, but early epochs show slight blur while the L1 term dominates.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP introduces challenges; future fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.
---

## Future Work

- Apply **feature matching loss** for smoother discriminator gradients  
- Add **temporal or sequence consistency** for video IR translation  
- Adaptive loss balancing with epoch-based dynamic weighting (sketched below)
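
One hypothetical shape for that schedule, keeping the L1 term dominant early and ramping the other terms in, is sketched here; all values are illustrative.

```python
def loss_weights(epoch, total_epochs=100):
    """Epoch-based weighting: L1 stays fixed, other terms ramp up."""
    ramp = min(1.0, epoch / (0.7 * total_epochs))  # full strength by ~epoch 70
    return {
        "l1": 100.0,
        "perc": 10.0 * ramp,
        "adv": 1.0 * ramp,
        "edge": 5.0 * ramp,
    }
```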

---
## Acknowledgements

- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU** for high-performance model training