---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---
# Conditional GAN for Visible → Infrared (LLVIP)

> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**

---

## Overview

This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations.

It leverages **multi-loss optimization** — combining perceptual, pixel, adversarial, and edge-based objectives — to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.

Higher emphasis is placed on the **L1 loss**, ensuring that overall brightness and object boundaries remain consistent between the visible and infrared domains.

---

## Dataset

- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)  
  Paired **visible (RGB)** and **infrared (IR)** images under diverse lighting and background conditions.

---

## Model Architecture

- **Type:** Conditional GAN (cGAN)  
- **Direction:** *Visible → Infrared*  
- **Framework:** TensorFlow  
- **Pipeline Tag:** `image-to-image`  
- **License:** MIT  

### Generator
- U-Net encoder–decoder with skip connections
- Conditioned on the RGB input
- Output: single-channel IR image
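
A minimal TensorFlow/Keras sketch of such a generator is shown below; the depth and filter counts are illustrative assumptions, since the card only specifies a U-Net with skip connections, RGB conditioning, and a single-channel output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters, apply_norm=True):
    """Conv block that halves spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False))
    if apply_norm:
        block.add(layers.BatchNormalization())
    block.add(layers.LeakyReLU(0.2))
    return block

def upsample(filters):
    """Transposed-conv block that doubles spatial resolution."""
    block = tf.keras.Sequential()
    block.add(layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False))
    block.add(layers.BatchNormalization())
    block.add(layers.ReLU())
    return block

def build_generator(img_size=256):
    inputs = layers.Input(shape=(img_size, img_size, 3))  # RGB condition
    # Encoder: collect skip tensors at each scale
    skips, x = [], inputs
    for filters, norm in [(64, False), (128, True), (256, True), (512, True), (512, True)]:
        x = downsample(filters, norm)(x)
        skips.append(x)
    # Decoder: mirror the encoder and concatenate the matching skip
    for filters, skip in zip([512, 256, 128, 64], reversed(skips[:-1])):
        x = upsample(filters)(x)
        x = layers.Concatenate()([x, skip])
    # Single-channel IR output in [-1, 1]
    outputs = layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, outputs, name="unet_generator")
```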

### Discriminator
- PatchGAN discriminator: scores realism patch-by-patch, pushing the generator toward fine local detail
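
Since the loss table below names a PatchGAN discriminator, a plausible sketch looks like the following; the exact filter/stride layout is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(img_size=256):
    rgb = layers.Input(shape=(img_size, img_size, 3), name="rgb_condition")
    ir = layers.Input(shape=(img_size, img_size, 1), name="ir_image")
    x = layers.Concatenate()([rgb, ir])  # condition on the RGB input
    for filters, stride in [(64, 2), (128, 2), (256, 2), (512, 1)]:
        x = layers.Conv2D(filters, 4, strides=stride, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per local patch rather than one per image,
    # which is what drives fine-detail learning
    patch_logits = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model([rgb, ir], patch_logits, name="patchgan_discriminator")
```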

---

## ⚙️ Training Configuration

| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed precision |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |
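
In code, this configuration might be set up as follows; the exact mixed-precision policy and the loss-scaling wrapper are assumptions beyond what the table states.

```python
import tensorflow as tf

# Mixed precision as listed above (exact policy is an assumption)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

def make_optimizer():
    # Adam settings from the table; LossScaleOptimizer guards against
    # FP16 gradient underflow in a custom training loop
    adam = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
    return tf.keras.mixed_precision.LossScaleOptimizer(adam)

generator_optimizer = make_optimizer()
discriminator_optimizer = make_optimizer()

BATCH_SIZE = 4
EPOCHS = 100
STEPS_PER_EPOCH = 376
```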

---

## Multi-Loss Function Design

| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |



The **total generator loss** is computed as:  
$$
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
$$
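
A sketch of this objective in TensorFlow is given below. The weights come from the table above; the grayscale-to-RGB tiling for VGG and the Sobel formulation are plausible assumptions, and VGG input preprocessing is omitted for brevity.

```python
import tensorflow as tf

LAMBDA_L1, LAMBDA_PERC, LAMBDA_ADV, LAMBDA_EDGE = 100.0, 10.0, 1.0, 5.0

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# Frozen VGG-19 feature extractor for the perceptual term
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
features = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
features.trainable = False

def perceptual_loss(real_ir, fake_ir):
    # VGG expects 3 channels, so tile the single IR channel
    real3, fake3 = tf.image.grayscale_to_rgb(real_ir), tf.image.grayscale_to_rgb(fake_ir)
    return tf.reduce_mean(tf.abs(features(real3) - features(fake3)))

def edge_loss(real_ir, fake_ir):
    # Sobel gradient difference between real and generated images
    return tf.reduce_mean(tf.abs(tf.image.sobel_edges(real_ir) -
                                 tf.image.sobel_edges(fake_ir)))

def generator_loss(disc_fake_logits, real_ir, fake_ir):
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))               # pixel term
    perc = perceptual_loss(real_ir, fake_ir)                     # VGG term
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)  # fool D
    edge = edge_loss(real_ir, fake_ir)                           # sharpness term
    return (LAMBDA_L1 * l1 + LAMBDA_PERC * perc +
            LAMBDA_ADV * adv + LAMBDA_EDGE * edge)
```

---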


## Evaluation Metrics

| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |
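
These figures can be reproduced with TensorFlow's built-in image metrics; the sketch below assumes tensors rescaled to [0, 1].

```python
import tensorflow as tf

def evaluate_pair(real_ir, fake_ir):
    """real_ir, fake_ir: float tensors in [0, 1], shape [batch, H, W, 1]."""
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    psnr = tf.reduce_mean(tf.image.psnr(real_ir, fake_ir, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(real_ir, fake_ir, max_val=1.0))
    return {"l1": float(l1), "psnr_db": float(psnr), "ssim": float(ssim)}
```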

---
## Model Architectures

| Model | Visualization |
|-------|---------------|
| **Generator** | ![Generator Architecture](generator.png) |
| **Discriminator** | ![Discriminator Architecture](discriminator.png) |
| **Combined GAN** | ![GAN Architecture Combined](gan_architecture_combined.png) |

---
## Data Exploration

We analysed the LLVIP dataset and found that ~70% of image pairs were captured below 50 lux and ~30% between 50 and 200 lux.
The average pedestrian height in the IR channel was X pixels; outliers shorter than 20 pixels were excluded.


## Visual Results

### Training Progress (Sample Evolution)
<img src="ezgif-58298bca2da920.gif" alt="Training Progress" width="700"/>

### ✨ Final Convergence Samples
| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| <img src="./epoch_007.png" width="550"/> | <img src="epoch_100.png" width="550"/> |

### Comparison: Input vs Ground Truth vs Generated

| RGB Input · Ground Truth IR · Predicted IR |
|--------------------------------------------|
| <img src="test_1179.png" width="750"/> |
| <img src="test_001.png" width="750"/> |
| <img src="test_4884.png" width="750"/> |
| <img src="test_5269.png" width="750"/> |
| <img src="test_5361.png" width="750"/> |
| <img src="test_7255.png" width="750"/> |
| <img src="test_7362.png" width="750"/> |
| <img src="test_12015.png" width="750"/> |

---

## Loss Curves

### Generator & Discriminator Loss
<img src="./train_loss_curve.png" alt="Training Loss Curve" width="600"/>

### Validation Loss per Epoch
<img src="./val_loss_curve.png" alt="Validation Loss Curve" width="600"/>

All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```
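
The curves above can be regenerated from the CSV, for example as below; the column names are an assumption about `loss_summary.csv`, not a documented schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loss_summary.csv")  # assumed columns: epoch, g_loss, d_loss
plt.plot(df["epoch"], df["g_loss"], label="Generator")
plt.plot(df["epoch"], df["d_loss"], label="Discriminator")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("train_loss_curve.png", dpi=150)
```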
---

## Observations

- The model **captures IR brightness and object distinction**, but early epochs show slight blur while the L1 term dominates.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP introduces challenges; future fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.
---

## Future Work

- Apply **feature matching loss** for smoother discriminator gradients  
- Add **temporal or sequence consistency** for video IR translation  
- Adaptive loss balancing with epoch-based dynamic weighting (sketched below)
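
One hypothetical shape for that schedule, keeping the L1 term dominant early and ramping the other terms in, is sketched here; all values are illustrative.

```python
def loss_weights(epoch, total_epochs=100):
    """Epoch-based weighting: L1 stays fixed, other terms ramp up."""
    ramp = min(1.0, epoch / (0.7 * total_epochs))  # full strength by ~epoch 70
    return {
        "l1": 100.0,
        "perc": 10.0 * ramp,
        "adv": 1.0 * ramp,
        "edge": 5.0 * ramp,
    }
```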

---
## Acknowledgements

- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU** for high-performance model training