Conditional GAN for Visible → Infrared (LLVIP)
High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization
Overview
This project implements a Conditional Generative Adversarial Network (cGAN) trained to translate visible-light (RGB) images into infrared (IR) representations.
It leverages multi-loss optimization — combining perceptual, pixel, adversarial, and edge-based objectives — to generate sharp, realistic IR outputs that preserve both scene structure and thermal contrast.
The L1 term carries the highest weight, keeping overall brightness and object boundaries consistent between the visible and infrared domains.
Dataset
- Dataset: LLVIP Dataset, providing paired visible (RGB) and infrared (IR) images captured under diverse lighting and background conditions.
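A minimal input-pipeline sketch for loading the paired images (a sketch, assuming the standard LLVIP folder layout `visible/train` and `infrared/train`, a 256×256 working resolution, and [-1, 1] scaling; adjust to the actual setup):

```python
import tensorflow as tf

IMG_SIZE = 256  # assumed training resolution

def load_pair(vis_path, ir_path):
    # Visible image: 3-channel RGB, scaled to [-1, 1]
    vis = tf.io.decode_jpeg(tf.io.read_file(vis_path), channels=3)
    vis = tf.cast(tf.image.resize(vis, [IMG_SIZE, IMG_SIZE]), tf.float32) / 127.5 - 1.0
    # Infrared target: single channel, same scaling
    ir = tf.io.decode_jpeg(tf.io.read_file(ir_path), channels=1)
    ir = tf.cast(tf.image.resize(ir, [IMG_SIZE, IMG_SIZE]), tf.float32) / 127.5 - 1.0
    return vis, ir

vis_files = tf.data.Dataset.list_files("LLVIP/visible/train/*.jpg", shuffle=False)
ir_files = tf.data.Dataset.list_files("LLVIP/infrared/train/*.jpg", shuffle=False)
train_ds = (tf.data.Dataset.zip((vis_files, ir_files))
            .map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(512)
            .batch(4)
            .prefetch(tf.data.AUTOTUNE))
```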
Model Architecture
- Type: Conditional GAN (cGAN)
- Direction: Visible → Infrared
- Framework: TensorFlow
- Pipeline Tag: image-to-image
- License: MIT
Generator
- U-Net encoder–decoder with skip connections
- Conditioned on RGB input
- Output: single-channel IR image
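A compact sketch of the U-Net generator described above (filter counts and depth are illustrative assumptions; the tanh output matches a single-channel IR target in [-1, 1]):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def downsample(filters):
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
    ])

def upsample(filters):
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

def build_generator():
    inp = layers.Input(shape=[256, 256, 3])  # conditioning RGB image
    # Encoder: progressively halve spatial resolution
    d1 = downsample(64)(inp)   # 128x128
    d2 = downsample(128)(d1)   # 64x64
    d3 = downsample(256)(d2)   # 32x32
    d4 = downsample(512)(d3)   # 16x16
    # Decoder with skip connections back to encoder features
    u1 = layers.Concatenate()([upsample(256)(d4), d3])
    u2 = layers.Concatenate()([upsample(128)(u1), d2])
    u3 = layers.Concatenate()([upsample(64)(u2), d1])
    # Single-channel IR output in [-1, 1]
    out = layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh")(u3)
    return Model(inp, out)
```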
Discriminator
- PatchGAN discriminator that scores local patches of the concatenated (RGB, IR) pair rather than the whole image, pushing the generator toward realistic fine detail
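A sketch of the conditional PatchGAN discriminator (the layer configuration is an assumption; the defining property is one realism logit per local patch instead of a single score per image):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_discriminator():
    rgb = layers.Input(shape=[256, 256, 3])  # conditioning input
    ir = layers.Input(shape=[256, 256, 1])   # real or generated IR
    x = layers.Concatenate()([rgb, ir])
    for filters, stride in [(64, 2), (128, 2), (256, 2), (512, 1)]:
        x = layers.Conv2D(filters, 4, strides=stride, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per receptive-field patch; BCE is applied patch-wise
    return Model([rgb, ir], layers.Conv2D(1, 4, strides=1, padding="same")(x))
```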
⚙️ Training Configuration
| Setting | Value |
|---|---|
| Epochs | 100 |
| Steps per Epoch | 376 |
| Batch Size | 4 |
| Optimizer | Adam (β₁ = 0.5, β₂ = 0.999) |
| Learning Rate | 2e-4 |
| Precision | Mixed (32) |
| Hardware | NVIDIA T4 (Kaggle GPU Runtime) |
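The optimizer rows of the table map directly onto Keras (a sketch; separate optimizers for the generator and discriminator are assumed, as is standard for GAN training):

```python
import tensorflow as tf

# Assumed interpretation of "Mixed (32)": float16 compute with float32 variables
tf.keras.mixed_precision.set_global_policy("mixed_float16")

gen_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
```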
Multi-Loss Function Design
| Loss Type | Description | Weight (λ) | Purpose |
|---|---|---|---|
| L1 Loss | Pixel-wise mean absolute error between generated and real IR | 100 | Ensures global brightness & shape consistency |
| Perceptual Loss (VGG) | Feature loss from block5_conv4 of pretrained VGG-19 | 10 | Captures high-level texture and semantic alignment |
| Adversarial Loss | Binary cross-entropy loss from PatchGAN discriminator | 1 | Encourages realistic IR texture generation |
| Edge Loss | Sobel/gradient difference between real & generated images | 5 | Enhances sharpness and edge clarity |
The total generator loss is computed as:
$$
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
$$
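Putting the table and the equation together, a sketch of the total generator loss in TensorFlow (the VGG preprocessing of the single-channel IR images is an assumption):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# Perceptual feature extractor: block5_conv4 of an ImageNet-pretrained VGG-19
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feat = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feat.trainable = False

def to_vgg_input(x):
    # Map single-channel [-1, 1] IR to 3-channel [0, 255] VGG input (assumed convention)
    return tf.keras.applications.vgg19.preprocess_input(
        tf.image.grayscale_to_rgb((x + 1.0) * 127.5))

def generator_loss(disc_fake_logits, fake_ir, real_ir):
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    perc = tf.reduce_mean(tf.abs(feat(to_vgg_input(real_ir)) - feat(to_vgg_input(fake_ir))))
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # tf.image.sobel_edges returns stacked dy/dx gradient maps
    edge = tf.reduce_mean(tf.abs(tf.image.sobel_edges(real_ir) - tf.image.sobel_edges(fake_ir)))
    return 100.0 * l1 + 10.0 * perc + 1.0 * adv + 5.0 * edge
```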
Evaluation Metrics
| Metric | Definition | Result |
|---|---|---|
| L1 Loss | Mean absolute difference between generated and ground truth IR | 0.0611 |
| PSNR (Peak Signal-to-Noise Ratio) | Measures reconstruction quality (higher is better) | 24.3096 dB |
| SSIM (Structural Similarity Index Measure) | Perceptual similarity between generated & target images | 0.8386 |
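The table's figures can be reproduced with TensorFlow's built-in image metrics (a sketch; assumes outputs and targets in [-1, 1], rescaled to [0, 1] before scoring):

```python
import tensorflow as tf

def evaluate(generator, dataset):
    l1s, psnrs, ssims = [], [], []
    for rgb, real_ir in dataset:
        fake_ir = generator(rgb, training=False)
        # Rescale from [-1, 1] to [0, 1] for PSNR/SSIM
        real = (real_ir + 1.0) / 2.0
        fake = (fake_ir + 1.0) / 2.0
        l1s.append(tf.reduce_mean(tf.abs(real - fake)))
        psnrs.append(tf.reduce_mean(tf.image.psnr(real, fake, max_val=1.0)))
        ssims.append(tf.reduce_mean(tf.image.ssim(real, fake, max_val=1.0)))
    return {k: float(tf.reduce_mean(v)) for k, v in
            [("l1", l1s), ("psnr", psnrs), ("ssim", ssims)]}
```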
Data Exploration
We analysed the LLVIP dataset and found that ~70% of image pairs were captured below 50 lux and ~30% at 50–200 lux. The average pedestrian height in the IR channel was X pixels; outliers under 20 pixels in height were excluded.
Visual Results
Training Progress (Sample Evolution)
✨ Final Convergence Samples
| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|---|---|
| ![]() | ![]() |
Comparison: Input vs Ground Truth vs Generated
| RGB Input | Ground Truth IR | Predicted IR |
|---|---|---|
Loss Curves
Generator & Discriminator Loss
Validation Loss per Epoch
All training metrics are logged in:

```
/
├── logs.log
└── loss_summary.csv
```
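A sketch for plotting the curves from loss_summary.csv (the column names epoch, gen_loss, disc_loss, and val_loss are assumptions about the CSV schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

log = pd.read_csv("loss_summary.csv")  # assumed columns: epoch, gen_loss, disc_loss, val_loss
plt.plot(log["epoch"], log["gen_loss"], label="Generator")
plt.plot(log["epoch"], log["disc_loss"], label="Discriminator")
plt.plot(log["epoch"], log["val_loss"], label="Validation")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("loss_curves.png")
```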
Observations
- The model captures IR brightness and object distinction, but early epochs show slight blur due to L1-dominant stages.
- Contrast and edge sharpness improve after ~70 epochs as the adversarial and perceptual terms exert more influence on the output.
- Background variations in LLVIP introduce challenges; future fine-tuning on domain-aligned subsets can further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.
Future Work
- Apply feature matching loss for smoother discriminator gradients
- Add temporal or sequence consistency for video IR translation
- Adaptive loss balancing with epoch-based dynamic weighting
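One possible shape for the epoch-based dynamic weighting in the last bullet (purely illustrative; the ramp schedule and constants are assumptions, not something implemented in this project):

```python
def loss_weights(epoch, total_epochs=100):
    # Keep L1 dominant early; ramp the perceptual/adversarial influence
    # up to full strength by ~70% of training
    ramp = min(1.0, epoch / (0.7 * total_epochs))
    return {"l1": 100.0, "perc": 10.0 * ramp, "adv": 1.0 * ramp, "edge": 5.0}
```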
Acknowledgements
- LLVIP Dataset for paired RGB–IR samples
- TensorFlow and VGG-19 for perceptual feature extraction
- Kaggle GPU runtime for high-performance model training




