
Architectural Reassessment (September 2025)

The initial implementation adopted a motion-driven portrait reenactment stack (LivePortrait ONNX models plus custom alignment and smoothing), which is misaligned with the updated product goal: low-latency, real-time face swapping with optional enhancement.

Misalignment Summary

| Target Need | LivePortrait Path | Impact |
| --- | --- | --- |
| Direct identity substitution | Motion reenactment of a canonicalized reference | Unnecessary motion keypoint pipeline |
| Minimal per-frame latency (<80 ms) | ~500–600 ms generator stages logged | Fails real-time threshold |
| Simple detector→swap flow | Multi-stage appearance + motion + generator | Complexity & fragile compositing |
| Artifact cleanup (optional) | No enhancement stage | Lower visual fidelity |
| Multi-face capability | Single-face canonical reenactment focus | Limits scalability |

New Model Stack

  1. Detector / embeddings: insightface FaceAnalysis (buffalo_l pack → SCRFD_10G_KPS + recognition)
  2. Swapper: inswapper_128_fp16.onnx
  3. Enhancement (optional):
  • CodeFormer (codeformer.pth) for fidelity‑controllable restoration
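
A minimal loading sketch using the insightface Python API (the local file paths and CPU execution provider are assumptions; the downloader may place weights elsewhere):

```python
# Sketch: loading the new stack via insightface (paths/providers are assumptions).
import cv2
import insightface
from insightface.app import FaceAnalysis

# Detector + recognition embeddings: the buffalo_l pack bundles SCRFD_10G_KPS
# plus an ArcFace recognition model.
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

# Identity swapper (assumed local path; see MIRAGE_INSWAPPER_URL below).
swapper = insightface.model_zoo.get_model(
    "models/inswapper_128_fp16.onnx", providers=["CPUExecutionProvider"]
)

# Pre-extract the source identity once; it is reused for every frame.
source_face = app.get(cv2.imread("source.jpg"))[0]  # assumes one clear face
```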

New Processing Loop

  1. Capture frame
  2. Detect faces (FaceAnalysis)
  3. For each target face (top-N): apply InSwapper with pre-extracted source identity
  4. (Optional) Run CodeFormer enhancer on final composited frame (if weights present)
  5. Emit frame to WebRTC
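
A sketch of the loop body, reusing `app`, `swapper`, and `source_face` from the loading sketch above (`process_frame` and the `enhance` hook are illustrative names):

```python
# Sketch of the per-frame swap path (helper names are illustrative).
def process_frame(frame, source_face, max_faces=1, enhance=None):
    faces = app.get(frame)                      # step 2: detect + embed
    # Sort largest-first so top-N picks the most prominent faces.
    faces.sort(key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]),
               reverse=True)
    for target in faces[:max_faces]:            # step 3: swap each target
        frame = swapper.get(frame, target, source_face, paste_back=True)
    if enhance is not None:                     # step 4: optional CodeFormer pass
        frame = enhance(frame)
    return frame                                # step 5: hand off to transport
```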

Environment Variables (Video / Enhancer)

| Variable | Values | Description |
| --- | --- | --- |
| MIRAGE_MAX_FACES | int (default 1) | Swap up to N largest faces |
| MIRAGE_CODEFORMER_FIDELITY | 0.0–1.0 (default 0.75) | Balance identity (1.0) vs. reconstruction sharpness |
| MIRAGE_INSWAPPER_URL | URL | Override InSwapper model source |
| MIRAGE_CODEFORMER_URL | URL | Override CodeFormer model source |
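
A sketch of how the pipeline might read these (defaults taken from the table; a `None` URL would fall back to the built-in model source):

```python
# Sketch: environment-driven settings (defaults per the table above).
import os

MAX_FACES = int(os.environ.get("MIRAGE_MAX_FACES", "1"))
CODEFORMER_FIDELITY = float(os.environ.get("MIRAGE_CODEFORMER_FIDELITY", "0.75"))
INSWAPPER_URL = os.environ.get("MIRAGE_INSWAPPER_URL")    # None -> default source
CODEFORMER_URL = os.environ.get("MIRAGE_CODEFORMER_URL")  # None -> default source
```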

Deprecated / To Remove

liveportrait_engine.py, avatar_pipeline.py, alignment.py, smoothing.py, realtime_optimizer.py, virtual_camera.py (currently unused), enhanced_metrics.py, landmark_reenactor.py, safe_model_integration.py, debug_mediapipe.py

These abstractions are reenactment-specific (appearance feature caching, keypoint smoothing, inverse warp compositing) and will be replaced by a concise swap_pipeline.py.


Goals

  • End-to-end audio latency < 250 ms (capture → inference → playback)
  • Video pipeline: 512x512 @ ≥20 FPS target under load

Hardware Target

  • Phases 1–2: CPU-only baseline (development + echo scaffolds)
  • Later phases: Single NVIDIA A10G (24GB) for combined audio + video low-latency inference

Voice Pipeline

| Item | Planned |
| --- | --- |
| Framework | TODO |
| Content Encoder | TODO |
| F0 Extractor | RMVPE |
| Chunk Size | 192 ms |
| Sample Rate | 16 kHz |
| Precision | FP16 mixed |
| Overlap-Add | disabled |
| Accept Threshold | runtime/real ratio < 0.65 (runtime under 0.65 × chunk duration) |
| Fail Condition | runtime/real ratio > 0.80 for 40 consecutive chunks |
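
A sketch of the accept/fail bookkeeping implied by the last two rows (the class and return labels are hypothetical):

```python
# Sketch: per-chunk health tracking (thresholds per the table above).
ACCEPT_RATIO = 0.65        # runtime/real below this counts as accepted
FAIL_RATIO = 0.80          # above this counts toward the fail streak
FAIL_STREAK_LIMIT = 40     # consecutive failing chunks before hard fail

class ChunkHealth:
    def __init__(self):
        self.fail_streak = 0

    def observe(self, runtime_ms: float, chunk_ms: float = 192.0) -> str:
        ratio = runtime_ms / chunk_ms
        if ratio > FAIL_RATIO:
            self.fail_streak += 1
            if self.fail_streak >= FAIL_STREAK_LIMIT:
                return "fail"        # sustained overload
            return "slow"
        self.fail_streak = 0
        return "accept" if ratio < ACCEPT_RATIO else "marginal"
```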

Video Pipeline

| Item | Planned |
| --- | --- |
| Model | TODO |
| Detector | SCRFD |
| Detect Interval | 5 frames |
| Resolution | 512x512 |
| FPS Target | 20 |
| Confidence Threshold (stable) | ≥0.85 |
| Re-detect Threshold | <0.70 confidence triggers re-detect next frame |
| Quality Degrade Order | quality → fps → resolution |

Transport

  • WebSockets (bi-directional control + media chunks)
  • Audio: PCM16, 16 kHz, mono frames (chunked ~192 ms)
  • Video: JPEG-compressed frames (progressive baseline) initially
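
The audio framing follows directly from these numbers; a quick sanity check:

```python
# Audio chunk sizing derived from the transport parameters above.
SAMPLE_RATE = 16_000       # Hz, mono
BYTES_PER_SAMPLE = 2       # PCM16
CHUNK_MS = 192

samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000    # 3072 samples
chunk_bytes = samples_per_chunk * BYTES_PER_SAMPLE    # 6144 bytes per WS message
```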

Sync Strategy

  • Audio clock is master timeline
  • Drop late video frames if video timestamp >150 ms behind audio head
  • Never let audio lead video by more than 150 ms (else request video degrade)
  • If drift is sustained (>120 frames), re-align via a soft video skip, never by stretching audio
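
A sketch of the per-frame decision, assuming millisecond timestamps on both streams (state handling is illustrative):

```python
# Sketch: audio-master sync decisions (thresholds per the rules above).
MAX_LAG_MS = 150           # video this far behind the audio head is late
DRIFT_FRAME_LIMIT = 120    # sustained lateness before a soft skip re-align

class SyncState:
    def __init__(self):
        self.late_frames = 0

    def on_video_frame(self, video_ts_ms: float, audio_head_ms: float) -> str:
        lag = audio_head_ms - video_ts_ms
        if lag <= MAX_LAG_MS:
            self.late_frames = 0
            return "render"
        self.late_frames += 1
        if self.late_frames > DRIFT_FRAME_LIMIT:
            self.late_frames = 0
            return "soft_skip"     # jump video forward; never stretch audio
        return "drop"              # discard the late frame, request degrade
```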

Metrics

Planned collection (no aggregation service yet):

  • Audio chunks processed (count, accept ratio)
  • Per-stage timings: encode / f0 / convert / post
  • Video frames processed & dropped
  • Detector invoke interval stats
  • GPU memory (allocated / reserved / fragmentation indicator)
  • Queue backlog lengths (audio, video, outbound)
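
Until an aggregation service exists, a flat in-process container is enough; the field names below are illustrative:

```python
# Sketch: in-process metrics container (no aggregation service yet).
from dataclasses import dataclass, field

@dataclass
class PipelineMetrics:
    audio_chunks: int = 0
    audio_accepted: int = 0
    video_frames: int = 0
    video_dropped: int = 0
    stage_ms: dict = field(default_factory=lambda: {
        "encode": [], "f0": [], "convert": [], "post": []})
    queue_backlog: dict = field(default_factory=lambda: {
        "audio": 0, "video": 0, "outbound": 0})

    @property
    def accept_ratio(self) -> float:
        return self.audio_accepted / self.audio_chunks if self.audio_chunks else 0.0
```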

Adaptation Rules

  • Increase audio chunk size if runtime/real ratio >0.75 for last 40 chunks
  • Decrease audio chunk size if runtime/real ratio <0.55 for last 120 chunks
  • Video degrade order:
    1. Lower quality (JPEG quality step down)
    2. Reduce FPS toward 15
    3. Reduce resolution (512 → 384 → 256)
  • Restore in reverse order after 300 stable frames (<0.60 ratio)
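
A sketch of the audio-side trigger logic (window handling and action labels are illustrative):

```python
# Sketch: adaptation triggers from recent runtime/real ratios (rules above).
GROW_RATIO, GROW_WINDOW = 0.75, 40         # sustained overload -> bigger chunks
SHRINK_RATIO, SHRINK_WINDOW = 0.55, 120    # sustained headroom -> smaller chunks
RESTORE_RATIO, RESTORE_WINDOW = 0.60, 300  # stable frames before undoing a step

VIDEO_DEGRADE_ORDER = ["jpeg_quality", "fps", "resolution"]  # applied left to right

def audio_chunk_action(ratios: list[float]) -> str:
    if len(ratios) >= GROW_WINDOW and all(
            r > GROW_RATIO for r in ratios[-GROW_WINDOW:]):
        return "increase_chunk"
    if len(ratios) >= SHRINK_WINDOW and all(
            r < SHRINK_RATIO for r in ratios[-SHRINK_WINDOW:]):
        return "decrease_chunk"
    return "hold"
```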

Security

  • Planned: JWT token required for WebSocket upgrade
  • Rate limiting (connection + message frequency)
  • Frame size guard (reject > configured max bytes)
  • Basic anomaly detection: abandon the session if >30% of frames are invalid over a 200-frame window
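
A sketch of the frame guard and windowed anomaly check (the byte cap is an assumed configurable value):

```python
# Sketch: frame size guard + anomaly window (limits per this section).
from collections import deque

MAX_FRAME_BYTES = 512 * 1024     # assumed configurable cap
WINDOW, INVALID_LIMIT = 200, 0.30

class FrameGuard:
    def __init__(self):
        self.window = deque(maxlen=WINDOW)

    def check(self, frame: bytes) -> str:
        valid = len(frame) <= MAX_FRAME_BYTES
        self.window.append(valid)
        if (len(self.window) == WINDOW
                and self.window.count(False) / WINDOW > INVALID_LIMIT):
            return "abandon_session"
        return "ok" if valid else "reject_frame"
```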

Licensing

  • Current: MIT
  • Future: Add LICENSES.md enumerating third-party components and model licenses

Phases

| Phase | Status |
| --- | --- |
| 1 | Completed (Echo scaffold, static client) |
| 2 | Completed (Metrics + config + voice stub + GPU info) |
| 3 | In Progress (governance + groundwork for adaptation) |
| 4 | Pending |
| 5 | Pending |
| 6 | Pending |
| 7 | Pending |
| 8 | Pending |
| 9 | Pending |
| 10 | Pending |

Open Questions

  • RVC fork URL to adopt / baseline? (Need candidate repo)
  • Reenact / face animation model repo selection?
  • Alignment expectation: do we need phoneme-level alignment or chunk-level only?

Face Detector Standardization

Adopt SCRFD as the unified face detector. Run detection every 5 frames and reuse boxes otherwise. Force an immediate re-detect if the last confidence was <0.70. Consider a face track stable when confidence is ≥0.85 for 3 consecutive detections. This reduces detector load while maintaining robustness against drift and occlusion.
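
A sketch of this cadence; the scheduler and its `detect_fn`/`reuse_fn` hooks are hypothetical:

```python
# Sketch: SCRFD invocation cadence (interval/thresholds per this section).
DETECT_INTERVAL = 5      # frames between full detector runs
REDETECT_CONF = 0.70     # below this, force re-detect on the next frame
STABLE_CONF = 0.85       # needed on 3 consecutive detections for a stable track

class DetectScheduler:
    def __init__(self):
        self.frames_since_detect = DETECT_INTERVAL  # detect on the first frame
        self.last_conf = 1.0
        self.stable_hits = 0

    def on_frame(self, detect_fn, reuse_fn):
        if (self.frames_since_detect >= DETECT_INTERVAL
                or self.last_conf < REDETECT_CONF):
            boxes, self.last_conf = detect_fn()     # full SCRFD pass
            self.frames_since_detect = 0
            self.stable_hits = (self.stable_hits + 1
                                if self.last_conf >= STABLE_CONF else 0)
        else:
            boxes = reuse_fn()                      # reuse cached boxes
            self.frames_since_detect += 1
        return boxes, self.stable_hits >= 3         # (boxes, track_is_stable)
```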