Project Rules and Workflow (Python FastAPI + Transformers)
These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.
1) Documentation rules (must-do on every change)
Always update documentation when code or behavior changes.
Minimum documentation checklist:
- What changed and where (filenames, sections, or callable links like Python.function chat_completions()).
- Why the change was made (problem or requirement).
- How to operate or verify (commands, endpoints, examples).
- Follow-ups or known limitations.
Where to update:
- Operator-facing: README.md
- Developer-facing: CLAUDE.md (rationale, alternatives, caveats)
- Architecture or flows: ARCHITECTURE.md
- Tasks and statuses: TODO.md
Never skip documentation. If a change is reverted, document the revert.
2) Git discipline (mandatory)
- Always use Git. Every change or progress step MUST be committed and pushed.
- Windows CMD example:
- git add .
- git commit -m "type(scope): short description"
- git push
- No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in README.md and CLAUDE.md, but commits are still required locally.
- Commit style:
- Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
- Keep commits small and atomic (one concern per commit).
- Reference important files in the commit body, for example: updated Python.function chat_completions(), README.md.
- After updating code or docs, commit immediately. Do not batch unrelated changes.
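The commit-style rule above can be checked mechanically before committing. The following is a minimal sketch, not part of the project: the function name `is_valid_subject` is hypothetical, and treating the `(scope)` part as optional is an assumption.

```python
import re

# Conventional types listed in the commit-style rule above.
COMMIT_TYPES = ("chore", "docs", "feat", "fix", "refactor", "perf", "test", "build", "ci")

# Matches "type(scope): short description".
# Assumption: the "(scope)" part is optional; the rules only show it in examples.
_SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<summary>\S.*)$"
)


def is_valid_subject(subject: str) -> bool:
    """Return True if the first line of a commit message follows the style."""
    return _SUBJECT_RE.match(subject) is not None
```

A check like this could run in a local commit hook, so that a malformed subject line is caught before the push.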
2.1) Progress log (mandatory)
- Every commit MUST include a corresponding entry in CLAUDE.md under a “Progress Log” section.
- Each entry must include:
- Date/time (Asia/Jakarta)
- Scope and short summary of the change
- The final Git commit hash and commit message
- Files and exact callable anchors touched (use clickable anchors), e.g. Python.function chat_completions(), README.md, ARCHITECTURE.md
- Verification steps and results (curl examples, expected vs actual, notes)
- Required sequence:
- Make code changes
- Update docs: README.md, ARCHITECTURE.md, TODO.md, and add a new progress log entry in CLAUDE.md
- Run Git commands:
- git add .
- git commit -m "type(scope): short description"
- git push
- Append the final commit hash to the CLAUDE.md entry if it was not known at authoring time
- No code change may land without a synchronized progress log entry.
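As an illustration of the required entry fields, a progress log entry could be rendered by a small helper. This is a sketch under assumptions: `format_log_entry`, its field labels, and the `pending` placeholder for a not-yet-known hash are hypothetical, not an existing project utility. Asia/Jakarta is UTC+7 with no daylight saving time, so a fixed offset is used.

```python
from datetime import datetime, timedelta, timezone

# Asia/Jakarta (WIB) is UTC+7 year-round, so a fixed offset is sufficient.
JAKARTA = timezone(timedelta(hours=7), name="WIB")


def format_log_entry(when: datetime, scope: str, summary: str, commit_hash: str,
                     commit_message: str, files: list, verification: str) -> str:
    """Render one Progress Log entry as plain-text bullet lines."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    local = when.astimezone(JAKARTA)
    lines = [
        f"- Date/time: {local.strftime('%Y-%m-%d %H:%M')} (Asia/Jakarta)",
        f"- Scope: {scope}: {summary}",
        # The hash may be unknown at authoring time; it is appended after the commit.
        f"- Commit: {commit_hash or 'pending'} {commit_message}",
        f"- Files: {', '.join(files)}",
        f"- Verification: {verification}",
    ]
    return "\n".join(lines)
```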
3) Large artifacts policy (.gitignore)
Never commit large/generated artifacts. Keep the repository lean and reproducible.
Must be ignored:
- models/ (downloaded by HF/Transformers cache or tools at runtime)
- .venv/, venv/
- __pycache__/
- .cache/
- uploads/, data/, tmp/
See .gitignore and extend as needed for new generated outputs. If you add ignores, document the rationale in CLAUDE.md.
4) Model policy (Hugging Face / Transformers)
Target default model:
- Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).
Rules:
- Use Hugging Face Transformers (AutoModelForCausalLM + AutoProcessor) with trust_remote_code=True.
- Do not commit model weights or caches. Let from_pretrained() download to local caches.
- Handle authentication for gated models via the HF_TOKEN environment variable (template in .env.example).
- The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
- Keep configuration via environment variables (see Python.os.getenv()).
5) API contract
Provide an OpenAI-compatible endpoint:
- POST /v1/chat/completions
Minimum behavior:
- Accept model and messages per the OpenAI schema (the server honors messages; model is informational because the served model is pinned via MODEL_REPO_ID).
- Non-streaming JSON response.
- Streaming SSE response when body.stream=true:
- Emit OpenAI-style chat.completion.chunk deltas.
- Include SSE id lines "session_id:index" to support resume via Last-Event-ID.
Resume semantics:
- Client provides a session_id (or server generates one).
- Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
- Session data can be persisted (SQLite) if enabled.
Manual cancel (custom extension):
- POST /v1/cancel/{session_id} cancels a streaming generation.
- Note: This is not part of the legacy OpenAI Chat Completions spec. It mirrors the spirit of the newer OpenAI Responses API cancel endpoint.
All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.
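The SSE id convention and the Last-Event-ID resume header described above could be framed and parsed roughly as follows. This is a minimal sketch: the helper names `format_sse_event` and `parse_last_event_id` are hypothetical and do not refer to the server's actual functions. Splitting on the last colon is an assumption that lets a session_id itself contain colons.

```python
import json


def format_sse_event(session_id: str, index: int, chunk: dict) -> str:
    """Frame one chunk as an SSE event whose id is "session_id:index"."""
    data = json.dumps(chunk, separators=(",", ":"))
    # A blank line terminates the event in the SSE wire format.
    return f"id: {session_id}:{index}\ndata: {data}\n\n"


def parse_last_event_id(value: str) -> tuple:
    """Split "session_id:index" into (session_id, index).

    Splits on the LAST colon (assumption) so the session_id may contain colons.
    """
    session_id, sep, index = value.rpartition(":")
    if not sep or not session_id or not (index.isascii() and index.isdigit()):
        raise ValueError(f"malformed Last-Event-ID: {value!r}")
    return session_id, int(index)
```

On reconnect, the server would replay every buffered chunk whose index is greater than the parsed index, and a malformed header would map to a 400 response.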
6) Streaming, persistence, and cancellation
- Streaming is implemented via SSE in Python.function chat_completions() with token iteration in Python.function infer_stream.
- In-memory ring buffer per session and optional SQLite persistence for replay across restarts:
- In-memory: Python.class _SSESession, Python.class _SessionStore
- SQLite: Python.class _SQLiteStore (enabled with PERSIST_SESSIONS=1)
- Resume:
- Uses SSE id "session_id:index" and Last-Event-ID header (or ?last_event_id=...).
- Auto-cancel on disconnect:
- If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds; 0 disables auto-cancel). Configurable via env.
- Cooperative stop via StoppingCriteria in Python.function infer_stream.
- Manual cancel:
- Python.function cancel_session to stop a session on demand.
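To illustrate the in-memory replay idea, here is a minimal ring buffer sketch. The class `ReplayBuffer` and its default capacity are hypothetical, chosen only for illustration; they do not describe how _SSESession or _SessionStore are actually implemented.

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical per-session ring buffer that keeps the most recent events."""

    def __init__(self, capacity: int = 1024):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._events = deque(maxlen=capacity)
        self._next_index = 0

    def append(self, payload: str) -> int:
        """Store a payload and return the index used in its SSE id."""
        index = self._next_index
        self._events.append((index, payload))
        self._next_index += 1
        return index

    def replay_after(self, last_index: int) -> list:
        """Return buffered (index, payload) pairs with index > last_index."""
        return [(i, p) for i, p in self._events if i > last_index]
```

A bounded buffer means a client that reconnects too late may find its earliest missed chunks already evicted, which is one reason optional SQLite persistence is useful.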
7) Logging and error handling
- Log key lifecycle stages (startup, model load, stream start/stop, resume).
- Redact sensitive fields (e.g., tokens, credentials).
- User errors → 400; model-not-ready → 503; unexpected failures → 500.
- Optionally add structured logging and request IDs in a follow-up.
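The redaction and status-code rules above might look like this in practice. This is a sketch under assumptions: the key list in `SENSITIVE_KEYS`, the error `type` strings, and both helper names are illustrative choices, not the project's actual values.

```python
# Assumption: an illustrative set of field names to mask before logging.
SENSITIVE_KEYS = {"authorization", "hf_token", "token", "api_key", "password"}

# Assumption: error type labels are illustrative; only the status codes come from the rules.
ERROR_TYPES = {
    400: "invalid_request_error",  # user errors
    503: "model_not_ready",        # model not loaded yet
    500: "internal_error",         # unexpected failures
}


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked for logging."""
    return {k: ("***" if k.lower() in SENSITIVE_KEYS else v) for k, v in fields.items()}


def error_body(status: int, message: str) -> dict:
    """Build a structured JSON error payload for the given HTTP status."""
    kind = ERROR_TYPES.get(status, "internal_error")
    return {"error": {"message": message, "type": kind, "code": status}}
```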
8) Architecture documentation
Keep ARCHITECTURE.md authoritative for:
- Startup flow and lazy model load
- Multimodal preprocessing (images/videos)
- Streaming, resume, persistence, and cancellation flows
- Error/timeout handling
- Extensibility (persistence strategies, cancellation hooks, scaling patterns)
Update when code paths or data flows change.
9) TODO hygiene
Track all planned work in TODO.md:
- Update statuses immediately when tasks start/complete.
- Add newly discovered tasks as soon as they are identified.
- Keep TODO focused, scoped, and prioritized.
10) Operational requirements and environment
Required:
- Python: >= 3.10
- pip
- PyTorch: install a wheel matching platform/CUDA (see requirements.txt notes)
Recommended:
- GPU with sufficient VRAM for the chosen model
- Windows 11 supported; Linux/macOS should also work
Environment variables (see .env.example):
- PORT=3000
- MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
- HF_TOKEN=
- MAX_TOKENS=256
- TEMPERATURE=0.7
- MAX_VIDEO_FRAMES=16
- DEVICE_MAP=auto
- TORCH_DTYPE=auto
- PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
- CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)
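The variables listed above could be gathered into one typed settings object at startup. This is a minimal sketch: the `Settings` dataclass and `load_settings` are hypothetical names, and defaulting PERSIST_SESSIONS to off is an assumption, since the rules only state that 1 enables it. SESSIONS_DB_PATH and SESSIONS_TTL_SECONDS are omitted because no defaults are given for them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    port: int
    model_repo_id: str
    hf_token: Optional[str]
    max_tokens: int
    temperature: float
    max_video_frames: int
    device_map: str
    torch_dtype: str
    persist_sessions: bool
    cancel_after_disconnect_seconds: int  # 0 disables auto-cancel


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        port=int(env.get("PORT", "3000")),
        model_repo_id=env.get("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking"),
        hf_token=env.get("HF_TOKEN") or None,  # an empty string means "not set"
        max_tokens=int(env.get("MAX_TOKENS", "256")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_video_frames=int(env.get("MAX_VIDEO_FRAMES", "16")),
        device_map=env.get("DEVICE_MAP", "auto"),
        torch_dtype=env.get("TORCH_DTYPE", "auto"),
        # Assumption: persistence is off unless explicitly set to "1".
        persist_sessions=env.get("PERSIST_SESSIONS", "0") == "1",
        cancel_after_disconnect_seconds=int(env.get("CANCEL_AFTER_DISCONNECT_SECONDS", "3600")),
    )
```

Accepting an explicit mapping keeps the loader easy to exercise in tests without mutating the process environment.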
11) File responsibilities overview
- Server: Python.main()
- API routing, model singleton, inference, streaming, resume, cancel
- Docs: README.md, ARCHITECTURE.md
- Dev log: CLAUDE.md
- Tasks: TODO.md
- Config template: .env.example
- Dependencies: requirements.txt
- Ignores: .gitignore
12) Workflow example (single iteration)
- Make a small, isolated change (e.g., enable SQLite persistence).
- Update docs:
- CLAUDE.md: what/why/how
- README.md: operator usage changes
- ARCHITECTURE.md: persistence/resume flow
- TODO.md: status changes
- Commit and push:
- git add .
- git commit -m "feat(stream): add SQLite persistence for SSE resume"
- git push
- Verify locally; record any issues or follow-ups in CLAUDE.md.
13) Compliance checklist (pre-merge / pre-push)
- Code runs locally (uvicorn main:app …).
- Docs updated (README.md, CLAUDE.md, ARCHITECTURE.md, TODO.md).
- No large artifacts added to git.
- Commit message follows conventional style.
- Endpoint contract honored (including streaming/resume semantics and cancel extension).