# Project Rules and Workflow (Python FastAPI + Transformers)
These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.
Files referenced below:
- [README.md](README.md)
- [ARCHITECTURE.md](ARCHITECTURE.md)
- [TODO.md](TODO.md)
- [CLAUDE.md](CLAUDE.md)
- [.env.example](.env.example)
- [.gitignore](.gitignore)
- [requirements.txt](requirements.txt)
- [Python.main()](main.py:1)
## 1) Documentation rules (must-do on every change)
Always update documentation when code or behavior changes.
Minimum documentation checklist:
- What changed and where (filenames, sections, or callable links like [Python.function chat_completions()](main.py:591)).
- Why the change was made (problem or requirement).
- How to operate or verify (commands, endpoints, examples).
- Follow-ups or known limitations.
Where to update:
- Operator-facing: [README.md](README.md)
- Developer-facing: [CLAUDE.md](CLAUDE.md) (rationale, alternatives, caveats)
- Architecture or flows: [ARCHITECTURE.md](ARCHITECTURE.md)
- Tasks and statuses: [TODO.md](TODO.md)
Never skip documentation. If a change is reverted, document the revert.
## 2) Git discipline (mandatory)
- Always use Git. Every change or progress step MUST be committed and pushed.
- Windows CMD example:
  - git add .
  - git commit -m "type(scope): short description"
  - git push
- No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in [README.md](README.md) and [CLAUDE.md](CLAUDE.md), but commits are still required locally.
- Commit style:
  - Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
  - Keep commits small and atomic (one concern per commit).
  - Reference important files in the commit body, for example: updated [Python.function chat_completions()](main.py:591), [README.md](README.md).
- After updating code or docs, commit immediately. Do not batch unrelated changes.
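The commit-style rule above can be checked mechanically. The following is a minimal sketch, assuming the subject line has the shape `type(scope): short description` with an optional scope; the function name `check_commit_subject` is hypothetical and not part of the project.

```python
import re

# Types listed in the commit-style rule above.
ALLOWED_TYPES = ("chore", "docs", "feat", "fix", "refactor", "perf", "test", "build", "ci")

# Matches "type(scope): short description"; the scope is treated as optional here (assumption).
_SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?: (?P<description>\S.*)$"
)


def check_commit_subject(subject: str) -> bool:
    """Return True if the first line of a commit message follows the convention."""
    match = _SUBJECT_RE.match(subject.strip())
    return bool(match) and match.group("type") in ALLOWED_TYPES
```

A check like this could run from a local commit-msg hook so that malformed subjects are rejected before they are pushed.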
## 2.1) Progress log (mandatory)
- Every commit MUST include a corresponding entry in [CLAUDE.md](CLAUDE.md) under a “Progress Log” section.
- Each entry must include:
  - Date/time (Asia/Jakarta)
  - Scope and short summary of the change
  - The final Git commit hash and commit message
  - Files and exact callable anchors touched (use clickable anchors), e.g. [Python.function chat_completions()](main.py:591), [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
  - Verification steps and results (curl examples, expected vs actual, notes)
- Required sequence:
  1) Make code changes
  2) Update docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md), and add a new progress log entry in [CLAUDE.md](CLAUDE.md)
  3) Run Git commands:
     - git add .
     - git commit -m "type(scope): short description"
     - git push
  4) Append the final commit hash to the [CLAUDE.md](CLAUDE.md) entry if it was not known at authoring time
- No code change may land without a synchronized progress log entry.
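A progress log entry with the required fields can be generated rather than typed by hand. This is a sketch only: the entry layout and the function name `format_progress_entry` are assumptions, since the rules above list the required fields but not a format.

```python
from datetime import datetime, timedelta, timezone

# Asia/Jakarta (WIB) is UTC+7 with no daylight saving time, so a fixed offset
# is used here instead of a timezone database lookup.
JAKARTA = timezone(timedelta(hours=7), name="WIB")


def format_progress_entry(when, scope, summary, commit_hash, commit_message, files, verification):
    """Render one Markdown progress log entry (hypothetical layout)."""
    stamp = when.astimezone(JAKARTA).strftime("%Y-%m-%d %H:%M %Z")
    lines = [
        f"- {stamp} | {scope}: {summary}",
        f"  - Commit: {commit_hash} {commit_message}",
        "  - Files:",
    ]
    lines += [f"    - {path}" for path in files]
    lines.append(f"  - Verification: {verification}")
    return "\n".join(lines)
```

Passing a timezone-aware `datetime` (for example `datetime.now(timezone.utc)`) keeps the conversion to Jakarta time unambiguous.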
## 3) Large artifacts policy (.gitignore)
Never commit large/generated artifacts. Keep the repository lean and reproducible.
Must be ignored:
- models/ (downloaded by HF/Transformers cache or tools at runtime)
- .venv/, venv/
- __pycache__/
- .cache/
- uploads/, data/, tmp/
See [.gitignore](.gitignore) and extend as needed for new generated outputs. If you add ignores, document the rationale in [CLAUDE.md](CLAUDE.md).
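As a safety net next to .gitignore, staged paths can be screened before a commit. The sketch below assumes the directory names from the list above; `find_forbidden_paths` is a hypothetical helper, not something the project ships.

```python
from pathlib import PurePosixPath

# Directory names taken from the "Must be ignored" list above.
IGNORED_DIRS = {"models", ".venv", "venv", "__pycache__", ".cache", "uploads", "data", "tmp"}


def find_forbidden_paths(paths):
    """Return the paths that sit inside an ignored directory at any depth."""
    forbidden = []
    for raw in paths:
        # Normalize Windows separators so both path styles are handled.
        parts = PurePosixPath(raw.replace("\\", "/")).parts
        # parts[:-1] excludes the file name itself, so a file named "data" is allowed.
        if any(part in IGNORED_DIRS for part in parts[:-1]):
            forbidden.append(raw)
    return forbidden
```

The input could be the output of `git diff --cached --name-only`, split into lines.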
## 4) Model policy (Hugging Face / Transformers)
Target default model:
- Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).
Rules:
- Use Hugging Face Transformers (AutoModelForCausalLM + AutoProcessor) with trust_remote_code=True.
- Do not commit model weights or caches. Let from_pretrained() download to local caches.
- Handle authentication for gated models via HF_TOKEN in [.env.example](.env.example).
- The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
- Keep configuration via environment variables (see [Python.os.getenv()](main.py:67)).
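The rules above translate roughly into the loading sketch below. It is not the code in main.py: the helper names are hypothetical, and whether AutoModelForCausalLM resolves for a given checkpoint, and whether the dtype argument is spelled `torch_dtype`, depends on the installed Transformers version.

```python
import os


def build_load_kwargs(env=os.environ):
    """Collect from_pretrained() keyword arguments from environment variables."""
    kwargs = {
        "trust_remote_code": True,
        "device_map": env.get("DEVICE_MAP", "auto"),
        "torch_dtype": env.get("TORCH_DTYPE", "auto"),
    }
    token = env.get("HF_TOKEN", "").strip()
    if token:
        # Only pass a token when one is configured (needed for gated models).
        kwargs["token"] = token
    return kwargs


def load_model_and_processor(env=os.environ):
    """Download (or reuse from the local cache) the model and processor."""
    # Imported lazily so that importing this module does not require transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    repo_id = env.get("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
    kwargs = build_load_kwargs(env)
    # The processor does not take device or dtype placement arguments.
    processor_kwargs = {k: v for k, v in kwargs.items() if k in ("trust_remote_code", "token")}
    processor = AutoProcessor.from_pretrained(repo_id, **processor_kwargs)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
    return model, processor
```

Keeping the keyword construction separate from the download makes the configuration testable without fetching weights.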
## 5) API contract
Provide an OpenAI-compatible endpoint:
- POST /v1/chat/completions
Minimum behavior:
- Accept model and messages per the OpenAI schema. The server honors messages; model is informational because the served model is pinned via environment variables.
- Non-streaming JSON response.
- Streaming SSE response when body.stream=true:
  - Emit OpenAI-style chat.completion.chunk deltas.
  - Include SSE id lines "session_id:index" to support resume via Last-Event-ID.
Resume semantics:
- Client provides a session_id (or server generates one).
- Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
- Session data can be persisted (SQLite) if enabled.
Manual cancel (custom extension):
- POST /v1/cancel/{session_id} cancels a streaming generation.
- Note: This endpoint is not part of the OpenAI Chat Completions spec. It mirrors the spirit of the cancel endpoint in the newer OpenAI Responses API.
All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.
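The SSE id convention described above can be illustrated as follows. This is a sketch under stated assumptions: the payload is a reduced chat.completion.chunk (fields such as `created` are omitted), and both function names are hypothetical rather than taken from main.py.

```python
import json


def format_sse_chunk(session_id, index, delta_text, model):
    """Render one SSE event carrying an OpenAI-style chat.completion.chunk."""
    payload = {
        "id": session_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": delta_text}, "finish_reason": None}],
    }
    # The "id:" line is what the client echoes back in Last-Event-ID on reconnect.
    return f"id: {session_id}:{index}\ndata: {json.dumps(payload)}\n\n"


def parse_last_event_id(value):
    """Split "session_id:index" into (session_id, index); return None if malformed."""
    # rpartition splits on the last colon, so session ids may themselves contain colons.
    session_id, sep, index = (value or "").rpartition(":")
    if not sep or not session_id or not index.isdecimal():
        return None
    return session_id, int(index)
```

On reconnect, the parsed index tells the server which buffered chunks the client has already seen; everything after it is replayed.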
## 6) Streaming, persistence, and cancellation
- Streaming is implemented via SSE in [Python.function chat_completions()](main.py:591) with token iteration in [Python.function infer_stream](main.py:375).
- Each session keeps an in-memory ring buffer, with optional SQLite persistence for replay across restarts:
  - In-memory: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
  - SQLite: [Python.class _SQLiteStore](main.py:482) (enabled with PERSIST_SESSIONS=1)
- Resume:
  - Uses SSE id "session_id:index" and the Last-Event-ID header (or ?last_event_id=...).
- Auto-cancel on disconnect:
  - If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default: 3600 seconds; set to 0 to disable). Configurable via env.
  - Cooperative stop via StoppingCriteria in [Python.function infer_stream](main.py:375).
- Manual cancel:
  - POST /v1/cancel/{session_id}, handled by [Python.function cancel_session](main.py:792), stops a session on demand.
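The buffering, replay, and cancellation ideas in this section can be sketched in a few lines. The class below is illustrative and is not the project's _SSESession; the class name, the default capacity, and the use of a `threading.Event` as the cancel flag are all assumptions.

```python
import threading
from collections import deque


class RingSession:
    """Hypothetical per-session buffer with replay and a cooperative cancel flag."""

    def __init__(self, session_id, capacity=1024):
        self.session_id = session_id
        # Holds (index, text) pairs; once full, the oldest entries are dropped first.
        self.chunks = deque(maxlen=capacity)
        self.next_index = 0
        # A generation loop would poll this flag and stop when it is set.
        self.cancelled = threading.Event()
        self._lock = threading.Lock()

    def append(self, text):
        """Store a chunk and return the index used in its SSE id."""
        with self._lock:
            index = self.next_index
            self.chunks.append((index, text))
            self.next_index += 1
            return index

    def replay_after(self, last_index):
        """Return buffered chunks whose index is greater than last_index."""
        with self._lock:
            return [(i, t) for i, t in self.chunks if i > last_index]

    def cancel(self):
        self.cancelled.set()
```

Because the buffer is bounded, a client that reconnects too late receives only the chunks still held; persisting chunks to SQLite is what covers longer gaps and restarts.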
## 7) Logging and error handling
- Log key lifecycle stages (startup, model load, stream start/stop, resume).
- Redact sensitive fields (e.g., tokens, credentials).
- User errors → 400; model-not-ready → 503; unexpected failures → 500.
- Optionally add structured logging and request IDs in a follow-up.
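The status mapping and redaction rules above might look like the sketch below. The exception classes, the error body layout, and the key-matching pattern are assumptions chosen for illustration, not the project's actual types.

```python
import re


class UserInputError(Exception):
    """Hypothetical exception for invalid client input."""


class ModelNotReadyError(Exception):
    """Hypothetical exception raised while the model is still loading."""


def status_for(exc):
    """Map an exception to the HTTP status codes listed above."""
    if isinstance(exc, UserInputError):
        return 400
    if isinstance(exc, ModelNotReadyError):
        return 503
    return 500


def error_body(exc):
    """Return (status, structured JSON-serializable error body)."""
    status = status_for(exc)
    # Unexpected failures get a generic message so internals are not leaked.
    message = str(exc) if status != 500 else "Internal server error"
    return status, {"error": {"message": message, "type": type(exc).__name__, "code": status}}


# Keys ending in token/secret/password/api_key, or containing "authorization".
_SENSITIVE_KEY = re.compile(r"(?:token|secret|password|api[_-]?key)$|authorization", re.IGNORECASE)


def redact(fields):
    """Replace values of sensitive-looking keys before logging."""
    return {k: ("***" if _SENSITIVE_KEY.search(k) else v) for k, v in fields.items()}
```

Anchoring the pattern at the end of the key keeps settings such as MAX_TOKENS visible in logs while still masking HF_TOKEN.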
## 8) Architecture documentation
Keep [ARCHITECTURE.md](ARCHITECTURE.md) authoritative for:
- Startup flow and lazy model load
- Multimodal preprocessing (images/videos)
- Streaming, resume, persistence, and cancellation flows
- Error/timeout handling
- Extensibility (persistence strategies, cancellation hooks, scaling patterns)
Update when code paths or data flows change.
## 9) TODO hygiene
Track all planned work in [TODO.md](TODO.md):
- Update statuses immediately when tasks start/complete.
- Add newly discovered tasks as soon as they are identified.
- Keep TODO focused, scoped, and prioritized.
## 10) Operational requirements and environment
Required:
- Python: >= 3.10
- pip
- PyTorch: install a wheel matching platform/CUDA (see [requirements.txt](requirements.txt) notes)
Recommended:
- GPU with sufficient VRAM for the chosen model
- Windows 11 supported; Linux/macOS should also work
Environment variables (see [.env.example](.env.example)):
- PORT=3000
- MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
- HF_TOKEN=
- MAX_TOKENS=256
- TEMPERATURE=0.7
- MAX_VIDEO_FRAMES=16
- DEVICE_MAP=auto
- TORCH_DTYPE=auto
- PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
- CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)
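Reading these variables into one typed object keeps parsing in a single place. The sketch below uses the defaults listed above; the `Settings` class and `load_settings` function are hypothetical, PERSIST_SESSIONS is assumed to default to off, and SESSIONS_DB_PATH and SESSIONS_TTL_SECONDS are left out because no defaults are stated for them.

```python
import os
from dataclasses import dataclass


def _flag(value):
    """Interpret common truthy strings such as "1" or "true"."""
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    port: int
    model_repo_id: str
    hf_token: str
    max_tokens: int
    temperature: float
    max_video_frames: int
    device_map: str
    torch_dtype: str
    persist_sessions: bool
    cancel_after_disconnect_seconds: int  # 0 disables auto-cancel


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        port=int(env.get("PORT", "3000")),
        model_repo_id=env.get("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking"),
        hf_token=env.get("HF_TOKEN", ""),
        max_tokens=int(env.get("MAX_TOKENS", "256")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_video_frames=int(env.get("MAX_VIDEO_FRAMES", "16")),
        device_map=env.get("DEVICE_MAP", "auto"),
        torch_dtype=env.get("TORCH_DTYPE", "auto"),
        # Default of "0" is an assumption; the list above only says 1|0.
        persist_sessions=_flag(env.get("PERSIST_SESSIONS", "0")),
        cancel_after_disconnect_seconds=int(env.get("CANCEL_AFTER_DISCONNECT_SECONDS", "3600")),
    )
```

Accepting the mapping as a parameter makes it straightforward to test configuration parsing without touching the real process environment.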
## 11) File responsibilities overview
- Server: [Python.main()](main.py:1)
  - API routing, model singleton, inference, streaming, resume, cancel
- Docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md)
- Dev log: [CLAUDE.md](CLAUDE.md)
- Tasks: [TODO.md](TODO.md)
- Config template: [.env.example](.env.example)
- Dependencies: [requirements.txt](requirements.txt)
- Ignores: [.gitignore](.gitignore)
## 12) Workflow example (single iteration)
1) Make a small, isolated change (e.g., enable SQLite persistence).
2) Update docs:
   - [CLAUDE.md](CLAUDE.md): what/why/how
   - [README.md](README.md): operator usage changes
   - [ARCHITECTURE.md](ARCHITECTURE.md): persistence/resume flow
   - [TODO.md](TODO.md): status changes
3) Commit and push:
   - git add .
   - git commit -m "feat(stream): add SQLite persistence for SSE resume"
   - git push
4) Verify locally; record any issues or follow-ups in [CLAUDE.md](CLAUDE.md).
## 13) Compliance checklist (pre-merge / pre-push)
- Code runs locally (uvicorn main:app …).
- Docs updated ([README.md](README.md), [CLAUDE.md](CLAUDE.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md)).
- No large artifacts added to git.
- Commit message follows conventional style.
- Endpoint contract honored (including streaming/resume semantics and cancel extension).