File size: 8,883 Bytes
7cd14d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
# Project Rules and Workflow (Python FastAPI + Transformers)

These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.

Files referenced below:
- [README.md](README.md)
- [ARCHITECTURE.md](ARCHITECTURE.md)
- [TODO.md](TODO.md)
- [CLAUDE.md](CLAUDE.md)
- [.env.example](.env.example)
- [.gitignore](.gitignore)
- [requirements.txt](requirements.txt)
- [Python.main()](main.py:1)

## 1) Documentation rules (must-do on every change)

Always update documentation when code or behavior changes.

Minimum documentation checklist:
- What changed and where (filenames, sections, or callable links like [Python.function chat_completions()](main.py:591)).
- Why the change was made (problem or requirement).
- How to operate or verify (commands, endpoints, examples).
- Follow-ups or known limitations.

Where to update:
- Operator-facing: [README.md](README.md)
- Developer-facing: [CLAUDE.md](CLAUDE.md) (rationale, alternatives, caveats)
- Architecture or flows: [ARCHITECTURE.md](ARCHITECTURE.md)
- Tasks and statuses: [TODO.md](TODO.md)

Never skip documentation. If a change is reverted, document the revert.

## 2) Git discipline (mandatory)

- Always use Git. Every change or progress step MUST be committed and pushed.
  - Windows CMD example:
    - git add .
    - git commit -m "type(scope): short description"
    - git push
- No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in [README.md](README.md) and [CLAUDE.md](CLAUDE.md), but commits are still required locally.
- Commit style:
  - Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
  - Keep commits small and atomic (one concern per commit).
  - Reference important files in the commit body, for example: updated [Python.function chat_completions()](main.py:591), [README.md](README.md).
- After updating code or docs, commit immediately. Do not batch unrelated changes.

## 2.1) Progress log (mandatory)

- Every commit MUST include a corresponding entry in [CLAUDE.md](CLAUDE.md) under a “Progress Log” section.
- Each entry must include:
  - Date/time (Asia/Jakarta)
  - Scope and short summary of the change
  - The final Git commit hash and commit message
  - Files and exact callable anchors touched (use clickable anchors), e.g. [Python.function chat_completions()](main.py:591), [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
  - Verification steps and results (curl examples, expected vs actual, notes)
- Required sequence:
  1) Make code changes
  2) Update docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md), and add a new progress log entry in [CLAUDE.md](CLAUDE.md)
  3) Run Git commands:
     - git add .
     - git commit -m "type(scope): short description"
     - git push
  4) Append the final commit hash to the [CLAUDE.md](CLAUDE.md) entry if it was not known at authoring time
- No code change may land without a synchronized progress log entry.

## 3) Large artifacts policy (.gitignore)

Never commit large/generated artifacts. Keep the repository lean and reproducible.

Must be ignored:
- models/ (downloaded by HF/Transformers cache or tools at runtime)
- .venv/, venv/
- __pycache__/
- .cache/
- uploads/, data/, tmp/

See [.gitignore](.gitignore) and extend as needed for new generated outputs. If you add ignores, document the rationale in [CLAUDE.md](CLAUDE.md).

## 4) Model policy (Hugging Face / Transformers)

Target default model:
- Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).

Rules:
- Use Hugging Face Transformers (AutoModelForCausalLM + AutoProcessor) with trust_remote_code=True.
- Do not commit model weights or caches. Let from_pretrained() download to local caches.
- Handle authentication for gated models via HF_TOKEN in [.env.example](.env.example).
- The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
- Keep configuration via environment variables (see [Python.os.getenv()](main.py:67)).

## 5) API contract

Provide an OpenAI-compatible endpoint:
- POST /v1/chat/completions

Minimum behavior:
- Accept model and messages per OpenAI schema (we honor messages; model is informational since server is pinned via env).
- Non-streaming JSON response.
- Streaming SSE response when body.stream=true:
  - Emit OpenAI-style chat.completion.chunk deltas.
  - Include SSE id lines "session_id:index" to support resume via Last-Event-ID.

Resume semantics:
- Client provides a session_id (or server generates one).
- Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
- Session data can be persisted (SQLite) if enabled.

Manual cancel (custom extension):
- POST /v1/cancel/{session_id} cancels a streaming generation.
- Note: Not part of legacy OpenAI Chat Completions spec. It mirrors the spirit of the newer OpenAI Responses API cancel endpoint.

All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.

## 6) Streaming, persistence, and cancellation

- Streaming is implemented via SSE in [Python.function chat_completions()](main.py:591) with token iteration in [Python.function infer_stream](main.py:375).
- In-memory ring buffer per session and optional SQLite persistence for replay across restarts:
  - In-memory: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
  - SQLite: [Python.class _SQLiteStore](main.py:482) (enabled with PERSIST_SESSIONS=1)
- Resume:
  - Uses SSE id "session_id:index" and Last-Event-ID header (or ?last_event_id=...).
- Auto-cancel on disconnect:
  - If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 sec). Configurable via env.
  - Cooperative stop via StoppingCriteria in [Python.function infer_stream](main.py:375).
- Manual cancel:
  - [Python.function cancel_session](main.py:792) to stop a session on demand.

## 7) Logging and error handling

- Log key lifecycle stages (startup, model load, stream start/stop, resume).
- Redact sensitive fields (e.g., tokens, credentials).
- User errors → 400; model-not-ready → 503; unexpected failures → 500.
- Optionally add structured logging and request IDs in a follow-up.

## 8) Architecture documentation

Keep [ARCHITECTURE.md](ARCHITECTURE.md) authoritative for:
- Startup flow and lazy model load
- Multimodal preprocessing (images/videos)
- Streaming, resume, persistence, and cancellation flows
- Error/timeout handling
- Extensibility (persistence strategies, cancellation hooks, scaling patterns)

Update when code paths or data flows change.

## 9) TODO hygiene

Track all planned work in [TODO.md](TODO.md):
- Update statuses immediately when tasks start/complete.
- Add newly discovered tasks as soon as they are identified.
- Keep TODO focused, scoped, and prioritized.

## 10) Operational requirements and environment

Required:
- Python: >= 3.10
- pip
- PyTorch: install a wheel matching platform/CUDA (see [requirements.txt](requirements.txt) notes)

Recommended:
- GPU with sufficient VRAM for the chosen model
- Windows 11 supported; Linux/macOS should also work

Environment variables (see [.env.example](.env.example)):
- PORT=3000
- MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
- HF_TOKEN=
- MAX_TOKENS=256
- TEMPERATURE=0.7
- MAX_VIDEO_FRAMES=16
- DEVICE_MAP=auto
- TORCH_DTYPE=auto
- PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
- CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)

## 11) File responsibilities overview

- Server: [Python.main()](main.py:1)
  - API routing, model singleton, inference, streaming, resume, cancel
- Docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md)
- Dev log: [CLAUDE.md](CLAUDE.md)
- Tasks: [TODO.md](TODO.md)
- Config template: [.env.example](.env.example)
- Dependencies: [requirements.txt](requirements.txt)
- Ignores: [.gitignore](.gitignore)

## 12) Workflow example (single iteration)

1) Make a small, isolated change (e.g., enable SQLite persistence).
2) Update docs:
   - [CLAUDE.md](CLAUDE.md): what/why/how
   - [README.md](README.md): operator usage changes
   - [ARCHITECTURE.md](ARCHITECTURE.md): persistence/resume flow
   - [TODO.md](TODO.md): status changes
3) Commit and push:
   - git add .
   - git commit -m "feat(stream): add SQLite persistence for SSE resume"
   - git push
4) Verify locally; record any issues or follow-ups in [CLAUDE.md](CLAUDE.md).

## 13) Compliance checklist (pre-merge / pre-push)

- Code runs locally (uvicorn main:app …).
- Docs updated ([README.md](README.md), [CLAUDE.md](CLAUDE.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md)).
- No large artifacts added to git.
- Commit message follows conventional style.
- Endpoint contract honored (including streaming/resume semantics and cancel extension).