	# Project Rules and Workflow (Python FastAPI + Transformers)

	These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.

	Files referenced below:
	- [README.md](README.md)
	- [ARCHITECTURE.md](ARCHITECTURE.md)
	- [TODO.md](TODO.md)
	- [CLAUDE.md](CLAUDE.md)
	- [.env.example](.env.example)
	- [.gitignore](.gitignore)
	- [requirements.txt](requirements.txt)
	- [Python.main()](main.py:1)

	## 1) Documentation rules (must-do on every change)

	Always update documentation when code or behavior changes.

	Minimum documentation checklist:
	- What changed and where (filenames, sections, or callable links like [Python.function chat_completions()](main.py:591)).
	- Why the change was made (problem or requirement).
	- How to operate or verify (commands, endpoints, examples).
	- Follow-ups or known limitations.

	Where to update:
	- Operator-facing: [README.md](README.md)
	- Developer-facing: [CLAUDE.md](CLAUDE.md) (rationale, alternatives, caveats)
	- Architecture or flows: [ARCHITECTURE.md](ARCHITECTURE.md)
	- Tasks and statuses: [TODO.md](TODO.md)

	Never skip documentation. If a change is reverted, document the revert.

	## 2) Git discipline (mandatory)

	- Always use Git. Every change or progress step MUST be committed and pushed.
	- Windows CMD example:
	  - git add .
	  - git commit -m "type(scope): short description"
	  - git push
	- No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in [README.md](README.md) and [CLAUDE.md](CLAUDE.md), but commits are still required locally.
	- Commit style:
	  - Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
	  - Keep commits small and atomic (one concern per commit).
	  - Reference important files in the commit body, for example: updated [Python.function chat_completions()](main.py:591), [README.md](README.md).
	- After updating code or docs, commit immediately. Do not batch unrelated changes.
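
The commit subject format above can be checked mechanically before committing. The following is a minimal sketch, not part of the repository: the helper name `is_valid_subject` and the regular expression are hypothetical, and it assumes the scope in `type(scope)` is mandatory, as in the example command.

```python
import re

# Types listed in the commit-style rule above.
CONVENTIONAL_TYPES = ("chore", "docs", "feat", "fix", "refactor", "perf", "test", "build", "ci")

# Hypothetical pattern for a "type(scope): short description" subject line.
# Assumption: the scope is required and contains no spaces or parentheses.
_SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(CONVENTIONAL_TYPES) + r")"
    r"\((?P<scope>[^()\s]+)\)"
    r": (?P<description>\S.*)$"
)


def is_valid_subject(subject: str) -> bool:
    """Return True if the subject looks like 'type(scope): short description'."""
    return _SUBJECT_RE.match(subject) is not None
```

A check like this could run from a local commit-msg hook, although nothing in these rules requires one.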

	## 2.1) Progress log (mandatory)

	- Every commit MUST include a corresponding entry in [CLAUDE.md](CLAUDE.md) under a “Progress Log” section.
	- Each entry must include:
	  - Date/time (Asia/Jakarta)
	  - Scope and short summary of the change
	  - The final Git commit hash and commit message
	  - Files and exact callable anchors touched (use clickable anchors), e.g. [Python.function chat_completions()](main.py:591), [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
	  - Verification steps and results (curl examples, expected vs actual, notes)
	- Required sequence:
	  1) Make code changes
	  2) Update docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md), and add a new progress log entry in [CLAUDE.md](CLAUDE.md)
	  3) Run Git commands:
	     - git add .
	     - git commit -m "type(scope): short description"
	     - git push
	  4) Append the final commit hash to the [CLAUDE.md](CLAUDE.md) entry if it was not known at authoring time
	- No code change may land without a synchronized progress log entry.
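
A progress log entry with the required fields could be generated as in the sketch below. The Markdown layout and the name `format_progress_entry` are assumptions made for illustration; the rules do not prescribe an exact entry format. The fixed UTC+7 offset is used because Asia/Jakarta (WIB) does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Asia/Jakarta (WIB) is UTC+7 with no daylight saving, so a fixed offset suffices.
WIB = timezone(timedelta(hours=7), "WIB")


def format_progress_entry(
    when: datetime,
    scope: str,
    summary: str,
    files: list[str],
    verification: str,
    commit_message: str,
    commit_hash: str = "pending",
) -> str:
    """Render one progress log entry (hypothetical layout).

    `when` must be timezone-aware; it is converted to Asia/Jakarta time.
    `commit_hash` defaults to "pending" and is filled in after the commit (step 4).
    """
    stamp = when.astimezone(WIB).strftime("%Y-%m-%d %H:%M %Z")
    lines = [
        f"### {stamp} - {scope}",
        f"- Summary: {summary}",
        f"- Commit: {commit_hash} {commit_message}",
        "- Files: " + ", ".join(files),
        f"- Verification: {verification}",
    ]
    return "\n".join(lines)
```

The "pending" default reflects step 4 of the required sequence: the hash is only known after the commit exists, so the entry is amended afterwards.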

	## 3) Large artifacts policy (.gitignore)

	Never commit large/generated artifacts. Keep the repository lean and reproducible.

	Must be ignored:
	- models/ (downloaded by HF/Transformers cache or tools at runtime)
	- .venv/, venv/
	- __pycache__/
	- .cache/
	- uploads/, data/, tmp/

	See [.gitignore](.gitignore) and extend as needed for new generated outputs. If you add ignores, document the rationale in [CLAUDE.md](CLAUDE.md).
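
One way to confirm the required entries are present is a small check over the .gitignore text. This is a sketch under a simplifying assumption: it only looks for exact, literal lines and does not evaluate gitignore pattern semantics (for example, `/models` or `**/tmp/` would not be recognized). The function name `missing_ignores` is hypothetical.

```python
# Entries taken from the "Must be ignored" list above.
REQUIRED_IGNORES = [
    "models/", ".venv/", "venv/", "__pycache__/", ".cache/",
    "uploads/", "data/", "tmp/",
]


def missing_ignores(gitignore_text: str) -> list[str]:
    """Return required entries that do not appear as literal lines in the text."""
    present = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")  # skip blanks and comments
    }
    return [entry for entry in REQUIRED_IGNORES if entry not in present]
```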

	## 4) Model policy (Hugging Face / Transformers)

	Target default model:
	- Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).

	Rules:
	- Use Hugging Face Transformers (AutoModelForCausalLM + AutoProcessor) with trust_remote_code=True.
	- Do not commit model weights or caches. Let from_pretrained() download to local caches.
	- Handle authentication for gated models via HF_TOKEN in [.env.example](.env.example).
	- The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
	- Keep configuration via environment variables (see [Python.os.getenv()](main.py:67)).

	## 5) API contract

	Provide an OpenAI-compatible endpoint:
	- POST /v1/chat/completions

	Minimum behavior:
	- Accept model and messages per the OpenAI schema. The server honors messages; model is informational because the loaded model is pinned via env (MODEL_REPO_ID).
	- Non-streaming JSON response.
	- Streaming SSE response when body.stream=true:
	  - Emit OpenAI-style chat.completion.chunk deltas.
	  - Include SSE id lines "session_id:index" to support resume via Last-Event-ID.
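
A single streamed frame combining both requirements might look like the sketch below. The function name `sse_chunk`, the `chatcmpl-` id prefix, and the `created` parameter are illustrative assumptions; the actual framing lives in main.py and may differ.

```python
import json
import time


def sse_chunk(session_id, index, delta_text, model, created=None, finish_reason=None):
    """Build one SSE frame carrying an OpenAI-style chat.completion.chunk.

    The SSE id line is "session_id:index" so a client can resume via Last-Event-ID.
    Assumption: session_id is used as-is; the "chatcmpl-" prefix is hypothetical.
    """
    payload = {
        "id": f"chatcmpl-{session_id}",
        "object": "chat.completion.chunk",
        "created": int(time.time()) if created is None else created,
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": delta_text}, "finish_reason": finish_reason}
        ],
    }
    # json.dumps escapes newlines, so the data field always stays on one line.
    return f"id: {session_id}:{index}\ndata: {json.dumps(payload)}\n\n"
```

Each frame ends with a blank line, which is what marks the end of an event in the SSE wire format.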

	Resume semantics:
	- Client provides a session_id (or server generates one).
	- Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
	- Session data can be persisted (SQLite) if enabled.
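
On reconnect, the server has to split the Last-Event-ID value back into its two parts. The helper below is a minimal sketch with a hypothetical name; it splits on the last colon so that a session id containing colons still parses, and it rejects anything whose index is not a plain non-negative integer.

```python
def parse_last_event_id(value):
    """Split "session_id:index" into (session_id, index), or return None if malformed."""
    if not value:
        return None
    # rpartition splits on the LAST colon, so "a:b:7" yields session "a:b", index 7.
    session_id, sep, index_text = value.rpartition(":")
    if not sep or not session_id:
        return None
    if not (index_text.isascii() and index_text.isdigit()):
        return None
    return session_id, int(index_text)
```

A request with a missing or malformed header would then be handled as a fresh stream rather than a resume; that fallback is a design assumption, not something the contract above specifies.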

	Manual cancel (custom extension):
	- POST /v1/cancel/{session_id} cancels a streaming generation.
	- Note: This endpoint is not part of the OpenAI Chat Completions spec. It mirrors the spirit of the cancel endpoint in the newer OpenAI Responses API.

	All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.

	## 6) Streaming, persistence, and cancellation

	- Streaming is implemented via SSE in [Python.function chat_completions()](main.py:591) with token iteration in [Python.function infer_stream](main.py:375).
	- In-memory ring buffer per session and optional SQLite persistence for replay across restarts:
	  - In-memory: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
	  - SQLite: [Python.class _SQLiteStore](main.py:482) (enabled with PERSIST_SESSIONS=1)
	- Resume:
	  - Uses SSE id "session_id:index" and Last-Event-ID header (or ?last_event_id=...).
	- Auto-cancel on disconnect:
	  - If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 sec). Configurable via env.
	  - Cooperative stop via StoppingCriteria in [Python.function infer_stream](main.py:375).
	- Manual cancel:
	  - [Python.function cancel_session](main.py:792) to stop a session on demand.
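
The ring buffer, replay, and cancellation ideas above can be illustrated without the model. The sketch below is not the code in main.py: `SessionBuffer`, `pump`, and `should_auto_cancel` are hypothetical names, chunk indices are assumed to start at 0, and a `threading.Event` stands in for the flag that a StoppingCriteria would poll.

```python
import threading
from collections import deque


class SessionBuffer:
    """Hypothetical per-session ring buffer with a cooperative cancel flag."""

    def __init__(self, capacity: int = 1024):
        self._chunks = deque(maxlen=capacity)  # (index, text); oldest entries are dropped
        self._next_index = 0
        self._lock = threading.Lock()
        self.cancelled = threading.Event()

    def append(self, text: str) -> int:
        """Store a chunk and return the index used in its SSE id."""
        with self._lock:
            index = self._next_index
            self._chunks.append((index, text))
            self._next_index += 1
            return index

    def replay_after(self, last_index: int):
        """Return buffered chunks with an index greater than last_index."""
        with self._lock:
            return [(i, t) for i, t in self._chunks if i > last_index]


def pump(tokens, buffer: SessionBuffer) -> None:
    """Copy tokens into the buffer, stopping early once cancellation is requested."""
    for token in tokens:
        if buffer.cancelled.is_set():
            break
        buffer.append(token)


def should_auto_cancel(disconnected_at: float, now: float, timeout_seconds: int) -> bool:
    """True once all clients have been gone for timeout_seconds; 0 disables the check."""
    if timeout_seconds <= 0:
        return False
    return now - disconnected_at >= timeout_seconds
```

Because the buffer is bounded, a client that reconnects with a very old index only receives the chunks still held in memory. Replaying older chunks would need the SQLite store.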

	## 7) Logging and error handling

	- Log key lifecycle stages (startup, model load, stream start/stop, resume).
	- Redact sensitive fields (e.g., tokens, credentials).
	- User errors → 400; model-not-ready → 503; unexpected failures → 500.
	- Optionally add structured logging and request IDs in a follow-up.
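
The status mapping above could be centralized in one helper, sketched below. The exception classes, the `to_error_response` name, and the JSON body shape are assumptions for illustration. The body loosely follows the OpenAI-style `{"error": {...}}` envelope, and the `type` strings are made up.

```python
class UserInputError(Exception):
    """Hypothetical marker for invalid client input."""


class ModelNotReadyError(Exception):
    """Hypothetical marker for requests that arrive before the model is loaded."""


def to_error_response(exc: Exception) -> tuple[int, dict]:
    """Map an exception to (HTTP status, structured JSON error body)."""
    if isinstance(exc, UserInputError):
        status, kind = 400, "invalid_request_error"
    elif isinstance(exc, ModelNotReadyError):
        status, kind = 503, "model_not_ready"
    else:
        status, kind = 500, "internal_error"
    # Unexpected failures get a generic message so internals and secrets are not echoed.
    message = str(exc) if status != 500 else "Internal server error"
    return status, {"error": {"message": message, "type": kind, "code": status}}
```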

	## 8) Architecture documentation

	Keep [ARCHITECTURE.md](ARCHITECTURE.md) authoritative for:
	- Startup flow and lazy model load
	- Multimodal preprocessing (images/videos)
	- Streaming, resume, persistence, and cancellation flows
	- Error/timeout handling
	- Extensibility (persistence strategies, cancellation hooks, scaling patterns)

	Update when code paths or data flows change.

	## 9) TODO hygiene

	Track all planned work in [TODO.md](TODO.md):
	- Update statuses immediately when tasks start/complete.
	- Add newly discovered tasks as soon as they are identified.
	- Keep TODO focused, scoped, and prioritized.

	## 10) Operational requirements and environment

	Required:
	- Python: >= 3.10
	- pip
	- PyTorch: install a wheel matching platform/CUDA (see [requirements.txt](requirements.txt) notes)

	Recommended:
	- GPU with sufficient VRAM for the chosen model
	- Windows 11 supported; Linux/macOS should also work

	Environment variables (see [.env.example](.env.example)):
	- PORT=3000
	- MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
	- HF_TOKEN=
	- MAX_TOKENS=256
	- TEMPERATURE=0.7
	- MAX_VIDEO_FRAMES=16
	- DEVICE_MAP=auto
	- TORCH_DTYPE=auto
	- PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
	- CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)

	## 11) File responsibilities overview

	- Server: [Python.main()](main.py:1)
	  - API routing, model singleton, inference, streaming, resume, cancel
	- Docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md)
	- Dev log: [CLAUDE.md](CLAUDE.md)
	- Tasks: [TODO.md](TODO.md)
	- Config template: [.env.example](.env.example)
	- Dependencies: [requirements.txt](requirements.txt)
	- Ignores: [.gitignore](.gitignore)

	## 12) Workflow example (single iteration)

	1) Make a small, isolated change (e.g., enable SQLite persistence).
	2) Update docs:
	   - [CLAUDE.md](CLAUDE.md): what/why/how
	   - [README.md](README.md): operator usage changes
	   - [ARCHITECTURE.md](ARCHITECTURE.md): persistence/resume flow
	   - [TODO.md](TODO.md): status changes
	3) Commit and push:
	   - git add .
	   - git commit -m "feat(stream): add SQLite persistence for SSE resume"
	   - git push
	4) Verify locally; record any issues or follow-ups in [CLAUDE.md](CLAUDE.md).

	## 13) Compliance checklist (pre-merge / pre-push)

	- Code runs locally (uvicorn main:app …).
	- Docs updated ([README.md](README.md), [CLAUDE.md](CLAUDE.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md)).
	- No large artifacts added to git.
	- Commit message follows conventional style.
	- Endpoint contract honored (including streaming/resume semantics and cancel extension).