Spaces:
				
			
			
	
			
			
					
		Running
		
	
	
	
			
			
	
	
	
	
		
		
					
		Running
		
	| # VLM Playground (PreviewSpace) β Product Requirements Document | |
| ## Summary | |
| An internal Gradio Blocks app for rapid, structured experimentation with a Vision-Language Model (initially `dots.ocr`). It mirrors the reference playground but is deliberately minimal: stateless by default, no run history, focused on feeling model performance. Supports PDF/image upload, page preview and navigation, page-range parsing, and result views (Markdown Render, Markdown Raw Text, Current Page JSON) with preserved scroll position. Designed to run locally or on Hugging Face Spaces. | |
| ## Goals | |
| - **Fast iteration**: Upload, prompt, parse, iterate in seconds with minimal ceremony. | |
| - **Model-light**: Start with one model (`dots.ocr`), optional model selector later. No provider switching UI. | |
| - **Structured output**: First-class JSON output and markdown preview. | |
| - **Stateless by default**: No run history or persistence beyond the current browser session unless explicitly downloading. | |
| - **Document-centric UX**: Multi-page PDF preview, page navigation, per-page execution, and page-range parsing. | |
| ## Non-Goals | |
| - Not a full labeling platform or production extraction pipeline. | |
| - Not a dataset hosting service or long-term data store for PHI. | |
| - Not a fine-tuning/training product; inference playground only. | |
| - No bounding-box drawing or manual annotation tools in v1. | |
| ## Primary Users / Personas | |
| - **Applied Researcher / Data Scientist**: Tries different prompts/models, collects structured outputs. | |
| - **ML Engineer**: Prototypes pipelines, compares providers, validates latency/cost. | |
| - **Domain Expert (e.g., Clinical Analyst)**: Uses curated templates to extract specific fields. | |
| ## Key User Stories | |
| - As a user, I can upload a PDF or image, select a template prompt, and click Parse to see Markdown and JSON results. | |
| - As a user, I can preview pages, specify a page range to parse, and run per-page extraction. | |
| - As a user, I can jump to a specific page index in a PDF and use Prev/Next controls. | |
| - As a user, I can switch between result tabs (Markdown Render, Markdown Raw Text, Current Page JSON) without losing scroll position. | |
| - As a user, I can download the results for my current session as a ZIP or JSON/Markdown. | |
| - As a user, I can tweak the prompt and basic model settings and quickly re-run. | |
| ## UX Requirements (inspired by dots.ocr playground) | |
| - **Left Panel β Upload & Select** | |
| - Drag-and-drop or file picker for PNG/JPG/PDF; show file name and size. | |
| - Optional Examples dropdown (curated sample docs and pre-baked prompts). | |
| - File ingestion for PDFs extracts page thumbnails and page count. | |
| - **Left Panel β Prompt & Actions** | |
| - Prompt Template select; Current Prompt editor (multiline with variable chips). | |
| - Actions: Parse (primary), Clear (secondary). | |
| - Show prompt variables, e.g., `bbox`, `category`, `page_number`. | |
| - **Left Panel β Advanced Configuration** | |
| - Preprocessing toggle (fitz-like DPI upsample for low-res images). | |
| - Minimal server/model config: Host/Port for local inference or a dropdown for on-host models. | |
| - Page selection: single page, page range, or all. | |
| - **Center β File Preview** | |
| - Large page preview with pan/zoom; page navigator (Prev/Next and page picker). | |
| - Page jump field to go directly to page N. | |
| - **Right Panel β Result Display** | |
| - Tabs: Markdown Render Preview, Markdown Raw Text, Current Page JSON. | |
| - Preserve scroll position when switching tabs. | |
| - Copy-to-clipboard and a Download Results button. | |
| ## Functional Requirements | |
| - **File Handling** | |
| - Accept PDF (up to 300 pages) and images (PNG/JPG/WebP). Max upload 50 MB (configurable). | |
| - Extract page images for preview; store temp files locally (ephemeral) with TTL. | |
| - Provide page-level selection and batching. | |
| - **Prompting** | |
| - Template library with variables and descriptions. Variables can be sourced from UI state (page, bbox list) or user input. | |
| - System prompt + user prompt fields; allow few-shot examples. | |
| - Presets for common tasks (layout extraction, table extraction, key-value extraction, captioning). | |
| - **Model Support** | |
| - Start with `dots.ocr` via the official parser or REST endpoint. | |
| - Optional: dropdown to switch among `dots.ocr` model variants if present on the host. No cross-provider switching UI. | |
| - **Execution** | |
| - Run per-page or whole-document, controlled by UI. Concurrency limit (default 3). | |
| - Timeouts and retries surfaced to UI; cancellation supported. | |
| - Caching: request hash on (file checksum, page, prompt, params, model) to avoid recomputation. | |
| - **Outputs** | |
| - Markdown Render, Raw Markdown, and Current Page JSON. | |
| - Export: Download button to export combined Markdown, per-page JSONL, and all artifacts as a ZIP. | |
| - **Examples Gallery** | |
| - Preloaded example docs and templates to demonstrate patterns (OCR table, K/V extraction, figure captioning, layout detection). | |
| - **Observability** | |
| - Show basic runtime info (latency, model id) inline; no history or centralized logs in v1. | |
| ## Data Model (high-level) | |
| - In-memory, per-session structures only; no database. | |
| - **Document**: id, name, type, checksum, page count, temp storage path, created_at. | |
| - **Page**: document_id, page_index, image_path, width, height, preview_thumbnail. | |
| - **Template**: id, name, description, model_defaults, prompt_text, output_schema (optional JSON Schema), variables. | |
| ## JSON Output Guidance | |
| - For structured tasks, templates may specify an output schema. The UI validates model JSON and highlights issues. | |
| - All results stored as JSON lines per page with summary aggregation. | |
| ## Security & Compliance | |
| - Internal-only; access requires SSO or VPN. | |
| - Sensitive documents (e.g., PHI) processed only against approved providers/endpoints. Warn when a provider is external. | |
| - Ephemeral storage with TTL auto-clean; configurable retention. Redact logs where needed. | |
| ## Performance Targets | |
| - Cold start to first parse: < 10s on typical PDFs (<= 20 pages) with network providers. | |
| - Per-page preview render: < 500ms after page image generation. | |
| - Concurrency: default 3 parallel page requests; configurable up to 10. | |
| - Throughput: 1,000 pages/day per user on average use without manual scaling. | |
| ## Error States & Edge Cases | |
| - Unsupported file types or oversize files; clear messaging and guardrails. | |
| - Pages with extreme aspect ratios or very small text; suggest preprocessing. | |
| - Provider rate limits; exponential backoff and UI feedback. | |
| - Invalid model JSON; surface diffs and attempt best-effort JSON repair (opt-in). | |
| ## Architecture (proposed) | |
| - **App**: Single Gradio Blocks app (Python). No separate backend required. | |
| - **Execution**: Use `uv run` locally. Designed to run as-is on Hugging Face Spaces. | |
| - **Model**: `dots.ocr` via local parser or REST endpoint; configurable host/port. | |
| - **Storage**: Ephemeral `/tmp/previewspace/*`; cleared at session end or TTL. | |
| - **Caching**: Optional on-disk cache keyed by content hash + prompt + params + model. | |
| ## API Surface (v1) | |
| - Pure Gradio callbacks; no public REST API. Optional: expose simple `/healthz`. | |
| ## Templates (initial set) | |
| - **Layout Extraction**: Return list of elements with `bbox`, `category`, and `text` within bbox. | |
| - **Table Extraction**: Return rows/columns as structured JSON; include confidence and cell bboxes. | |
| - **Key-Value Extraction**: Extract specified fields with locations and normalized values. | |
| - **Captioning/Description**: Summarize or caption selected regions or whole pages. | |
| ## Privacy-by-Design Defaults | |
| - Local processing preferred where possible; clear visual indicator when sending to external APIs. | |
| - Redaction utilities for logs; toggle to disable request logging entirely. | |
| ## Success Metrics | |
| - Time-to-first-result after upload. | |
| - Number of saved runs and templates re-used. | |
| - Reduction in manual extraction time for a representative task. | |
| - User satisfaction (quick pulse after saved runs). | |
| ## Release Plan | |
| - **M1 (v0.1) β Core Playground** | |
| - Upload PDF/image; page preview and navigation. | |
| - Parse with one provider; show Markdown and JSON; save runs; export JSON. | |
| - Basic provider config (host/port/api key) and preprocessing toggle. | |
| - Acceptance: A user can replicate a layout extraction example end-to-end in < 2 minutes. | |
| - **M2 (v0.2) β Templates, Regions, and Examples** | |
| - Template library + editor; draw/save bboxes; per-page runs; examples gallery. | |
| - Multiple providers; concurrency and caching; logs and token usage. | |
| - Acceptance: A user can create a new template with variables and run it across 10 pages with regions in one click. | |
| - **M3 (v0.3) β Projects and Evals** | |
| - Projects grouping; batch runs over documents; dataset export; simple eval harness with spot checks. | |
| - Acceptance: A user can run a project over 100 pages and export an evaluation-ready JSONL in < 10 minutes. | |
| ## Open Questions | |
| - Do we require strict JSON schema validation with auto-repair, or soft validation with warnings? | |
| - What are the approved external providers for sensitive documents? | |
| - Should we include table renderers in the UI, or keep to JSON/Markdown only? | |
| - How long should run artifacts persist by default (e.g., 7 days)? | |
| ## Risks & Mitigations | |
| - **External API variability**: Abstract through connectors; provide stubs/mocks for local dev. | |
| - **Document diversity**: Offer preprocessing toggles and template variables; maintain an examples gallery. | |
| - **Cost visibility**: Track token usage and estimated cost per run; warn when large batches are selected. | |
| ## Appendices | |
| ### Example: Layout Extraction Prompt (concept) | |
| ```text | |
| System: You are a vision-language model that outputs structured JSON only. | |
| User: Please output the layout information from the PDF page image. For each element, return: | |
| - bbox: [x1, y1, x2, y2] in image pixels | |
| - category: string label from {"title","header","paragraph","table","figure","footnote"} | |
| - text: content within bbox | |
| Return JSON: {"elements": [{"bbox": [..], "category": "..", "text": ".."}], "page": <number>}. | |
| ``` | |