VLM-playground / PRD.md

trevorpfiz

iterating to get pdfs ocr working

ae9d014 3 months ago

	# VLM Playground (PreviewSpace) — Product Requirements Document

	## Summary

	An internal Gradio Blocks app for rapid, structured experimentation with a Vision-Language Model (initially `dots.ocr`). It mirrors the reference playground but is deliberately minimal: stateless by default, with no run history, and focused on getting a quick feel for model performance. It supports PDF/image upload, page preview and navigation, page-range parsing, and three result views (Markdown Render, Markdown Raw Text, Current Page JSON) with preserved scroll position. It is designed to run locally or on Hugging Face Spaces.

	## Goals

	- Fast iteration: Upload, prompt, parse, iterate in seconds with minimal ceremony.
	- Model-light: Start with one model (`dots.ocr`); an optional model selector may come later. No provider switching UI.
	- Structured output: First-class JSON output and markdown preview.
	- Stateless by default: No run history or persistence beyond the current browser session unless explicitly downloading.
	- Document-centric UX: Multi-page PDF preview, page navigation, per-page execution, and page-range parsing.

	## Non-Goals

	- Not a full labeling platform or production extraction pipeline.
	- Not a dataset hosting service or long-term data store for PHI.
	- Not a fine-tuning/training product; inference playground only.
	- No bounding-box drawing or manual annotation tools in v1.

	## Primary Users / Personas

	- Applied Researcher / Data Scientist: Tries different prompts/models, collects structured outputs.
	- ML Engineer: Prototypes pipelines, compares providers, validates latency/cost.
	- Domain Expert (e.g., Clinical Analyst): Uses curated templates to extract specific fields.

	## Key User Stories

	- As a user, I can upload a PDF or image, select a template prompt, and click Parse to see Markdown and JSON results.
	- As a user, I can preview pages, specify a page range to parse, and run per-page extraction.
	- As a user, I can jump to a specific page index in a PDF and use Prev/Next controls.
	- As a user, I can switch between result tabs (Markdown Render, Markdown Raw Text, Current Page JSON) without losing scroll position.
	- As a user, I can download the results for my current session as a ZIP or JSON/Markdown.
	- As a user, I can tweak the prompt and basic model settings and quickly re-run.

	## UX Requirements (inspired by dots.ocr playground)

	- Left Panel: Upload & Select
	  - Drag-and-drop or file picker for PNG/JPG/PDF; show file name and size.
	  - Optional Examples dropdown (curated sample docs and pre-baked prompts).
	  - File ingestion for PDFs extracts page thumbnails and page count.
	- Left Panel: Prompt & Actions
	  - Prompt Template select; Current Prompt editor (multiline with variable chips).
	  - Actions: Parse (primary), Clear (secondary).
	  - Show prompt variables, e.g., `bbox`, `category`, `page_number`.
	- Left Panel: Advanced Configuration
	  - Preprocessing toggle (fitz-like DPI upsample for low-res images).
	  - Minimal server/model config: Host/Port for local inference or a dropdown for on-host models.
	  - Page selection: single page, page range, or all.
	- Center: File Preview
	  - Large page preview with pan/zoom; page navigator (Prev/Next and page picker).
	  - Page jump field to go directly to page N.
	- Right Panel: Result Display
	  - Tabs: Markdown Render Preview, Markdown Raw Text, Current Page JSON.
	  - Preserve scroll position when switching tabs.
	  - Copy-to-clipboard and a Download Results button.
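
The page selection control (single page, page range, or all) needs a small parser behind it. The following is a minimal sketch, assuming a 1-based, comma-separated syntax such as `1-3,5`; the function name `parse_page_range` and the syntax itself are illustrative choices, not something this document specifies.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into sorted 0-based page indices.

    An empty spec or "all" selects every page. The syntax is an assumption.
    """
    spec = spec.strip().lower()
    if spec in ("", "all"):
        return list(range(page_count))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        pages.update(range(start - 1, end))
    return sorted(pages)
```

For a 10-page PDF, `parse_page_range("1-3,5", 10)` selects the first three pages and the fifth. Returning 0-based indices keeps the result directly usable for indexing a list of page images.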

	## Functional Requirements

	- File Handling
	  - Accept PDF (up to 300 pages) and images (PNG/JPG/WebP). Max upload 50 MB (configurable).
	  - Extract page images for preview; store temp files locally (ephemeral) with TTL.
	  - Provide page-level selection and batching.
	- Prompting
	  - Template library with variables and descriptions. Variables can be sourced from UI state (page, bbox list) or user input.
	  - System prompt + user prompt fields; allow few-shot examples.
	  - Presets for common tasks (layout extraction, table extraction, key-value extraction, captioning).
	- Model Support
	  - Start with `dots.ocr` via the official parser or REST endpoint.
	  - Optional: dropdown to switch among `dots.ocr` model variants if present on the host. No cross-provider switching UI.
	- Execution
	  - Run per-page or whole-document, controlled by UI. Concurrency limit (default 3).
	  - Timeouts and retries surfaced to UI; cancellation supported.
	  - Caching: request hash on (file checksum, page, prompt, params, model) to avoid recomputation.
	- Outputs
	  - Markdown Render, Raw Markdown, and Current Page JSON.
	  - Export: Download button to export combined Markdown, per-page JSONL, and all artifacts as a ZIP.
	- Examples Gallery
	  - Preloaded example docs and templates to demonstrate patterns (OCR table, K/V extraction, figure captioning, layout detection).
	- Observability
	  - Show basic runtime info (latency, model id) inline; no history or centralized logs in v1.
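
The export requirement (combined Markdown plus per-page JSONL in one ZIP) can be met without touching disk. This is a sketch under assumptions: the function name `build_results_zip`, the archive member names, and the shape of each page dict (`page`, `markdown`, `json`) are hypothetical.

```python
import io
import json
import zipfile


def build_results_zip(pages: list[dict]) -> bytes:
    """Bundle combined Markdown and per-page JSONL into one in-memory ZIP.

    Each page dict is assumed to carry "page" (int), "markdown" (str)
    and "json" (dict); this shape is illustrative only.
    """
    ordered = sorted(pages, key=lambda p: p["page"])
    combined_md = "\n\n".join(p["markdown"] for p in ordered)
    jsonl = "".join(
        json.dumps({"page": p["page"], "result": p["json"]}) + "\n" for p in ordered
    )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("results.md", combined_md)
        archive.writestr("pages.jsonl", jsonl)
    return buffer.getvalue()
```

Building the archive in memory fits the stateless default: nothing is persisted unless the user saves the download.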

	## Data Model (high-level)

	- In-memory, per-session structures only; no database.
	- Document: id, name, type, checksum, page count, temp storage path, created_at.
	- Page: document_id, page_index, image_path, width, height, preview_thumbnail.
	- Template: id, name, description, model_defaults, prompt_text, output_schema (optional JSON Schema), variables.
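
The three structures above map naturally onto dataclasses. In the sketch below the field names come from the list above, while the types and default values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Document:
    id: str
    name: str
    type: str  # assumed values: "pdf" or "image"
    checksum: str
    page_count: int
    temp_storage_path: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Page:
    document_id: str
    page_index: int  # assumed 0-based
    image_path: str
    width: int
    height: int
    preview_thumbnail: Optional[str] = None  # assumed to be a file path


@dataclass
class Template:
    id: str
    name: str
    description: str
    prompt_text: str
    model_defaults: dict = field(default_factory=dict)
    output_schema: Optional[dict] = None  # optional JSON Schema
    variables: list[str] = field(default_factory=list)
```

Plain dataclasses held in per-session state are enough here because nothing is written to a database.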

	## JSON Output Guidance

	- For structured tasks, templates may specify an output schema. The UI validates model JSON and highlights issues.
	- All results are stored as JSON Lines, one record per page, with a summary aggregation.
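
One possible reading of "JSON lines per page with summary aggregation" is sketched below. The record shape (`page`, `elements`, `category`) follows the layout prompt in the appendix; the function names and the exact summary fields are assumptions.

```python
import json
from collections import Counter


def to_jsonl(page_results: list[dict]) -> str:
    """Serialize one JSON object per line, one line per page."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in page_results)


def summarize(page_results: list[dict]) -> dict:
    """Count pages and elements, grouped by element category."""
    categories = Counter(
        element.get("category", "unknown")
        for result in page_results
        for element in result.get("elements", [])
    )
    return {
        "pages": len(page_results),
        "elements": sum(categories.values()),
        "by_category": dict(categories),
    }
```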

	## Security & Compliance

	- Internal-only; access requires SSO or VPN.
	- Sensitive documents (e.g., PHI) processed only against approved providers/endpoints. Warn when a provider is external.
	- Ephemeral storage with TTL auto-clean; configurable retention. Redact logs where needed.
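
TTL auto-clean can be as simple as a periodic sweep of the temp directory. The sketch below assumes each session or artifact is a top-level entry under one root directory and uses modification time as the age signal; `purge_expired` is a hypothetical helper name.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def purge_expired(root: Path, ttl_seconds: float, now: Optional[float] = None) -> list[str]:
    """Delete top-level entries under root whose mtime is older than ttl_seconds.

    Returns the names of the removed entries, sorted.
    """
    now = time.time() if now is None else now
    removed: list[str] = []
    if not root.is_dir():
        return removed
    for entry in root.iterdir():
        if now - entry.stat().st_mtime > ttl_seconds:
            if entry.is_dir():
                shutil.rmtree(entry, ignore_errors=True)
            else:
                entry.unlink(missing_ok=True)
            removed.append(entry.name)
    return sorted(removed)
```

The `now` parameter exists only to make the sweep testable without waiting for real time to pass.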

	## Performance Targets

	- Cold start to first parse: < 10s on typical PDFs (<= 20 pages) with network providers.
	- Per-page preview render: < 500ms after page image generation.
	- Concurrency: default 3 parallel page requests; configurable up to 10.
	- Throughput: 1,000 pages/day per user under average use, without manual scaling.
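
The concurrency target (default 3, configurable up to 10) can be enforced with a bounded thread pool. This is a sketch, not the app's implementation: `run_pages` is a hypothetical helper, and `parse_page` stands in for whatever callable sends a single page to the model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_CONCURRENCY = 10  # upper bound from the targets above


def run_pages(
    page_indices: list[int],
    parse_page: Callable[[int], dict],
    concurrency: int = 3,
) -> list[dict]:
    """Parse pages in parallel, clamping the worker count to 1..MAX_CONCURRENCY."""
    workers = max(1, min(concurrency, MAX_CONCURRENCY))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even if pages finish out of order.
        return list(pool.map(parse_page, page_indices))
```

Threads suit this workload when each page request is I/O-bound (a REST call to the model host); a CPU-bound local parser would need a different strategy.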

	## Error States & Edge Cases

	- Unsupported file types or oversize files; clear messaging and guardrails.
	- Pages with extreme aspect ratios or very small text; suggest preprocessing.
	- Provider rate limits; exponential backoff and UI feedback.
	- Invalid model JSON; surface diffs and attempt best-effort JSON repair (opt-in).
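
For the rate-limit case, exponential backoff with jitter might look like the sketch below. `RateLimitError` is a hypothetical exception standing in for whatever the model client raises, and the attempt count and base delay are assumed defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error a model client would raise when rate limited."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on RateLimitError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the UI surface the error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from parallel page requests.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The injectable `sleep` lets the UI layer report each wait to the user, and lets tests run without real delays.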

	## Architecture (proposed)

	- App: Single Gradio Blocks app (Python). No separate backend required.
	- Execution: Use `uv run` locally. Designed to run as-is on Hugging Face Spaces.
	- Model: `dots.ocr` via local parser or REST endpoint; configurable host/port.
	- Storage: Ephemeral `/tmp/previewspace/*`; cleared at session end or TTL.
	- Caching: Optional on-disk cache keyed by content hash + prompt + params + model.
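
A cache key over content hash, prompt, params, and model can be derived with a canonical JSON encoding. This is a minimal sketch assuming SHA-256 and JSON-serializable params; the function names are illustrative.

```python
import hashlib
import json


def file_checksum(data: bytes) -> str:
    """Content hash of the uploaded file."""
    return hashlib.sha256(data).hexdigest()


def cache_key(checksum: str, page: int, prompt: str, params: dict, model: str) -> str:
    """Stable key for one (file, page, prompt, params, model) request."""
    payload = json.dumps(
        {"checksum": checksum, "page": page, "prompt": prompt, "params": params, "model": model},
        sort_keys=True,  # dict ordering must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys before hashing means two requests with the same params in a different order hit the same cache entry.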

	## API Surface (v1)

	- Pure Gradio callbacks; no public REST API. Optional: expose simple `/healthz`.

	## Templates (initial set)

	- Layout Extraction: Return list of elements with `bbox`, `category`, and `text` within bbox.
	- Table Extraction: Return rows/columns as structured JSON; include confidence and cell bboxes.
	- Key-Value Extraction: Extract specified fields with locations and normalized values.
	- Captioning/Description: Summarize or caption selected regions or whole pages.
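
Templates carry variables that are filled from UI state or user input. This document does not define a placeholder syntax, so the sketch below assumes `$name` placeholders and uses the standard library's `string.Template`; `render_prompt` is a hypothetical helper.

```python
from string import Template


def render_prompt(prompt_text: str, variables: dict) -> str:
    """Fill $name placeholders; raises KeyError if a variable is missing."""
    return Template(prompt_text).substitute({k: str(v) for k, v in variables.items()})
```

For example, `render_prompt("Extract $category elements on page $page_number.", {"category": "table", "page_number": 3})` yields `Extract table elements on page 3.`. Failing loudly on a missing variable is a deliberate choice here, so that an incomplete prompt is never sent to the model.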

	## Privacy-by-Design Defaults

	- Local processing preferred where possible; clear visual indicator when sending to external APIs.
	- Redaction utilities for logs; toggle to disable request logging entirely.

	## Success Metrics

	- Time-to-first-result after upload.
	- Number of saved runs and templates re-used.
	- Reduction in manual extraction time for a representative task.
	- User satisfaction (quick pulse after saved runs).

	## Release Plan

	- M1 (v0.1): Core Playground
	  - Upload PDF/image; page preview and navigation.
	  - Parse with one provider; show Markdown and JSON; save runs; export JSON.
	  - Basic provider config (host/port/api key) and preprocessing toggle.
	  - Acceptance: A user can replicate a layout extraction example end-to-end in < 2 minutes.
	- M2 (v0.2): Templates, Regions, and Examples
	  - Template library + editor; draw/save bboxes; per-page runs; examples gallery.
	  - Multiple providers; concurrency and caching; logs and token usage.
	  - Acceptance: A user can create a new template with variables and run it across 10 pages with regions in one click.
	- M3 (v0.3): Projects and Evals
	  - Projects grouping; batch runs over documents; dataset export; simple eval harness with spot checks.
	  - Acceptance: A user can run a project over 100 pages and export an evaluation-ready JSONL in < 10 minutes.

	## Open Questions

	- Do we require strict JSON schema validation with auto-repair, or soft validation with warnings?
	- What are the approved external providers for sensitive documents?
	- Should we include table renderers in the UI, or keep to JSON/Markdown only?
	- How long should run artifacts persist by default (e.g., 7 days)?

	## Risks & Mitigations

	- External API variability: Abstract through connectors; provide stubs/mocks for local dev.
	- Document diversity: Offer preprocessing toggles and template variables; maintain an examples gallery.
	- Cost visibility: Track token usage and estimated cost per run; warn when large batches are selected.

	## Appendices

	### Example: Layout Extraction Prompt (concept)

	```text
	System: You are a vision-language model that outputs structured JSON only.
	User: Please output the layout information from the PDF page image. For each element, return:
	- bbox: [x1, y1, x2, y2] in image pixels
	- category: string label from {"title","header","paragraph","table","figure","footnote"}
	- text: content within bbox
	Return JSON: {"elements": [{"bbox": [..], "category": "..", "text": ".."}], "page": <number>}.
	```
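
A response to the prompt above could be checked with a small hand-written validator before it reaches the Current Page JSON tab. This is a sketch of soft validation (collect issues, do not repair); the function name and issue messages are assumptions, and a real build might use a JSON Schema library instead.

```python
import json
from typing import Optional

ALLOWED_CATEGORIES = {"title", "header", "paragraph", "table", "figure", "footnote"}


def validate_layout(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse a model response and list problems; returns (data, issues)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("elements"), list):
        return None, ['expected an object with an "elements" list']

    issues: list[str] = []
    for i, element in enumerate(data["elements"]):
        if not isinstance(element, dict):
            issues.append(f"elements[{i}]: not an object")
            continue
        bbox = element.get("bbox")
        is_numeric = isinstance(bbox, list) and len(bbox) == 4 and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox
        )
        if not is_numeric:
            issues.append(f"elements[{i}]: bbox must be four numbers")
        elif bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
            issues.append(f"elements[{i}]: bbox needs x1 < x2 and y1 < y2")
        category = element.get("category")
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            issues.append(f"elements[{i}]: unknown category {category!r}")
        if not isinstance(element.get("text"), str):
            issues.append(f"elements[{i}]: text must be a string")
    return data, issues
```

An empty issue list means the response matches the shape requested in the prompt; a non-empty list can be shown next to the raw JSON so the user sees what the model got wrong.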