- **Input**: Cleaned or raw HTML and a JSON Schema
- **Output**: Strict JSON that conforms to the provided schema
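
For instance, a minimal sketch of that contract, assuming a hypothetical product-page schema (the field names are illustrative, not from the actual training data or docs):

```python
# Hypothetical example: a minimal JSON Schema and the strict JSON the
# model should return for a matching product page.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price"],
}

# Expected model output: nothing but JSON conforming to the schema, e.g.
# {"title": "Acme Anvil 10kg", "price": 19.99, "in_stock": true}
```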
## Benchmarks
### HTML-to-JSON Extraction Quality
We evaluated extraction quality using Gemini 2.5 Pro as a judge, scoring extractions from 1 to 5, where 5 represents a perfect extraction.

| Model | LLM-as-Judge Score |
|-------|-------------------|
| GPT-4.1 | 4.74 |
| **Schematron-8B** | **4.64** |
| **Schematron-3B** | **4.41** |
| Gemini-3B-Base | 2.24 |
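
The judging pass itself can be a simple rubric prompt. A minimal sketch, assuming a rubric of our own wording and a `call_gemini` placeholder rather than the actual evaluation harness:

```python
# Sketch of an LLM-as-judge scoring pass. The rubric wording and the
# call_gemini() stub are illustrative assumptions, not the real harness.
JUDGE_PROMPT = """Rate this extraction from 1 to 5, where 5 means every
field required by the schema was extracted from the HTML correctly.

Schema: {schema}
Source HTML: {html}
Extraction: {extraction}

Reply with a single integer from 1 to 5."""


def call_gemini(prompt: str) -> str:
    """Placeholder for a Gemini 2.5 Pro API call."""
    raise NotImplementedError


def judge_extraction(schema: str, html: str, extraction: str) -> int:
    reply = call_gemini(
        JUDGE_PROMPT.format(schema=schema, html=html, extraction=extraction)
    )
    return int(reply.strip())
```
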
### Web-Augmented Factuality on SimpleQA
We evaluated Schematron's real-world impact on LLM factuality using SimpleQA.
**Test Pipeline:**
1. **Query Generation**: Primary LLM (GPT-5 Nano or GPT-4.1) generates search queries and defines the extraction schema
2. **Web Search**: Search provider (SERP or Exa) retrieves relevant pages
3. **Structured Extraction**: Schematron extracts JSON data from the retrieved pages using the schema
4. **Answer Synthesis**: Primary LLM produces the final answer from the structured data (the whole loop is sketched below)
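
A condensed sketch of this loop; every helper below is a placeholder for the real call (primary LLM, SERP/Exa, Schematron), not an actual API:

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    html: str


def generate_queries_and_schema(question: str) -> tuple[list[str], dict]:
    """Step 1: primary LLM proposes search queries and an extraction schema."""
    raise NotImplementedError  # call GPT-5 Nano / GPT-4.1 here


def search_web(query: str) -> list[Page]:
    """Step 2: search provider (SERP or Exa) returns candidate pages."""
    raise NotImplementedError


def schematron_extract(html: str, schema: dict) -> str:
    """Step 3: Schematron returns strict JSON conforming to `schema`."""
    raise NotImplementedError


def synthesize_answer(question: str, records: list[dict]) -> str:
    """Step 4: primary LLM answers from the structured records."""
    raise NotImplementedError


def answer_question(question: str) -> str:
    queries, schema = generate_queries_and_schema(question)
    pages = [page for q in queries for page in search_web(q)]
    records = [json.loads(schematron_extract(p.html, schema)) for p in pages]
    return synthesize_answer(question, records)
```
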
| Base Model | Configuration | SimpleQA Accuracy |
|:-----------|:--------------|------------------:|
| GPT-5 Nano | Solo | 8.54% |
| GPT-5 Nano | + SERP + Schematron-8B | 64.15% |
| GPT-5 Nano | + Exa + **Schematron-3B** | **75.47%** |
| GPT-5 Nano | + Exa + Gemini 2.5 Flash | 80.61% |
| GPT-5 Nano | + Exa + **Schematron-8B** | **82.87%** |
| GPT-4.1 | Solo | 41.60% |
| GPT-4.1 | + Exa + **Schematron-8B** | **85.58%** |

**Key findings:**
- Web search paired with JSON extraction improves factuality: adding Schematron with web retrieval lifts GPT-5 Nano's accuracy from 8.54% to 82.87%, nearly a 10x improvement
- Search provider matters: Exa (82.9%) significantly outperforms SERP (64.2%) for factual retrieval, while also being more cost-effective
- Structured extraction beats raw HTML: processing raw HTML would require 100k+ tokens for 10 searches; Schematron's JSON extraction reduces this by orders of magnitude
- Small specialized models win: Schematron-8B (82.87%) outperforms the much larger Gemini 2.5 Flash (80.61%) on this task, showing that fine-tuning for a well-defined task can beat general-purpose models
- Performance scales with model quality: paired with GPT-4.1, Schematron reaches 85.58% accuracy, showing the approach benefits from a stronger base model

## Minimal Quickstart
Use these local snippets to prepare HTML and compose a schema‑guided prompt. The model returns strictly valid JSON; validate it against your schema downstream.
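
As one possibility, a minimal sketch of both steps, assuming BeautifulSoup for cleanup and a prompt template of our own design (the exact format Schematron expects may differ):

```python
import json

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def clean_html(raw_html: str) -> str:
    """Drop tags that carry no extractable content to cut token count."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return str(soup)


def build_prompt(html: str, schema: dict) -> str:
    """Compose a schema-guided prompt; the layout here is an assumption."""
    return (
        "Extract JSON matching this schema from the HTML below.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"HTML:\n{html}"
    )
```

Whatever template you use, validate the returned JSON against the same schema (for example with `jsonschema.validate`) before passing it downstream.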