Request for Training Data Examples
#1, opened by ldslcpt
Hi! I'm fine-tuning the dots.ocr model for document understanding tasks and want to confirm that my training data matches the expected format.
Could you please share some sample data? Specifically:
JSONL training data samples
- A few lines from a working training dataset
- This would help me verify my data preparation pipeline
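To illustrate the kind of pipeline check these samples would support, here is a minimal sketch that only tests the generic JSONL property: one valid JSON value per line. It assumes each record should be a JSON object, which is a guess on my side, and the function name `check_jsonl` is hypothetical. It does not check any dots.ocr-specific field names, since those are exactly what I'm asking about.

```python
import json


def check_jsonl(text):
    """Return (line_number, problem) pairs for lines that are not JSON objects.

    Assumption: every training record is a JSON object (dict). The real
    dots.ocr schema is unknown here, so no field names are checked.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
    return problems
```

With real samples, the same loop could be extended to compare the keys of my records against the keys in a known-good line.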
PAGEXML + JPEG pairs
- Sample PAGEXML files with corresponding images
- This would help me understand the annotation structure and coordinate system
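For context on the coordinate question, here is a rough sketch of how polygon coordinates could be read from a PAGE XML file. It assumes the common PAGE convention of a `Coords` element with a `points="x1,y1 x2,y2 ..."` attribute holding integer pixel coordinates; older schema versions that use child `Point` elements are not handled. The function name `coords_boxes` is hypothetical, and whether dots.ocr expects polygons or boxes like these is something the samples would need to confirm.

```python
import xml.etree.ElementTree as ET


def coords_boxes(pagexml_text):
    """Return an (x_min, y_min, x_max, y_max) box for every Coords element.

    Assumption: coordinates are stored as points="x1,y1 x2,y2 ..." with
    integer values. Tags are matched by local name so the sketch does not
    depend on a particular PAGE schema version.
    """
    root = ET.fromstring(pagexml_text)
    boxes = []
    for elem in root.iter():
        # Drop the "{namespace}" prefix that ElementTree puts before tag names.
        if elem.tag.rsplit("}", 1)[-1] != "Coords":
            continue
        points = elem.get("points")
        if not points:
            continue
        pairs = [pair.split(",") for pair in points.split()]
        xs = [int(x) for x, _ in pairs]
        ys = [int(y) for _, y in pairs]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Comparing these boxes against the JPEG dimensions would be a quick way to spot a mismatched coordinate system once I have a reference pair.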
Data preparation guidelines
- Any additional best practices for data preparation
- Common pitfalls to avoid