Technical report · OCR-1
Thai procurement OCR: extraction benchmark
We compare two pipelines on a held-out set of Thai invoice and purchase-order documents.
Fast is a deep-learning hybrid (XGBoost + LiLT + regex). Pro replaces the NER stage with
qwen3.5-27b. The reported metric is field-level fuzzy accuracy
(containment OR character error rate (CER) < 0.15 for text; normalised exact match for numerics).
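The text half of the metric can be illustrated with a minimal sketch. It assumes CER is the Levenshtein distance divided by the ground-truth length, and that an empty prediction only counts as a hit when the ground truth is also empty; the report states neither, and the function names (`levenshtein`, `cer`, `fuzzy_text_hit`) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, truth: str) -> float:
    """Character error rate. Assumption: normalised by ground-truth length."""
    if not truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, truth) / len(truth)


def fuzzy_text_hit(prediction: str, truth: str, threshold: float = 0.15) -> bool:
    """Hit if either string contains the other, or CER is below the threshold."""
    pred, gold = prediction.strip(), truth.strip()
    if not pred or not gold:
        # Assumption: an empty string never matches by containment.
        return pred == gold
    return pred in gold or gold in pred or cer(pred, gold) < threshold
```

Distances here are counted over Unicode code points, so a Thai base character and its combining vowel or tone mark count as separate characters. Whether the benchmark counts them that way is not stated.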
§1 Headline results
| Pipeline | Backend | Accuracy | CER | Latency | Δ vs Fast |
|---|---|---|---|---|---|
| loading benchmark.json … | | | | | |
§2 Accuracy vs latency, per document
§3 Methodology
- Held-out set: first 30 documents of data/golden/golden.jsonl.
- Fuzzy text match: for .name / .address / .contact / .description, hit if the prediction is a substring of the ground truth (or vice versa) or the character error rate is < 0.15.
- Tax IDs: all non-digits stripped, first 13 digits compared.
- Dates: separators normalised (slash→dot) and Buddhist year converted (2569 → 2026).
- Numerics: commas and trailing zeros dropped, then exact match.
- Items: evaluated row-by-row across 8 sub-fields; empty ground-truth cells are skipped.
The scoring methodology mirrors the historical 80%+ benchmarks so that results can be compared directly.
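The normalisation rules for tax IDs, dates and numerics listed above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the function names are hypothetical, only ASCII digits are treated as digits, a four-digit year of 2400 or later is taken to be a Buddhist-era year (offset 543), and trailing zeros are dropped only after a decimal point.

```python
import re


def normalise_tax_id(value: str) -> str:
    """Strip every non-digit and keep the first 13 digits."""
    # Assumption: only ASCII digits count; Thai numerals are not mapped.
    return re.sub(r"[^0-9]", "", value)[:13]


def normalise_date(value: str) -> str:
    """Replace slashes with dots and convert Buddhist-era years to Gregorian."""
    text = value.strip().replace("/", ".")

    def to_gregorian(match: re.Match) -> str:
        year = int(match.group())
        # Assumption: years >= 2400 are Buddhist era (BE = CE + 543).
        return str(year - 543) if year >= 2400 else match.group()

    # Only touch standalone four-digit groups, not parts of longer numbers.
    return re.sub(r"(?<![0-9])[0-9]{4}(?![0-9])", to_gregorian, text)


def normalise_number(value: str) -> str:
    """Drop thousands separators and trailing zeros after the decimal point."""
    text = value.strip().replace(",", "")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


def numeric_match(prediction: str, truth: str) -> bool:
    """Exact match after normalisation."""
    return normalise_number(prediction) == normalise_number(truth)
```

For example, `normalise_date("15/03/2569")` yields `"15.03.2026"`, matching the conversion quoted in the list, and `numeric_match("1,250.00", "1250")` is true.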