OCR Extract — Research Lab

Technical report · OCR-1

Thai procurement OCR — extraction benchmark


We compare two pipelines on a held-out set of Thai invoice and purchase-order documents. Fast is a deep-learning hybrid (XGBoost + LiLT + regex). Pro replaces the NER stage with qwen3.5-27b. The reported metric is field-level fuzzy accuracy: a text field counts as correct if the prediction and ground truth contain one another or the character error rate (CER) is below 0.15; a numeric field counts as correct only on an exact match after normalisation.
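To make the text rule concrete, here is a minimal sketch of a fuzzy text match under a few assumptions: CER is taken as Levenshtein distance divided by ground-truth length, both strings are whitespace-trimmed first, and an empty string only matches another empty string. The function names (`levenshtein`, `cer`, `fuzzy_text_hit`) are illustrative and are not taken from the benchmark code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, truth: str) -> float:
    """Character error rate, assumed here to be edit distance / len(truth)."""
    if not truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, truth) / len(truth)


def fuzzy_text_hit(prediction: str, truth: str, threshold: float = 0.15) -> bool:
    """Hit if either string contains the other, or CER is below the threshold."""
    pred, gold = prediction.strip(), truth.strip()
    if not pred or not gold:
        # Assumption: an empty string must not match via trivial containment.
        return pred == gold
    return pred in gold or gold in pred or cer(pred, gold) < threshold
```

For example, a prediction that drops a trailing company suffix still scores as a hit through containment, while a short string with one wrong character out of four (CER 0.25) does not.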

§1 Headline results

Pipeline	Backend	Accuracy	CER	Latency	Δ vs Fast
(Rows are loaded from benchmark.json.)

§2 Accuracy vs latency, per document

Figure 1. Per-document accuracy plotted against end-to-end latency. ★ = mean. Higher and further left is better.

§3 Methodology

Held-out set: first 30 documents of data/golden/golden.jsonl.
Fuzzy text match: for .name / .address / .contact / .description, hit if the prediction is a substring of the ground truth (or vice versa) or the character error rate is < 0.15.
Tax IDs: all non-digits stripped, first 13 digits compared.
Dates: separators normalised (slash↔dot) and Buddhist year converted (2569 ↔ 2026).
Numerics: commas and trailing zeros dropped, then exact match.
Items: evaluated row-by-row across 8 sub-fields; empty ground-truth cells are skipped.
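The normalisation rules for tax IDs, dates and numerics can be sketched as below. This is an illustration rather than the benchmark's actual code: the function names are hypothetical, dates are assumed to be in day/month/year order, years of 2400 or later are assumed to be Buddhist Era, and the zero-padded output format is a choice made only for this example.

```python
import re

BUDDHIST_ERA_OFFSET = 543  # 2569 BE corresponds to 2026 CE


def normalise_tax_id(value: str) -> str:
    """Strip everything except ASCII digits, then keep the first 13 digits."""
    return re.sub(r"[^0-9]", "", value)[:13]


def normalise_date(value: str) -> str:
    """Unify slash/dot separators and convert a Buddhist Era year to Gregorian."""
    text = value.strip()
    parts = re.split(r"[./]", text)
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return text  # not a recognisable date; leave unchanged
    day, month, year = (int(p) for p in parts)  # assumed day/month/year order
    if year >= 2400:  # heuristic assumption: treat as a Buddhist Era year
        year -= BUDDHIST_ERA_OFFSET
    return f"{day:02d}/{month:02d}/{year:04d}"


def normalise_number(value: str) -> str:
    """Drop thousands separators and trailing zeros after the decimal point."""
    text = value.replace(",", "").strip()
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


def numeric_hit(prediction: str, truth: str) -> bool:
    """Exact match after normalisation."""
    return normalise_number(prediction) == normalise_number(truth)
```

With these helpers, `06.01.2569` and `6/1/2026` normalise to the same string, and `1,200.50` compares equal to `1200.5`.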

The scoring methodology mirrors that of the historical 80%+ benchmarks, so the results are directly comparable.
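The row-by-row item evaluation in §3 might look roughly like the sketch below. It assumes predicted and ground-truth rows are aligned by position and that a missing predicted row counts as a miss for every non-empty ground-truth cell; neither detail is stated in the report. The name `score_items` and the sub-field names in the usage example are hypothetical.

```python
from typing import Callable, Mapping, Sequence, Tuple

Row = Mapping[str, str]


def score_items(predicted: Sequence[Row],
                truth: Sequence[Row],
                is_hit: Callable[[str, str], bool]) -> Tuple[int, int]:
    """Return (hits, scored_cells), skipping empty ground-truth cells."""
    hits = 0
    total = 0
    for index, gold_row in enumerate(truth):
        # Assumption: rows are matched by position, not by content.
        pred_row = predicted[index] if index < len(predicted) else {}
        for field, gold_value in gold_row.items():
            if not str(gold_value).strip():
                continue  # empty ground-truth cell: not scored
            total += 1
            if is_hit(str(pred_row.get(field, "")), str(gold_value)):
                hits += 1
    return hits, total


# Usage with a plain exact-match comparator (sub-field names are illustrative).
gold_rows = [
    {"description": "ปากกา", "qty": "10", "unit": ""},
    {"description": "กระดาษ A4", "qty": "5", "unit": "รีม"},
]
pred_rows = [{"description": "ปากกา", "qty": "10", "unit": "ด้าม"}]
hits, total = score_items(pred_rows, gold_rows, lambda p, g: p.strip() == g.strip())
```

In this usage example the empty `unit` cell in the first ground-truth row is skipped, and the second row has no prediction, so two of five scored cells are hits.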
