📄 OCR Extract: Test UI

Thai invoice / PO → 33 fields with bounding-box overlays
Technical report · OCR-1

Thai procurement OCR: extraction benchmark

We compare two pipelines on a held-out set of Thai invoice / purchase-order documents. Fast is a deep-learning hybrid (XGBoost + LiLT + regex). Pro replaces the NER stage with qwen3.5-27b. The reported metric is field-level fuzzy accuracy: a text field counts as correct if the prediction and the ground truth contain one another or their character error rate (CER) is below 0.15; a numeric field counts as correct only on an exact match after normalisation.
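The text-field criterion can be illustrated with a minimal sketch. It assumes CER is the Levenshtein edit distance divided by the ground-truth length, which is the usual definition but is not confirmed by this report. The function names `levenshtein`, `cer`, and `text_hit` are hypothetical and do not come from the benchmark code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(pred: str, truth: str) -> float:
    """Character error rate, assumed here to be distance / len(truth)."""
    if not truth:
        return 0.0 if not pred else 1.0
    return levenshtein(pred, truth) / len(truth)


def text_hit(pred: str, truth: str, threshold: float = 0.15) -> bool:
    """Containment in either direction OR CER below the threshold."""
    pred, truth = pred.strip(), truth.strip()
    if not pred or not truth:
        # An empty string is a substring of everything, so handle it explicitly.
        return pred == truth
    return pred in truth or truth in pred or cer(pred, truth) < threshold


print(text_hit("ABC Co.", "ABC Co., Ltd."))          # containment
print(text_hit("Bangkok 10110", "Bangkok 10120"))    # one edit in 13 chars
print(text_hit("foo", "bar"))                        # neither rule applies
```

The explicit empty-string guard is a design choice of this sketch: without it, an empty prediction would pass the containment test against any ground truth.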

§1 Headline results

Pipeline Backend Accuracy CER Latency Δ vs Fast
loading benchmark.json …

§2 Accuracy vs latency, per document

Figure 1. Per-document accuracy plotted against end-to-end latency. ★ = mean. Higher and further left is better.

§3 Methodology

  • Held-out set: first 30 documents of data/golden/golden.jsonl.
  • Fuzzy text match: for .name / .address / .contact / .description, hit if the prediction is a substring of the ground truth (or vice versa) or the character error rate is < 0.15.
  • Tax IDs: all non-digits stripped, first 13 digits compared.
  • Dates: separators normalised (slash↔dot) and Buddhist Era years converted to CE (2569 ↔ 2026, an offset of 543).
  • Numerics: commas and trailing zeros dropped, then exact match.
  • Items: evaluated row-by-row across 8 sub-fields; empty ground-truth cells are skipped.
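The normalisation rules in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual code: all function names are hypothetical, the cutoff of 2400 for recognising a Buddhist Era year is an assumption, and item rows are assumed to be aligned by index.

```python
import re


def norm_tax_id(value: str) -> str:
    """Strip all non-digits and keep the first 13 digits."""
    return re.sub(r"\D", "", value)[:13]


def norm_date(value: str) -> str:
    """Unify slash/dot separators and convert Buddhist Era years to CE."""
    parts = re.split(r"[./]", value.strip())
    out = []
    for part in parts:
        # Assumption: a 4-digit number >= 2400 is a Buddhist Era year (BE = CE + 543).
        if part.isdigit() and len(part) == 4 and int(part) >= 2400:
            part = str(int(part) - 543)
        out.append(part)
    return "/".join(out)


def norm_number(value: str) -> str:
    """Drop thousands separators and trailing zeros after the decimal point."""
    cleaned = value.replace(",", "").strip()
    if "." in cleaned:
        cleaned = cleaned.rstrip("0").rstrip(".")
    return cleaned


def score_items(pred_rows, truth_rows, match):
    """Count (hits, total) over item rows, skipping empty ground-truth cells."""
    hits = total = 0
    for i, truth in enumerate(truth_rows):
        # Assumption: predicted rows are aligned with ground-truth rows by index.
        pred = pred_rows[i] if i < len(pred_rows) else {}
        for key, expected in truth.items():
            if not expected.strip():
                continue  # empty ground-truth cell: not scored
            total += 1
            hits += bool(match(pred.get(key, ""), expected))
    return hits, total


print(norm_tax_id("0-1055-48123-45-6"))   # 0105548123456
print(norm_date("15.03.2569"))            # 15/03/2026
print(norm_number("1,200.50"))            # 1200.5
```

A missing predicted row is scored as all misses here rather than skipped, so omitted line items lower the accuracy; whether the benchmark does the same is not stated in this report.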

The scoring methodology mirrors that of the historical 80%+ benchmarks, so the results are directly comparable.

🎲 Sample documents β€” click to extract