API reference

Document Intelligence API

A local Python pipeline that turns PDFs, Word documents and images into structured JSON. Six endpoints, server-sent events for progress, nothing leaves your machine.

Endpoints

POST /api/process-document
Upload a file as multipart/form-data in field file. Optional form field htr_mode: auto (default), force, off. Returns {"task_id": "...", "status": "queued"}.
GET /api/status/<task_id>
Current task state. Includes htr_attempted and htr_used diagnostics; error and hint on failure.
GET /api/stream/<task_id>
Server-sent events, emitted every 500 ms until the task is completed, failed or cancelled.
GET /api/results/<task_id>
Full JSON result. metadata.timings holds per-stage durations in seconds; metadata.extraction.pages[] gives per-page OCR confidence and HTR flags.
GET /api/samples
Lists sample files served from static/samples/.
GET /api/admin/htr-status
Reports installed HTR backends (Kraken / Calamari) and whether configured model files exist on disk.
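
The stream endpoint can be consumed with the standard library alone. The sketch below is illustrative, not the project's own client: it assumes each event carries a JSON object on a data: line and that the object has a status field using the terminal state names listed above. The payload shape is not specified in this reference, and the names iter_sse_events, wait_for_terminal and stream_task are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator, Optional

# Terminal states named by the /api/stream description.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each server-sent event.

    Assumption: every event's data field is a JSON object.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line ends the event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields (event:, id:, retry:) and comments are ignored here.


def wait_for_terminal(lines: Iterable[str]) -> Optional[dict]:
    """Return the first event whose status is terminal, or None."""
    for event in iter_sse_events(lines):
        if event.get("status") in TERMINAL_STATES:  # "status" key is assumed
            return event
    return None


def stream_task(task_id: str, base_url: str = "http://localhost:8000") -> Optional[dict]:
    """Follow /api/stream/<task_id> until the task reaches a terminal state."""
    with urllib.request.urlopen(f"{base_url}/api/stream/{task_id}") as resp:
        return wait_for_terminal(line.decode("utf-8") for line in resp)
```

Calling stream_task("abcd-1234") against a running server would block until the task finishes; the parsing functions can be exercised offline with any list of text lines.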

Quickstart

curl -F "file=@/path/to/doc.pdf" \
     -F "htr_mode=auto" \
     http://localhost:8000/api/process-document
# → {"task_id": "abcd-1234", "status": "queued"}

curl http://localhost:8000/api/results/abcd-1234
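
The same flow can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions: the status response is assumed to expose a status field with the values completed, failed or cancelled, and the helper names encode_multipart, submit and wait_for_result are hypothetical. The error and hint fields are the ones described for /api/status.

```python
import json
import time
import urllib.request
import uuid
from pathlib import Path

BASE_URL = "http://localhost:8000"  # same host and port as the curl example


def encode_multipart(fields, file_field, filename, content):
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def submit(path, htr_mode="auto"):
    """POST a document to /api/process-document and return its task_id."""
    p = Path(path)
    body, content_type = encode_multipart(
        {"htr_mode": htr_mode}, "file", p.name, p.read_bytes()
    )
    req = urllib.request.Request(
        f"{BASE_URL}/api/process-document",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["task_id"]


def wait_for_result(task_id, poll_seconds=0.5):
    """Poll /api/status until the task ends, then fetch /api/results."""
    while True:
        with urllib.request.urlopen(f"{BASE_URL}/api/status/{task_id}") as resp:
            state = json.load(resp)
        # Assumption: the status payload has a "status" key.
        if state.get("status") in ("completed", "failed", "cancelled"):
            break
        time.sleep(poll_seconds)
    if state["status"] != "completed":
        raise RuntimeError(f"{state.get('error')} (hint: {state.get('hint')})")
    with urllib.request.urlopen(f"{BASE_URL}/api/results/{task_id}") as resp:
        return json.load(resp)
```

With a server running locally, wait_for_result(submit("/path/to/doc.pdf")) would return the parsed result JSON.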

Handwriting recognition

Pages with low OCR confidence (default threshold 65) are retried through a local HTR backend.

  • auto: OCR first, HTR fallback per page when confidence is low. Recommended for mixed documents.
  • force: Try HTR first on every page. Use when you know the document is handwritten.
  • off: Disable HTR entirely. Faster, useful when no models are installed.
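
The three modes can be summarised as a per-page decision. The function below is a sketch of the behaviour described above, not the pipeline's actual code: the name should_run_htr is hypothetical, and treating "low" as strictly below the threshold is an assumption.

```python
DEFAULT_CONFIDENCE_THRESHOLD = 65  # default threshold stated above


def should_run_htr(htr_mode, ocr_confidence, threshold=DEFAULT_CONFIDENCE_THRESHOLD):
    """Decide whether a page is sent to the HTR backend.

    Assumption: a page counts as low confidence when its OCR
    confidence is strictly below the threshold.
    """
    if htr_mode == "off":
        return False  # HTR disabled entirely
    if htr_mode == "force":
        return True  # HTR is tried on every page
    if htr_mode == "auto":
        return ocr_confidence < threshold  # fallback only for weak OCR
    raise ValueError(f"unknown htr_mode: {htr_mode!r}")
```

For example, in auto mode a page with OCR confidence 40 would be retried through HTR, while a page at 90 would keep its OCR text.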

Enable HTR locally:

export HTR_KRAKEN_MODEL=/path/to/kraken/model.checkpoint
# or
export HTR_CALAMARI_MODEL=/path/to/calamari/model.zip
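
Before starting the server, it can help to confirm that the exported paths point at real files. The snippet below is a standalone check written for illustration; it approximates what /api/admin/htr-status is described as reporting but is not that endpoint's implementation, and the function name configured_htr_models is hypothetical.

```python
import os
from pathlib import Path

# Environment variable names taken from the export lines above.
MODEL_ENV_VARS = {
    "kraken": "HTR_KRAKEN_MODEL",
    "calamari": "HTR_CALAMARI_MODEL",
}


def configured_htr_models(env=None):
    """Report, per backend, the configured model path and whether it exists."""
    env = os.environ if env is None else env
    report = {}
    for backend, var in MODEL_ENV_VARS.items():
        path = env.get(var)
        report[backend] = {
            "path": path,
            # An unset or empty variable counts as "no model configured".
            "exists": bool(path) and Path(path).is_file(),
        }
    return report


if __name__ == "__main__":
    for backend, info in configured_htr_models().items():
        print(backend, info)
```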

Limits

Contact

Interested in running this internally or in a production deployment? Reach out at guch79@gmail.com.