document-intel
Demo mode · 1 upload per IP per hour · uploads auto-deleted after processing
Live demo · Self-hosted · No API keys

Extract text, tables, and structure from any document.

Self-hosted OCR pipeline. PDF, DOCX, or image → clean JSON in roughly 2 seconds. No external APIs. No per-call billing. Call it from anywhere.

pdf · docx · png · jpg · no API key needed · ~2s avg response · runs anywhere Python runs
See what the pipeline extracts in ~2 seconds
PDF · DOCX · PNG · JPG · max 10 MB
Handwriting (HTR) · Max size 10 MB · Timeout 30 s

What it extracts

~2s typical · all stages run locally
classify · ocr · tables · entities · key-values · layout · summary

Powered by pdfplumber, pytesseract, spaCy, and scikit-learn. Handwriting fallback when OCR confidence drops.
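
The handwriting fallback can be pictured as a simple threshold check on OCR confidence. The sketch below is an illustration only, not the pipeline's actual logic: the function names and the threshold of 60 are assumptions. It takes a list of per-word confidences such as the `conf` values returned by `pytesseract.image_to_data`, where -1 marks boxes that contain no recognized word.

```python
def mean_word_confidence(confidences):
    """Average per-word OCR confidences, skipping negative entries."""
    valid = []
    for c in confidences:
        value = float(c)  # some pytesseract versions return strings
        if value >= 0:    # -1 marks boxes with no recognized word
            valid.append(value)
    return sum(valid) / len(valid) if valid else 0.0


def should_fall_back_to_htr(confidences, threshold=60.0):
    """Return True when mean confidence is below the (assumed) threshold.

    Hypothetical helper: the real pipeline may use a different rule.
    """
    return mean_word_confidence(confidences) < threshold
```

A page with no recognized words averages to 0.0 here, so it would also be routed to the handwriting path.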

Use it from anywhere

No UI required. Same pipeline, callable from any HTTP client. The upload returns a task_id; poll /api/status/<id> until status is "completed", then fetch /api/results/<id>.

# Upload — returns task_id
curl -F "file=@invoice.pdf" \
     -F "htr_mode=auto" \
     http://document-intelligence-pipeline-production.up.railway.app/api/process-document
# → {"task_id":"abc-123","status":"queued"}

# Poll status
curl http://document-intelligence-pipeline-production.up.railway.app/api/status/abc-123
# → {"task_id":"abc-123","status":"completed","progress":100}

# Fetch results
curl http://document-intelligence-pipeline-production.up.railway.app/api/results/abc-123 > result.json
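
The poll-then-fetch steps above can also be scripted. Here is a minimal Python sketch using only the standard library, assuming you already have a task_id from the upload call. The helper names, the polling interval, the 30-second cap, and the "failed" status value are assumptions for illustration; only the endpoints and the "completed" status come from the examples above.

```python
import json
import time
import urllib.request

BASE_URL = "http://document-intelligence-pipeline-production.up.railway.app"


def get_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_result(task_id, fetch=get_json, interval=1.0,
                    max_wait=30.0, sleep=time.sleep):
    """Poll /api/status/<id> until completed, then return /api/results/<id>."""
    waited = 0.0
    while True:
        status = fetch(f"{BASE_URL}/api/status/{task_id}")
        if status.get("status") == "completed":
            return fetch(f"{BASE_URL}/api/results/{task_id}")
        if status.get("status") == "failed":  # assumed status value
            raise RuntimeError(f"task {task_id} failed: {status}")
        if waited >= max_wait:
            raise TimeoutError(f"task {task_id} not done after {max_wait}s")
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    # Replace abc-123 with the task_id returned by the upload call.
    print(json.dumps(wait_for_result("abc-123"), indent=2))
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access.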