How it works

A pipeline of well-chosen libraries.

Each step uses the most boring, reliable Python tool for the job — that's why it runs in seconds, not minutes. No transformer model, no cloud call, no surprise dependencies.

Pipeline stages

classify python-magic · pdfplumber
mime = magic.from_file(path, mime=True)
sample = pdf.pages[:3]
is_scanned = avg_chars(sample) < 50

Detects MIME type and samples the first three pages to decide whether the PDF has selectable text or needs OCR.
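A minimal sketch of how this check could be wired up. The helper names (`looks_scanned`, `classify`) and returning `False` for non-PDF files are assumptions for illustration, not the project's actual code. Only the 50-character threshold and the three-page sample come from the snippet above.

```python
def looks_scanned(page_texts, min_avg_chars=50):
    """Return True when the sampled pages carry too little selectable text."""
    if not page_texts:
        return True  # nothing to sample: assume OCR is needed
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_avg_chars


def classify(path, sample_pages=3):
    """Return (mime_type, is_scanned) for a file on disk."""
    # Third-party imports stay local so looks_scanned() has no dependencies.
    import magic        # python-magic
    import pdfplumber

    mime = magic.from_file(path, mime=True)
    if mime != "application/pdf":
        return mime, False  # assumption: only PDFs are routed to OCR here
    with pdfplumber.open(path) as pdf:
        texts = [(page.extract_text() or "") for page in pdf.pages[:sample_pages]]
    return mime, looks_scanned(texts)
```

Keeping the threshold logic in its own function means it can be tuned and tested without a PDF on disk.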

ocr pytesseract · pdf2image · Pillow
img = preprocess(image)   # grayscale,
                          # upscale, sharpen
data = pytesseract.image_to_data(img)
text, conf = reconstruct(data)

Renders pages at 300 DPI, preprocesses for contrast and runs Tesseract with per-word confidence scoring.
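One way the `reconstruct` step might look, as a sketch. It assumes Tesseract's word table is requested as a dict and that page confidence is the plain mean of the per-word scores. The 2x upscale factor and the function names are illustrative guesses, not values taken from the project.

```python
def reconstruct(data):
    """Join recognised words and average their confidences (0-100 scale)."""
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float or str
        if conf < 0 or not word.strip():
            continue  # -1 marks layout rows (blocks, lines), not words
        words.append(word)
        confs.append(conf)
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return " ".join(words), mean_conf


def ocr_pdf(path, dpi=300):
    """Render each page, preprocess it, and return [(text, conf), ...]."""
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import ImageFilter, ImageOps

    results = []
    for page in convert_from_path(path, dpi=dpi):
        gray = ImageOps.grayscale(page)
        big = gray.resize((gray.width * 2, gray.height * 2))  # assumed factor
        sharp = big.filter(ImageFilter.SHARPEN)
        data = pytesseract.image_to_data(
            sharp, output_type=pytesseract.Output.DICT)
        results.append(reconstruct(data))
    return results
```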

htr kraken · calamari (optional)
if ocr_conf < 65:
    text, conf = htr.transcribe(
        image, model=KRAKEN_MODEL)

Falls back to handwriting recognition on low-confidence pages. Configurable via HTR_KRAKEN_MODEL env var.
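The fallback rule itself is small enough to sketch without touching kraken. Here `transcribe` is a hypothetical callable standing in for whatever wraps the HTR engine, and skipping the fallback when `HTR_KRAKEN_MODEL` is unset is an assumption about how "optional" behaves.

```python
import os

HTR_THRESHOLD = 65.0  # mean OCR confidence below which HTR is tried


def with_htr_fallback(image, text, conf, transcribe, threshold=HTR_THRESHOLD):
    """Return the OCR result, or the HTR result on low-confidence pages."""
    model = os.environ.get("HTR_KRAKEN_MODEL")
    if conf >= threshold or not model:
        return text, conf  # OCR was good enough, or HTR is not configured
    # `transcribe` is hypothetical: it should return (text, confidence).
    return transcribe(image, model=model)
```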

tables pdfplumber · python-docx
for page in pdf.pages:
    for tbl in page.extract_tables():
        tables.append(clean(tbl))

Uses pdfplumber's line-based table detection (its default strategy, comparable to "lattice" mode in other tools) on native PDFs and python-docx for Word documents.
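A sketch of what the `clean` helper and the two extraction paths could look like. The cleaning rules (empty cells become `""`, whitespace trimmed, blank rows dropped) are assumptions. The snippet above does not say what `clean` actually does.

```python
def clean(table):
    """Normalise one table: no None cells, trimmed text, no blank rows."""
    rows = []
    for row in table:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            rows.append(cells)
    return rows


def pdf_tables(path):
    """Collect cleaned tables from every page of a native PDF."""
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for tbl in page.extract_tables():
                cleaned = clean(tbl)
                if cleaned:
                    tables.append(cleaned)
    return tables


def docx_tables(path):
    """Collect cleaned tables from a Word document."""
    from docx import Document  # python-docx

    return [
        clean([[cell.text for cell in row.cells] for row in table.rows])
        for table in Document(path).tables
    ]
```

Both paths return the same shape, a list of tables where each table is a list of rows of strings, so later stages need not care which format the file came in.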

entities spaCy · en_core_web_sm
doc = nlp(text)
entities = [(e.text, e.label_)
            for e in doc.ents]

Small spaCy model (~50 MB) loaded at startup; recognises persons, organisations, locations, dates and money.
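A small sketch of the entity step with the model passed in, so it is loaded once rather than per request. Restricting output to a fixed label set and dropping duplicates are assumptions. The label names are those used by spaCy's English models (`GPE` and `LOC` both cover locations).

```python
WANTED_LABELS = {"PERSON", "ORG", "GPE", "LOC", "DATE", "MONEY"}


def select_entities(pairs, wanted=WANTED_LABELS):
    """Keep (text, label) pairs with a wanted label, first occurrence only."""
    seen, kept = set(), []
    for text, label in pairs:
        key = (text.strip(), label)
        if label in wanted and key[0] and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def extract_entities(text, nlp):
    """Run a loaded spaCy pipeline over text and filter its entities."""
    doc = nlp(text)
    return select_entities((ent.text, ent.label_) for ent in doc.ents)
```

At startup this would be paired with something like `nlp = spacy.load("en_core_web_sm")`, then `extract_entities(text, nlp)` per document.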

key-values re (stdlib)
patterns = {"email": r"...",
            "phone": r"...",
            "amount": r"[$€£]\s*\d+..."}
results = {k: re.findall(p, text)
           for k, p in patterns.items()}

Plain compiled regex. Picks up emails, phones, URLs, dates and multi-currency amounts — plus invoice-specific fields when applicable.
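To make the idea concrete, here is a runnable sketch with three patterns. These regexes are illustrative stand-ins written for this example. They are deliberately simple and are not the patterns elided in the snippet above.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "amount": re.compile(r"[$€£]\s*\d+(?:[.,]\d{3})*(?:[.,]\d{2})?"),
}


def extract_key_values(text):
    """Return every match for each named pattern."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}


sample = ("Contact billing@example.com, total € 1,250.00 due. "
          "See https://example.com/inv/42")
print(extract_key_values(sample))
```

Compiling once at import time keeps the per-document cost down to the scans themselves.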

layout heuristics
for p in paragraphs:
    if short(p) and (
       p.isupper() or p.endswith(":")):
        headings.append(p)

Cheap heading detection by paragraph shape — no ML required for what is essentially a typography problem.
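The shape test can be sketched in a few lines. The eight-word cutoff for "short" and splitting paragraphs on blank lines are assumptions. The source only shows the all-caps and trailing-colon rules.

```python
def is_heading(paragraph, max_words=8):
    """Short paragraphs that are ALL CAPS or end with a colon."""
    text = paragraph.strip()
    if not text or len(text.split()) > max_words:
        return False
    return text.isupper() or text.endswith(":")


def find_headings(text):
    """Split on blank lines and keep the paragraphs shaped like headings."""
    return [p.strip() for p in text.split("\n\n") if is_heading(p)]
```

Note that `str.isupper()` is false for strings without letters, so a bare number such as "2024" is not mistaken for a heading.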

summary scikit-learn · TF-IDF
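tfidf = TfidfVectorizer(
    stop_words="english"
).fit_transform(sentences)
scores = np.asarray(tfidf.sum(axis=1)).ravel()
top = sorted(scores.argsort()[-5:])  # document order
summary = [sentences[i] for i in top]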

Extractive summary: score sentences by sum of TF-IDF weights, return the top five in original order.
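A self-contained sketch of that scoring, assuming sentence splitting has already happened upstream. The function name and the shortcut for inputs of `k` sentences or fewer are illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def summarize(sentences, k=5):
    """Return the k highest-scoring sentences, kept in document order."""
    if len(sentences) <= k:
        return list(sentences)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # One score per sentence: the sum of its TF-IDF weights.
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    # argsort is ascending, so the last k indices are the best k;
    # sorting those indices restores the original order.
    top = np.sort(np.argsort(scores)[-k:])
    return [sentences[i] for i in top]
```

Sorting the selected indices is what puts the result back in reading order. Slicing `argsort` alone would return sentences ordered by score.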

language langid (norm_probs)
ident = LanguageIdentifier.from_modelstring(
    model, norm_probs=True)
lang, p = ident.classify(text)

Normalised 0–1 confidence rather than raw log-probability, so the score you see in the UI is actually meaningful.
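To show why the normalised figure is easier to read, here is a sketch of turning per-language log-probabilities into 0 to 1 confidences with a softmax. This illustrates the idea only. It is assumed to be roughly what `norm_probs=True` does and is not langid's own code. The example scores are made up.

```python
import math


def normalise(log_probs):
    """Map {language: log-probability} to probabilities that sum to 1."""
    peak = max(log_probs.values())
    # Subtracting the peak first keeps exp() from underflowing to zero.
    exps = {lang: math.exp(lp - peak) for lang, lp in log_probs.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


raw = {"en": -1200.0, "de": -1203.0, "nl": -1210.0}  # made-up scores
probs = normalise(raw)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

A raw score like -1200 says little on its own, while the normalised value shows how far ahead the winning language is of the alternatives.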

runtime stack

runtime: python 3.14
server: quart + hypercorn (asgi)
concurrency: asyncio.to_thread
progress: server-sent events
deploy: railway · docker
cleanup: 1h ttl on uploads
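How the concurrency and progress rows fit together can be sketched with the standard library alone. `blocking_stage` is a stand-in for any CPU-bound pipeline step, and the event name and payload fields are assumptions. In the real app an async generator like this would presumably be returned from a Quart route with the `text/event-stream` content type.

```python
import asyncio
import json
import time


def sse(event, data):
    """Format one server-sent event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def blocking_stage(name):
    """Stand-in for a blocking step such as OCR or table extraction."""
    time.sleep(0.01)
    return f"{name} done"


async def run_pipeline(stages):
    """Run each stage off the event loop and yield a progress frame after it."""
    for step, name in enumerate(stages, start=1):
        result = await asyncio.to_thread(blocking_stage, name)
        yield sse("progress", {"stage": name, "step": step,
                               "total": len(stages), "result": result})


async def collect(stages):
    return [frame async for frame in run_pipeline(stages)]


frames = asyncio.run(collect(["classify", "ocr"]))
print(frames[0], end="")
```

Because each stage runs in a worker thread, the event loop stays free to flush progress frames and serve other requests while a page is being processed.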

Why these choices

Transformer-based pipelines exist for every one of these stages, but they cost orders of magnitude more compute for marginal gains on common business documents. The libraries listed above are the same ones used at production scale in larger systems — they just happen to also fit comfortably on a single 512 MB Railway container.