How it works
A pipeline of well-chosen libraries.
Each step uses the most boring, reliable Python tool for the job — that's why it runs in seconds, not minutes. No transformer model, no cloud call, no surprise dependencies.
Pipeline stages
mime = magic.from_file(path, mime=True)
sample = pdf.pages[:3]
is_scanned = avg_chars(sample) < 50
Detects MIME type and samples the first three pages to decide whether the PDF has selectable text or needs OCR.
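A minimal sketch of the scanned-versus-native check, assuming the text of each page has already been extracted into a list of strings (for example with pdfplumber's `page.extract_text()`, which can return `None` for empty pages). The names `avg_chars` and `is_scanned` and the 50-character threshold mirror the snippet above; ignoring whitespace when counting is an assumption.

```python
from typing import Optional, Sequence


def avg_chars(page_texts: Sequence[Optional[str]]) -> float:
    """Average number of non-whitespace characters per sampled page."""
    if not page_texts:
        return 0.0
    # Treat None (no text layer) as an empty page and ignore whitespace.
    counts = [len("".join((text or "").split())) for text in page_texts]
    return sum(counts) / len(counts)


def is_scanned(page_texts: Sequence[Optional[str]],
               sample_size: int = 3,
               threshold: int = 50) -> bool:
    """True when the first few pages carry too little text to be a native PDF."""
    sample = list(page_texts)[:sample_size]
    return avg_chars(sample) < threshold
```

Sampling only the first pages keeps the check cheap on long documents; a file that mixes native and scanned pages would need a per-page decision instead.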
img = preprocess(image)  # grayscale, upscale, sharpen
data = pytesseract.image_to_data(img)
text, conf = reconstruct(data)
Renders pages at 300 DPI, preprocesses for contrast and runs Tesseract with per-word confidence scoring.
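One way `reconstruct` could look, as a sketch only. It assumes the word data arrives as a dictionary of parallel lists, which is what `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)` returns, and that page confidence is the plain mean of the per-word confidences. The page does not say how the real function aggregates them.

```python
from collections import defaultdict


def reconstruct(data):
    """Rebuild line-ordered text and a mean word confidence (0-100)."""
    lines = defaultdict(list)
    confs = []
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        # Tesseract reports -1 for boxes that are not words (blocks, lines).
        if conf < 0 or not word.strip():
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
        confs.append(conf)
    # Sorting the keys restores block / paragraph / line reading order.
    text = "\n".join(" ".join(words) for _, words in sorted(lines.items()))
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return text, mean_conf
```

The returned confidence is what a downstream threshold, such as the handwriting fallback, would compare against.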
if ocr_conf < 65:
    text, conf = htr.transcribe(
        image, model=KRAKEN_MODEL)
Falls back to handwriting recognition when a page's OCR confidence drops below 65. The model is configurable via the HTR_KRAKEN_MODEL environment variable.
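The fallback rule can be sketched as a small function. Everything except the threshold of 65 and the `HTR_KRAKEN_MODEL` variable name is hypothetical: `htr_transcribe` stands in for whatever wraps the handwriting model, and the default model filename is a placeholder.

```python
import os

OCR_CONF_THRESHOLD = 65  # threshold taken from the snippet above


def best_transcription(ocr_text, ocr_conf, htr_transcribe, image=None):
    """Return (text, conf, engine); call the HTR callable only on low-confidence pages."""
    if ocr_conf >= OCR_CONF_THRESHOLD:
        return ocr_text, ocr_conf, "ocr"
    # Hypothetical default; the real default model name is not documented here.
    model = os.environ.get("HTR_KRAKEN_MODEL", "default.mlmodel")
    htr_text, htr_conf = htr_transcribe(image, model=model)
    return htr_text, htr_conf, "htr"
```

Passing the transcriber in as a callable keeps the slower handwriting model off the common path and makes the rule easy to test with a stub.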
for page in pdf.pages:
    for tbl in page.extract_tables():
        tables.append(clean(tbl))
Uses pdfplumber's lattice-based table detection on native PDFs and python-docx for Word documents.
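A possible shape for the `clean` helper, assuming the input is what pdfplumber's `extract_tables()` yields: a list of rows whose cells are strings or `None`. The specific clean-up rules (collapse whitespace, drop empty rows and columns) are assumptions, not the documented behaviour.

```python
def clean(table):
    """Normalise a table given as a list of rows of str-or-None cells."""
    rows = []
    for row in table:
        # Collapse internal whitespace and newlines; None becomes "".
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):  # skip rows that are entirely empty
            rows.append(cells)
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # Keep only columns that hold a value in at least one row.
    keep = [i for i in range(width)
            if any(i < len(r) and r[i] for r in rows)]
    return [[r[i] if i < len(r) else "" for i in keep] for r in rows]
```

Ruled tables often come back with spurious empty columns where cell borders double up, which is why the column pass is included.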
doc = nlp(text)
entities = [(e.text, e.label_)
            for e in doc.ents]
Small spaCy model (~50 MB) loaded at startup; recognises persons, organisations, locations, dates and money.
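A sketch of how the entity list might be narrowed to the types named above. It works on any object exposing spaCy's `doc.ents` interface (`.text` and `.label_` per entity). The label set follows spaCy's English label scheme, and the de-duplication step is an assumption.

```python
# spaCy labels for persons, organisations, locations, dates and money.
WANTED_LABELS = {"PERSON", "ORG", "GPE", "LOC", "DATE", "MONEY"}


def extract_entities(doc, wanted=WANTED_LABELS):
    """Return unique (text, label) pairs for the wanted labels, in document order."""
    seen = set()
    entities = []
    for ent in doc.ents:
        item = (ent.text.strip(), ent.label_)
        if ent.label_ in wanted and item not in seen:
            seen.add(item)
            entities.append(item)
    return entities


# Typical use, assuming a small English model is installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")  # model name is an assumption
#   extract_entities(nlp("Acme Ltd paid $400 to Jane Doe in Paris."))
```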
patterns = {"email": r"...",
            "phone": r"...",
            "amount": r"[$€£]\s*\d+..."}
results = {k: re.findall(p, text)
           for k, p in patterns.items()}
Plain compiled regex. Picks up emails, phones, URLs, dates and multi-currency amounts — plus invoice-specific fields when applicable.
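For illustration, a compiled-pattern version with concrete expressions. These patterns are illustrative stand-ins, not the ones the pipeline ships with (those are elided above), and they are deliberately loose rather than standards-complete.

```python
import re

# Illustrative patterns only; the production expressions are not shown on this page.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>]+"),
    # Currency symbol, optional space, digits with optional grouping and decimals.
    "amount": re.compile(r"[$€£]\s*\d+(?:[.,]\d{3})*(?:[.,]\d{2})?"),
}


def extract_fields(text):
    """Run every pattern over the text and collect all matches per field."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}


sample = ("Contact billing@example.com or see https://example.com/pay "
          "for the €1,250.00 balance and a $ 40 fee.")
fields = extract_fields(sample)
# fields["amount"] == ["€1,250.00", "$ 40"]
```

Compiling once at import time avoids re-parsing the expressions for every document.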
for p in paragraphs:
    if short(p) and (
            p.isupper() or p.endswith(":")):
        headings.append(p)
Cheap heading detection by paragraph shape — no ML required for what is essentially a typography problem.
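A runnable version of the same heuristic. The snippet does not define `short`, so the eight-word cut-off here is a hypothetical choice.

```python
MAX_HEADING_WORDS = 8  # hypothetical cut-off for what counts as "short"


def short(paragraph):
    """A paragraph is short if it is non-empty and has few words."""
    return 0 < len(paragraph.split()) <= MAX_HEADING_WORDS


def find_headings(paragraphs):
    """Flag short paragraphs that are all caps or end with a colon."""
    headings = []
    for p in paragraphs:
        p = p.strip()
        if short(p) and (p.isupper() or p.endswith(":")):
            headings.append(p)
    return headings
```

Note that `str.isupper()` is false for strings without any cased characters, so a bare number such as "2024" is not mistaken for a heading.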
tfidf = TfidfVectorizer(
    stop_words="english"
).fit_transform(sentences)
scores = np.asarray(tfidf.sum(axis=1)).ravel()
top = [sentences[i] for i in sorted(scores.argsort()[-5:])]
Extractive summary: score sentences by sum of TF-IDF weights, return the top five in original order.
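The scoring idea can be shown without scikit-learn. This standard-library sketch is a stand-in for `TfidfVectorizer`, not the pipeline's code: it uses the same smoothed IDF form, ln((1 + n) / (1 + df)) + 1, but skips the L2 row normalisation scikit-learn applies by default, and its stop-word list is a tiny illustrative one.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; scikit-learn's "english" list is far longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on",
              "for", "it", "this", "was"}


def tokenize(sentence):
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(sentences, k=5):
    """Return the k highest-scoring sentences, kept in their original order."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    # Sentence score = sum of its TF-IDF weights.
    scores = [sum(tf * idf[t] for t, tf in doc.items()) for doc in docs]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Summing unnormalised weights favours longer sentences; that bias is usually acceptable for a short extractive summary but worth knowing about.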
ident = LanguageIdentifier.from_modelstring(
    model, norm_probs=True)
lang, p = ident.classify(text)
Normalised 0–1 confidence rather than raw log-probability, so the score you see in the UI is actually meaningful.
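To see why normalisation matters, here is a sketch that turns raw per-language log-probabilities into values between 0 and 1. It assumes the normalisation amounts to a softmax over the candidate languages; the scores in the usage note are made up for illustration.

```python
import math


def normalise(log_probs):
    """Map {lang: log-probability} to {lang: probability}, summing to 1."""
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_probs.values())
    exps = {lang: math.exp(lp - peak) for lang, lp in log_probs.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def classify(log_probs):
    """Return the most likely language and its normalised confidence."""
    probs = normalise(log_probs)
    lang = max(probs, key=probs.get)
    return lang, probs[lang]
```

With made-up scores such as `{"en": -120.0, "de": -125.0, "fr": -140.0}`, the raw numbers say little on their own, while the normalised result reports "en" with a confidence just above 0.99.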
runtime stack
Why these choices
Transformer-based pipelines exist for every one of these stages, but they cost orders of magnitude more compute for marginal gains on common business documents. The libraries listed above are the same ones used at production scale in larger systems — they just happen to also fit comfortably on a single 512 MB Railway container.