OCR & LLM-based solutions Pipeline for Historical Documents
Welcome to the documentation for the OCR & LLM-based pipeline for historical documents (Dictionnaire de Trévoux)!
Table of Contents
Introduction
OCR Pipeline Description
Evaluation Metrics
Character Error Rate (CER)
Word Error Rate (WER)
Normalisation Before Evaluation
Models
Mistral OCR
Qwen2.5-VL 7b (Ollama)
Post-Correction: Mistral Large
Model Comparison Summary
Kraken+Ciaconna
Results
Experimental Setup
Results: Mistral OCR + Mistral Large
Results: Qwen2.5-VL 7b
Cross-Model Comparison (Page Pe)
Key Observations
Results: GLM-OCR (Zhipu AI)
API Reference
run_mistral_ocr(image_path)
clean_markdown(text)
run_mistral_correction(ocr_text)
normalize(text)
cer(gold, pred)
wer(gold, pred)
levenshtein_chars(a, b)
levenshtein_words(a, b)
OCR Pipeline: Dictionnaire de Trévoux
Architecture
OCR Engines Tested
Identified Problem: Column Reading
LLM Correction
Evaluation: CER and WER
Next Steps