OCR & LLM-based solutions Pipeline for Historical Documents


Welcome to the documentation for the OCR and LLM-based pipeline for historical documents (Dictionnaire de Trévoux)!

Table of Contents

  • Introduction
    • OCR Pipeline Description
  • Evaluation Metrics
    • Character Error Rate (CER)
    • Word Error Rate (WER)
    • Normalisation Before Evaluation
  • Models
    • Mistral OCR
    • Qwen2.5-VL 7b (Ollama)
    • Post-Correction: Mistral Large
    • Model Comparison Summary
    • Kraken+Ciaconna
  • Results
    • Experimental Setup
    • Results: Mistral OCR + Mistral Large
    • Results: Qwen2.5-VL 7b
    • Cross-Model Comparison (Page Pe)
    • Key Observations
    • Results: GLM-OCR (Zhipu AI)
  • API Reference
    • run_mistral_ocr(image_path)
    • clean_markdown(text)
    • run_mistral_correction(ocr_text)
    • normalize(text)
    • cer(gold, pred)
    • wer(gold, pred)
    • levenshtein_chars(a, b)
    • levenshtein_words(a, b)
  • OCR Pipeline: Dictionnaire de Trévoux
    • Architecture
    • OCR Engines Tested
    • Identified Problem: Column Reading Order
    • LLM Correction
    • Evaluation: CER and WER
    • Next Steps

© Copyright 2026, LIRIS, Lyon.
