Introduction

This project evaluates the performance of an OCR system applied to an 18th-century French historical document, the Dictionnaire de Trévoux (Tome 5).

The main challenges include:

Old orthography (e.g., “portoient”, “appelloit”).
Non-standard punctuation and spelling.
Degraded scan quality.
Historical vocabulary.

OCR Pipeline Description

We propose a pipeline combining:

1. Mistral OCR for raw text extraction:

Input: scanned historical document images.
Output: raw OCR transcription.
Preserves basic layout (lines, spacing when available).
Handles complex structures (columns, footnotes, dictionary entries).

2. Cleaning and normalization:

Removes OCR noise (artifacts, markdown, broken characters).
Standardizes whitespace and line breaks.
Applies Unicode normalization and handles special characters (e.g., ſ → s, æ → ae).
Output: cleaned text ready for correction.
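
The cleaning step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name clean_ocr_text, the CHAR_MAP table, the choice of NFC normalization, and the specific markdown patterns are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed character mapping, taken from the two examples in the text.
CHAR_MAP = {"ſ": "s", "æ": "ae"}


def clean_ocr_text(raw: str) -> str:
    """Clean raw OCR output (hypothetical helper, illustrative only)."""
    # Compose accented letters into single code points (assumed choice: NFC).
    text = unicodedata.normalize("NFC", raw)

    # Replace special characters explicitly.
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)

    # Remove simple markdown residue: image placeholders, heading
    # markers at line start, and emphasis or backtick markers.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"(\*\*|__|\*|`)", "", text)

    # Drop control/format characters (keeping newline and tab) and the
    # Unicode replacement character left by broken glyphs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    text = text.replace("\ufffd", "")

    # Standardize whitespace and line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, clean_ocr_text("Ils  portoient des ſouliers.") collapses the double space and replaces the long s. Note that stripping every asterisk is a blunt rule; a real corpus may need a more careful pattern.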

3. LLM-based correction using Mistral Large:

Input: cleaned OCR text.
Corrects OCR errors while preserving historical spelling.
Fixes misrecognitions and word splits.
Preserves archaic French forms (portoient, appelloit, …).
Output: corrected transcription.
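
One possible way to organize the correction step is shown below. It is only a sketch under stated assumptions: the prompt wording, the chunk size, and the names CORRECTION_INSTRUCTIONS, split_into_chunks, build_prompt and correct_text are all hypothetical. The model call itself is passed in as a plain function (complete), so no particular client API is assumed here.

```python
from typing import Callable, List

# Assumed instruction text; the actual prompt is not given in this document.
CORRECTION_INSTRUCTIONS = (
    "You correct OCR errors in 18th-century French text. "
    "Fix misrecognized characters and wrongly split or merged words. "
    "Do NOT modernize spelling: keep archaic forms such as 'portoient' "
    "and 'appelloit' exactly as they are. Return only the corrected text."
)


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of at most roughly max_chars characters."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) > max_chars and current:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str) -> str:
    """Wrap one chunk of cleaned OCR text in the correction instructions."""
    return f"{CORRECTION_INSTRUCTIONS}\n\nOCR text:\n{chunk}\n\nCorrected text:"


def correct_text(text: str, complete: Callable[[str], str]) -> str:
    """Correct text chunk by chunk; complete() sends a prompt to the LLM
    and returns its reply (supplied by the caller)."""
    corrected = [
        complete(build_prompt(chunk)).strip()
        for chunk in split_into_chunks(text)
    ]
    return "\n\n".join(corrected)
```

Injecting complete keeps the sketch testable without network access; in practice it would wrap a call to Mistral Large.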

4. Evaluation using CER and WER metrics:

Reference: gold standard text.
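CER (character error rate): character-level edit distance to the reference, divided by the reference length in characters.
WER (word error rate): word-level edit distance to the reference, divided by the number of reference words.
Comparison: raw OCR vs gold and corrected vs gold, to measure the effect of the correction step.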
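
As an illustration, CER and WER can be computed with a plain Levenshtein distance, as in the sketch below. The function names and the whitespace tokenization for WER are assumptions; the project may use a dedicated library or a different tokenization instead.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, using whitespace tokenization (assumed)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Made-up strings for illustration only, not project data.
    gold = "ils portoient des souliers"
    raw_ocr = "ils portoicnt dcs souliers"
    corrected = "ils portoient des souliers"
    print("OCR vs gold:      ", cer(gold, raw_ocr), wer(gold, raw_ocr))
    print("corrected vs gold:", cer(gold, corrected), wer(gold, corrected))
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so they are error rates rather than percentages bounded by 100.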