Introduction
============

This project evaluates the performance of an OCR system applied to historical French documents from the 18th century (Dictionnaire de Trévoux, Tome 5).

The main challenges include:

- **Old orthography** (e.g., "portoient", "appelloit").
- **Non-standard punctuation and spelling**.
- **Degraded scan quality**.
- **Historical vocabulary**.

OCR Pipeline Description
------------------------

We propose a pipeline combining four steps:

.. figure:: /Documentation/Images/New_pipeline.png
   :width: 100%
   :align: center
   :alt: Alternative text for the image
   :name: New_pipeline

**1. Mistral OCR for raw text extraction:**

- **Input**: scanned historical document images.
- **Output**: raw OCR transcription.
- Preserves basic layout (lines, spacing when available).
- Handles complex structures (columns, footnotes, dictionary entries).

**2. Cleaning and normalization:**

- Removes OCR noise (artifacts, markdown, broken characters).
- Standardizes whitespace and line breaks.
- Applies Unicode normalization and handles special characters (e.g., ſ → s, æ → ae).
- **Output**: cleaned text ready for correction.

**3. LLM-based correction using Mistral Large:**

- **Input**: cleaned OCR text.
- Corrects OCR errors while preserving historical spelling.
- Fixes misrecognitions and word splits.
- Preserves archaic French forms (portoient, appelloit, ...).
- **Output**: corrected transcription.

**4. Evaluation using CER and WER metrics:**

- **Reference**: gold standard text.
- **CER**: character-level error measurement.
- **WER**: word-level error measurement.
- **Comparison**: raw OCR vs. gold standard, and corrected text vs. gold standard.
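
The evaluation in step 4 can be illustrated with a minimal sketch. It assumes the common definitions of the two metrics: the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. The project's actual evaluation code is not shown here, so the function names ``edit_distance``, ``cer`` and ``wer`` and the sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions,
    deletions and substitutions turning ``ref`` into ``hyp``."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty sequence
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed definition: edits / reference characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (assumed definition: edits / reference words)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the corpus.
    gold = "ils portoient des habits"
    raw_ocr = "ils portoicnt des hab its"
    corrected = "ils portoient des habits"

    # The two comparisons listed in step 4.
    for label, text in [("raw OCR", raw_ocr), ("corrected", corrected)]:
        print(f"{label}: CER={cer(gold, text):.3f} WER={wer(gold, text):.3f}")
```

Both metrics are computed against the same gold standard, so the difference between the two printed lines indicates how much the correction step changes the error rate. Because archaic forms such as "portoient" are part of the gold text, a correction that modernizes them counts as an error under this scheme.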