API Reference

This page documents all functions in the pipeline script MistralOCR.py.

run_mistral_ocr(image_path)

def run_mistral_ocr(image_path: str) -> str

Sends a PNG image to the Mistral OCR API and returns the raw Markdown transcription.

Parameters:

  • image_path (str) — absolute or relative path to the PNG image file

Returns:

  • (str) — raw Markdown text concatenated from all pages returned by the API

Raises:

  • ValueError — if the API response does not contain a pages field

Example:

ocr_raw = run_mistral_ocr("page_1905.png")

clean_markdown(text)

def clean_markdown(text: str) -> str

Removes all Markdown markup from a string and returns plain continuous text on a single line.

Specifically removes: embedded images (![](...)), heading markers (#), bold and italic markers (*, _), HTML entities (&, <, >), inline links, and inline code. All newlines are replaced with spaces.

Parameters:

  • text (str) — raw Markdown string from the OCR stage

Returns:

  • (str) — cleaned plain text, single line, no extra whitespace

Example:

ocr_clean = clean_markdown(ocr_raw)

run_mistral_correction(ocr_text)

def run_mistral_correction(ocr_text: str) -> str

Sends the cleaned OCR text to Mistral Large (mistral-large-latest, temperature 0) for post-correction. The model fixes OCR errors while preserving archaic 18th-century French orthography.

Corrections applied:

  • OCR character substitutions (VestigalVectigal)

  • Semantic word substitutions (entre deux foiresentre deux soleils)

  • Erroneous modernisation of archaic spelling (portaientportoient)

  • Corrupted proper nouns (BouleonsBoileau, GeniliGeneroso)

  • Wrong dates (15871387, 16881388)

  • Words split across line breaks

Parameters:

  • ocr_text (str) — cleaned OCR text from clean_markdown()

Returns:

  • (str) — corrected plain text, single continuous line, no newlines

Raises:

  • ValueError — if the API response does not contain a choices field

Example:

corrected = run_mistral_correction(ocr_clean)

normalize(text)

def normalize(text: str) -> str

Normalises a text string before computing evaluation metrics. Applied identically to both the gold standard and the predictions to ensure fair comparison.

Operations performed:

  • Unicode NFKD normalisation

  • Lowercase conversion

  • Long ſs, œoe, æae

  • Removal of punctuation (.,;:!?()[]"'&)

  • Collapsing of multiple spaces into one

Parameters:

  • text (str) — raw or corrected text string

Returns:

  • (str) — normalised lowercase string, single line

Example:

gold_norm = normalize(gold_raw)
ocr_norm  = normalize(ocr_clean)
corr_norm = normalize(corrected)

cer(gold, pred)

def cer(gold: str, pred: str) -> float

Computes the Character Error Rate between a gold string and a predicted string using Levenshtein edit distance at character level.

\[\text{CER} = \frac{\text{editdistance}_{char}(gold, pred)}{|gold|}\]

Parameters:

  • gold (str) — normalised gold standard string

  • pred (str) — normalised prediction string

Returns:

  • (float) — CER value between 0.0 (perfect) and 1.0+ (very poor)

Example:

cer_ocr  = cer(gold_norm, ocr_norm)
cer_corr = cer(gold_norm, corr_norm)
print(f"CER : {cer_corr:.4f}")

wer(gold, pred)

def wer(gold: str, pred: str) -> float

Computes the Word Error Rate between a gold string and a predicted string using Levenshtein edit distance at word (token) level.

\[\text{WER} = \frac{\text{editdistance}_{word}(gold.split(), pred.split())}{|gold.split()|}\]

Parameters:

  • gold (str) — normalised gold standard string

  • pred (str) — normalised prediction string

Returns:

  • (float) — WER value between 0.0 (perfect) and 1.0+ (very poor)

Example:

wer_ocr  = wer(gold_norm, ocr_norm)
wer_corr = wer(gold_norm, corr_norm)
print(f"WER : {wer_corr:.4f}")

levenshtein_chars(a, b)

def levenshtein_chars(a: str, b: str) -> int

Computes the Levenshtein edit distance between two strings at character level. Used internally by cer().

Parameters:

  • a (str) — first string

  • b (str) — second string

Returns:

  • (int) — minimum number of single-character edits (insertions, deletions, substitutions)

levenshtein_words(a, b)

def levenshtein_words(a: list, b: list) -> int

Computes the Levenshtein edit distance between two token lists. Used internally by wer().

Parameters:

  • a (list) — first token list (from str.split())

  • b (list) — second token list

Returns:

  • (int) — minimum number of word-level edits (insertions, deletions, substitutions)