API Reference

This page documents all functions in the pipeline script MistralOCR.py.

run_mistral_ocr(image_path)

def run_mistral_ocr(image_path: str) -> str

Sends a PNG image to the Mistral OCR API and returns the raw Markdown transcription.

Parameters:

image_path (str) — absolute or relative path to the PNG image file

Returns:

(str) — raw Markdown text concatenated from all pages returned by the API

Raises:

ValueError — if the API response does not contain a pages field

Example:

ocr_raw = run_mistral_ocr("page_1905.png")

—

clean_markdown(text)

def clean_markdown(text: str) -> str

Removes all Markdown markup from a string and returns plain continuous text on a single line.

Specifically removes: embedded images (![](...)), heading markers (#), bold and italic markers (*, _), HTML entities (&, <, >), inline links, and inline code. All newlines are replaced with spaces.

Parameters:

text (str) — raw Markdown string from the OCR stage

Returns:

(str) — cleaned plain text, single line, no extra whitespace

Example:

ocr_clean = clean_markdown(ocr_raw)

—

run_mistral_correction(ocr_text)

def run_mistral_correction(ocr_text: str) -> str

Sends the cleaned OCR text to Mistral Large (mistral-large-latest, temperature 0) for post-correction. The model fixes OCR errors while preserving archaic 18th-century French orthography.

Corrections applied:

OCR character substitutions (Vestigal → Vectigal)
Semantic word substitutions (entre deux foires → entre deux soleils)
Erroneous modernisation of archaic spelling (portaient → portoient)
Corrupted proper nouns (Bouleons → Boileau, Genili → Generoso)
Wrong dates (1587 → 1387, 1688 → 1388)
Words split across line breaks

Parameters:

ocr_text (str) — cleaned OCR text from clean_markdown()

Returns:

(str) — corrected plain text, single continuous line, no newlines

Raises:

ValueError — if the API response does not contain a choices field

Example:

corrected = run_mistral_correction(ocr_clean)

—

normalize(text)

def normalize(text: str) -> str

Normalises a text string before computing evaluation metrics. Applied identically to both the gold standard and the predictions to ensure fair comparison.

Operations performed:

Unicode NFKD normalisation
Lowercase conversion
Long ſ → s, œ → oe, æ → ae
Removal of punctuation (.,;:!?()[]"'&)
Collapsing of multiple spaces into one

Parameters:

text (str) — raw or corrected text string

Returns:

(str) — normalised lowercase string, single line

Example:

gold_norm = normalize(gold_raw)
ocr_norm  = normalize(ocr_clean)
corr_norm = normalize(corrected)

—

cer(gold, pred)

def cer(gold: str, pred: str) -> float

Computes the Character Error Rate between a gold string and a predicted string using Levenshtein edit distance at character level.

\[\text{CER} = \frac{\text{editdistance}_{char}(gold, pred)}{|gold|}\]

Parameters:

gold (str) — normalised gold standard string
pred (str) — normalised prediction string

Returns:

(float) — CER value between 0.0 (perfect) and 1.0+ (very poor)

Example:

cer_ocr  = cer(gold_norm, ocr_norm)
cer_corr = cer(gold_norm, corr_norm)
print(f"CER : {cer_corr:.4f}")

—

wer(gold, pred)

def wer(gold: str, pred: str) -> float

Computes the Word Error Rate between a gold string and a predicted string using Levenshtein edit distance at word (token) level.

\[\text{WER} = \frac{\text{editdistance}_{word}(gold.split(), pred.split())}{|gold.split()|}\]

Parameters:

gold (str) — normalised gold standard string
pred (str) — normalised prediction string

Returns:

(float) — WER value between 0.0 (perfect) and 1.0+ (very poor)

Example:

wer_ocr  = wer(gold_norm, ocr_norm)
wer_corr = wer(gold_norm, corr_norm)
print(f"WER : {wer_corr:.4f}")

—

levenshtein_chars(a, b)

def levenshtein_chars(a: str, b: str) -> int

Computes the Levenshtein edit distance between two strings at character level. Used internally by cer().

Parameters:

a (str) — first string
b (str) — second string

Returns:

(int) — minimum number of single-character edits (insertions, deletions, substitutions)

—

levenshtein_words(a, b)

def levenshtein_words(a: list, b: list) -> int

Computes the Levenshtein edit distance between two token lists. Used internally by wer().

Parameters:

a (list) — first token list (from str.split())
b (list) — second token list

Returns:

(int) — minimum number of word-level edits (insertions, deletions, substitutions)