API Reference
This page documents all functions in the pipeline script MistralOCR.py.
run_mistral_ocr(image_path)
def run_mistral_ocr(image_path: str) -> str
Sends a PNG image to the Mistral OCR API and returns the raw Markdown transcription.
Parameters:
image_path(str) — absolute or relative path to the PNG image file
Returns:
(str) — raw Markdown text concatenated from all pages returned by the API
Raises:
ValueError— if the API response does not contain apagesfield
Example:
ocr_raw = run_mistral_ocr("page_1905.png")
—
clean_markdown(text)
def clean_markdown(text: str) -> str
Removes all Markdown markup from a string and returns plain continuous text on a single line.
Specifically removes: embedded images (), heading markers (#),
bold and italic markers (*, _), HTML entities (&, <, >),
inline links, and inline code. All newlines are replaced with spaces.
Parameters:
text(str) — raw Markdown string from the OCR stage
Returns:
(str) — cleaned plain text, single line, no extra whitespace
Example:
ocr_clean = clean_markdown(ocr_raw)
—
run_mistral_correction(ocr_text)
def run_mistral_correction(ocr_text: str) -> str
Sends the cleaned OCR text to Mistral Large (mistral-large-latest, temperature 0)
for post-correction. The model fixes OCR errors while preserving archaic
18th-century French orthography.
Corrections applied:
OCR character substitutions (
Vestigal→Vectigal)Semantic word substitutions (
entre deux foires→entre deux soleils)Erroneous modernisation of archaic spelling (
portaient→portoient)Corrupted proper nouns (
Bouleons→Boileau,Genili→Generoso)Wrong dates (
1587→1387,1688→1388)Words split across line breaks
Parameters:
ocr_text(str) — cleaned OCR text fromclean_markdown()
Returns:
(str) — corrected plain text, single continuous line, no newlines
Raises:
ValueError— if the API response does not contain achoicesfield
Example:
corrected = run_mistral_correction(ocr_clean)
—
normalize(text)
def normalize(text: str) -> str
Normalises a text string before computing evaluation metrics. Applied identically to both the gold standard and the predictions to ensure fair comparison.
Operations performed:
Unicode NFKD normalisation
Lowercase conversion
Long
ſ→s,œ→oe,æ→aeRemoval of punctuation (
.,;:!?()[]"'&)Collapsing of multiple spaces into one
Parameters:
text(str) — raw or corrected text string
Returns:
(str) — normalised lowercase string, single line
Example:
gold_norm = normalize(gold_raw)
ocr_norm = normalize(ocr_clean)
corr_norm = normalize(corrected)
—
cer(gold, pred)
def cer(gold: str, pred: str) -> float
Computes the Character Error Rate between a gold string and a predicted string using Levenshtein edit distance at character level.
Parameters:
gold(str) — normalised gold standard stringpred(str) — normalised prediction string
Returns:
(float) — CER value between 0.0 (perfect) and 1.0+ (very poor)
Example:
cer_ocr = cer(gold_norm, ocr_norm)
cer_corr = cer(gold_norm, corr_norm)
print(f"CER : {cer_corr:.4f}")
—
wer(gold, pred)
def wer(gold: str, pred: str) -> float
Computes the Word Error Rate between a gold string and a predicted string using Levenshtein edit distance at word (token) level.
Parameters:
gold(str) — normalised gold standard stringpred(str) — normalised prediction string
Returns:
(float) — WER value between 0.0 (perfect) and 1.0+ (very poor)
Example:
wer_ocr = wer(gold_norm, ocr_norm)
wer_corr = wer(gold_norm, corr_norm)
print(f"WER : {wer_corr:.4f}")
—
levenshtein_chars(a, b)
def levenshtein_chars(a: str, b: str) -> int
Computes the Levenshtein edit distance between two strings at character level.
Used internally by cer().
Parameters:
a(str) — first stringb(str) — second string
Returns:
(int) — minimum number of single-character edits (insertions, deletions, substitutions)
—
levenshtein_words(a, b)
def levenshtein_words(a: list, b: list) -> int
Computes the Levenshtein edit distance between two token lists.
Used internally by wer().
Parameters:
a(list) — first token list (fromstr.split())b(list) — second token list
Returns:
(int) — minimum number of word-level edits (insertions, deletions, substitutions)