Models
This project compares two OCR/vision models for the transcription stage, both followed by the same Mistral Large post-correction step.
Mistral OCR
Type: Dedicated OCR API (cloud)
Endpoint: https://api.mistral.ai/v1/ocr
Model ID: mistral-ocr-latest
Mistral OCR is a specialised optical character recognition model developed by Mistral AI. Unlike general vision-language models, it is optimised for document transcription and returns structured Markdown output with page-level segmentation.
Strengths:
High accuracy on printed text including historical fonts
Returns structured pages objects with per-page Markdown
Handles multi-column layouts better than general vision models
No prompt required: document is sent directly
Best CER/WER scores observed across all tested models
Limitations:
Returns Markdown markup (italics, bold, headers) that must be cleaned before evaluation
Cloud API: requires internet access and a valid Mistral API key
Paid service
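The Markdown cleanup mentioned in the limitations could look roughly like the sketch below. It covers only the three kinds of markup named above (headers, bold, italics); the function name strip_markdown and the regular expressions are illustrative assumptions, not the project's actual cleaning code.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove heading, bold and italic markup, keeping the transcribed text.

    Hypothetical helper: a minimal sketch, not a full Markdown parser.
    """
    # Drop heading markers at the start of a line ("## Title" -> "Title").
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap bold spans first (**x** or __x__) so they are not read as italics.
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Unwrap italic spans (*x* or _x_), leaving underscores inside words alone.
    text = re.sub(r"(?<!\w)([*_])(.+?)\1(?!\w)", r"\2", text)
    return text


if __name__ == "__main__":
    print(strip_markdown("# Tome V\n**Provence** et *Vectigal*"))
```

Cleaning before evaluation matters because every leftover asterisk or hash sign would otherwise count as a character error against the ground truth.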
Input format:
{
"model": "mistral-ocr-latest",
"document": {
"type": "image_url",
"image_url": "data:image/png;base64,<base64>"
}
}
Output format: either JSON with a pages list, each page containing a markdown field, or raw text.
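As a minimal sketch of how the request and response described above fit together, the following Python uses only the standard library. The function names, the PNG assumption, and the choice to join pages with a blank line are illustrative assumptions; the Bearer-token header is the usual Mistral API authentication scheme.

```python
import base64
import json
import urllib.request

OCR_URL = "https://api.mistral.ai/v1/ocr"


def build_ocr_payload(image_bytes: bytes) -> dict:
    """Wrap a PNG image in the request body shown above."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mistral-ocr-latest",
        "document": {
            "type": "image_url",
            "image_url": f"data:image/png;base64,{b64}",
        },
    }


def pages_to_text(response: dict) -> str:
    """Join the per-page markdown fields (blank-line separator is an assumption)."""
    return "\n\n".join(page["markdown"] for page in response.get("pages", []))


def transcribe(image_bytes: bytes, api_key: str) -> str:
    """Send one image to the OCR endpoint and return the page Markdown."""
    request = urllib.request.Request(
        OCR_URL,
        data=json.dumps(build_ocr_payload(image_bytes)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return pages_to_text(json.load(reply))
```

Calling transcribe(open("page.png", "rb").read(), api_key) would return the Markdown for a hypothetical page.png; no prompt is involved, which matches the "document is sent directly" point above.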
—
Qwen2.5-VL 7B (Ollama)
Type: Vision-language model (local, open-source)
Endpoint: http://localhost:11434/api/chat
Model ID: qwen2.5vl:7b
Qwen2.5-VL is an open-source vision-language model developed by the Qwen team at Alibaba Cloud. The 7B variant is run locally via Ollama and requires no API key or internet connection after the initial model download.
Strengths:
Fully local: no data leaves the machine, no API cost
Strong multilingual capability including French and Latin
Free and open-source (Apache 2.0 licence)
Correctly transcribes most running text
Limitations:
Requires a machine with sufficient VRAM (minimum 8 GB recommended)
Slower inference than cloud APIs on CPU-only setups
The post-correction step had no measurable effect on this model's output (ΔCER: 0.000, ΔWER: 0.000), which suggests that the LLM correction prompt needs to be adapted for Qwen outputs
Produces hallucinated artefacts on degraded page regions (e.g. "us de F Provence, paye quoral, po Les Enfantras")
Includes page headers and column markers in output (e.g. "Tome V. P E A.")
Setup:
# Install Ollama (https://ollama.com)
ollama pull qwen2.5vl:7b
ollama serve # starts the local API on port 11434
Input format:
{
"model": "qwen2.5vl:7b",
"messages": [
{
"role": "user",
"content": "<prompt>",
"images": ["<base64>"]
}
],
"stream": false
}
Output format: message.content string in the Ollama chat response.
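A minimal sketch of the local call, assuming Ollama is already serving on port 11434 as in the setup above. It passes the image through the images field of Ollama's native chat API, which takes bare base64 strings rather than data URLs; the function names and the prompt text are placeholders, not the project's own.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a non-streaming chat request carrying one image."""
    return {
        "model": "qwen2.5vl:7b",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # Bare base64, without the "data:image/png;base64," prefix.
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        "stream": False,  # serialised as JSON false by json.dumps
    }


def extract_transcription(response: dict) -> str:
    """Read the message.content string from the chat response."""
    return response["message"]["content"].strip()


def transcribe_locally(image_bytes: bytes, prompt: str) -> str:
    """Send one page image to the local Ollama server."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(image_bytes, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_transcription(json.load(reply))
```

Setting stream to false keeps the reply as a single JSON object, so message.content can be read directly instead of being reassembled from chunks.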
—
Post-Correction: Mistral Large
Type: LLM text correction (cloud)
Endpoint: https://api.mistral.ai/v1/chat/completions
Model ID: mistral-large-latest
Transcriptions pass through Mistral Large for post-correction. This step fixes:
OCR character substitutions (Vestigal → Vectigal)
Semantic errors (entre deux foires → entre deux soleils)
Erroneous modernisation of archaic spelling (portaient → portoient)
Corrupted proper nouns and dates
Words split across line breaks
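The correction targets listed above could be passed to the chat completions endpoint along the following lines. This is a sketch only: the wording of SYSTEM_PROMPT, the temperature setting, and the function names are assumptions, since the project's actual prompt is not shown here.

```python
import json
import urllib.request

CHAT_URL = "https://api.mistral.ai/v1/chat/completions"

# Hypothetical prompt: it restates the correction targets listed above.
SYSTEM_PROMPT = (
    "You correct OCR transcriptions of early printed texts. "
    "Fix character substitutions, corrupted proper nouns and dates, "
    "and rejoin words split across line breaks. "
    "Keep the archaic spelling of the source; do not modernise it. "
    "Return only the corrected text."
)


def build_correction_payload(transcription: str) -> dict:
    """Build a chat request asking the model to correct one transcription."""
    return {
        "model": "mistral-large-latest",
        "temperature": 0,  # assumption: low randomness suits correction tasks
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcription},
        ],
    }


def extract_correction(response: dict) -> str:
    """Read the corrected text from the first choice of the response."""
    return response["choices"][0]["message"]["content"].strip()


def post_correct(transcription: str, api_key: str) -> str:
    """Send one transcription through the post-correction step."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(build_correction_payload(transcription)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_correction(json.load(reply))
```

Because the same function can take the output of either transcription model, any model-specific adaptation would only need a different SYSTEM_PROMPT.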
Note
The post-correction step proved highly effective on Mistral OCR output (ΔWER up to +0.0422).
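For readers unfamiliar with the metrics, the sketch below shows one common way to compute CER, WER and their change after correction, using a plain Levenshtein distance. The sign convention (error before minus error after, so a positive value means improvement) is an assumption inferred from the figures quoted in this document, and the project may use an evaluation library instead.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(1, len(ref_words))


def delta(before: float, after: float) -> float:
    """Assumed convention: positive when post-correction lowers the error."""
    return before - after


if __name__ == "__main__":
    truth = "entre deux soleils"
    raw, corrected = "entre deux foires", "entre deux soleils"
    print(delta(wer(truth, raw), wer(truth, corrected)))
```

Under this convention a ΔWER of 0.000 means the corrected text scores exactly the same as the raw transcription against the ground truth.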
Model Comparison Summary
| Model | Type | Local / Cloud | API key needed | Cost |
|---|---|---|---|---|
| Mistral OCR | Dedicated OCR | Cloud | Yes (Mistral) | Paid |
| Qwen2.5-VL 7B | Vision LLM | Local | No | Free |
Kraken+Ciaconna
Kraken+Ciaconna has shown remarkable performance on Latin and polytonic Greek scripts.
Results are reported in: Matteo Romanello, Sven Najem-Meyer, and Bruce Robertson. 2021. Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs.