Models

This project compares two OCR/vision models for the transcription stage, both followed by the same Mistral Large post-correction step.

Mistral OCR

Type: Dedicated OCR API (cloud)

Endpoint: https://api.mistral.ai/v1/ocr

Model ID: mistral-ocr-latest

Mistral OCR is a specialised optical character recognition model developed by Mistral AI. Unlike general vision-language models, it is optimised for document transcription and returns structured Markdown output with page-level segmentation.

Strengths:

High accuracy on printed text including historical fonts
Returns structured pages objects with per-page Markdown
Handles multi-column layouts better than general vision models
No prompt required: the document is sent directly
Lowest CER/WER observed among the tested models

Limitations:

Returns Markdown markup (italics, bold, headers) that must be cleaned before evaluation
Cloud API: requires internet access and a valid Mistral API key
Paid service
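The Markdown clean-up mentioned above could look roughly like the sketch below. The helper name strip_markdown and the exact set of markup it removes (image references, headers, bold, italics) are assumptions for illustration, not the project's actual cleaning code.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common Markdown markup so OCR output can be compared to plain text.

    Hypothetical helper: the rules below are an assumed, minimal set.
    """
    # Drop image references of the form ![alt](target)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Drop leading header markers (#, ##, ...) but keep the header text
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap bold (** or __) first, then italics (* or _)
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    text = re.sub(r"(?<!\w)([*_])(.+?)\1(?!\w)", r"\2", text)
    return text.strip()
```

Running the cleaner on both models' output, not only on Mistral OCR, would keep the evaluation symmetric if Qwen ever emits Markdown as well.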

Input format:

{
  "model": "mistral-ocr-latest",
  "document": {
    "type": "image_url",
    "image_url": "data:image/png;base64,<base64>"
  }
}

Output format: JSON with a pages list, each page containing a markdown field, or raw text.
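As an illustration of the request and response shapes described above, here is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the API key is sent as a Bearer token and that every page object carries a markdown field.

```python
import base64
import json
import urllib.request

OCR_URL = "https://api.mistral.ai/v1/ocr"


def build_ocr_payload(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes in the JSON body shown under 'Input format'."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mistral-ocr-latest",
        "document": {"type": "image_url", "image_url": f"data:{mime};base64,{b64}"},
    }


def pages_to_markdown(response: dict) -> str:
    """Join the per-page markdown fields into one string."""
    return "\n\n".join(page["markdown"] for page in response.get("pages", []))


def run_ocr(image_bytes: bytes, api_key: str) -> str:
    """Send one image to the OCR endpoint (needs network access and a valid key)."""
    request = urllib.request.Request(
        OCR_URL,
        data=json.dumps(build_ocr_payload(image_bytes)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return pages_to_markdown(json.load(resp))
```

Keeping payload construction and response parsing separate from the network call makes both testable offline.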

—

Qwen2.5-VL 7B (Ollama)

Type: Vision-language model (local, open-source)

Endpoint: http://localhost:11434/api/chat

Model ID: qwen2.5vl:7b

Qwen2.5-VL is an open-source vision-language model developed by the Qwen team at Alibaba Cloud. The 7B variant is run locally via Ollama, requiring no API key or internet connection after the initial model download.

Strengths:

Fully local: no data leaves the machine, no API cost
Strong multilingual capability including French and Latin
Free and open-source (Apache 2.0 licence)
Correctly transcribes most running text

Limitations:

Requires a machine with sufficient VRAM (at least 8 GB recommended)
Slower inference than cloud APIs on CPU-only setups
The post-correction step had no effect on this model's output (ΔCER: 0.000, ΔWER: 0.000), suggesting the LLM correction prompt needs to be adapted for Qwen outputs
Produces hallucinated artefacts on degraded page regions (e.g. "us de F Provence, paye quoral, po Les Enfantras")
Includes page headers and column markers in output (e.g. "Tome V. P E A.")

Setup:

# Install Ollama (https://ollama.com)
ollama pull qwen2.5vl:7b
ollama serve   # starts the local API on port 11434

Input format:

{
  "model": "qwen2.5vl:7b",
  "messages": [
    {
      "role": "user",
      "content": "<prompt>",
      "images": ["<base64>"]
    }
  ],
  "stream": false
}

Output format: message.content string in the Ollama chat response.
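A minimal sketch of one way to call the local endpoint from Python with the standard library only. It assumes Ollama's native chat format, in which content is a plain string and images is a list of base64 strings without a data: prefix. The function names are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a non-streaming chat request carrying one image."""
    return {
        "model": "qwen2.5vl:7b",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # Assumed native Ollama format: raw base64, no data-URI prefix
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        "stream": False,  # serialised as JSON false by json.dumps
    }


def extract_text(response: dict) -> str:
    """Return the message.content string from a chat response."""
    return response["message"]["content"]


def transcribe(image_bytes: bytes, prompt: str) -> str:
    """Send the request to a running `ollama serve` instance."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(image_bytes, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_text(json.load(resp))
```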

—

Post-Correction: Mistral Large

Type: LLM text correction (cloud)

Endpoint: https://api.mistral.ai/v1/chat/completions

Model ID: mistral-large-latest

Transcriptions pass through Mistral Large for post-correction. This step fixes:

OCR character substitutions (Vestigal → Vectigal)
Semantic errors (entre deux foires → entre deux soleils)
Erroneous modernisation of archaic spelling (portaient → portoient)
Corrupted proper nouns and dates
Words split across line breaks
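The correction call could be sketched as follows. The system prompt is a hypothetical placeholder, not the prompt used in this project, and temperature 0 is an assumed setting chosen here to keep corrections as repeatable as possible.

```python
import json
import urllib.request

CHAT_URL = "https://api.mistral.ai/v1/chat/completions"

# Hypothetical prompt for illustration only; the project's real prompt is not shown here.
SYSTEM_PROMPT = (
    "You correct OCR transcriptions of early modern French and Latin text. "
    "Fix character substitutions, corrupted names and dates, and words split "
    "across line breaks. Preserve archaic spelling; do not modernise."
)


def build_correction_payload(transcription: str) -> dict:
    """Build a chat-completions request asking for a corrected transcription."""
    return {
        "model": "mistral-large-latest",
        "temperature": 0,  # assumption, not a documented project setting
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcription},
        ],
    }


def extract_correction(response: dict) -> str:
    """Return the text of the first choice in a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def correct(transcription: str, api_key: str) -> str:
    """Send the transcription for correction (needs network access and a valid key)."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(build_correction_payload(transcription)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_correction(json.load(resp))
```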

Note

The post-correction step proved highly effective on Mistral OCR output (ΔWER up to +0.0422).
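For reference, CER and WER are edit-distance error rates, and a Δ value compares the rate before and after correction. The sketch below is one common way to compute them. The sign convention (positive Δ means the error rate dropped) is an assumption, and the project's own evaluation code may normalise text differently.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def delta(before: float, after: float) -> float:
    """Assumed convention: positive when post-correction lowers the error rate."""
    return before - after
```

With the example pair from the list above, wer("entre deux soleils", "entre deux foires") counts one substituted word out of three.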

Model Comparison Summary

Model	Type	Local / Cloud	API key needed	Cost
Mistral OCR	Dedicated OCR	Cloud	Yes (Mistral)	Paid
Qwen2.5-VL 7B	Vision LLM	Local	No	Free

Kraken+Ciaconna

Kraken+Ciaconna has been reported to perform strongly on Latin and polytonic Greek scripts.

Source: Matteo Romanello, Sven Najem-Meyer, and Bruce Robertson. 2021. Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs.