Models

This project compares two OCR/vision models for the transcription stage, both followed by the same Mistral Large post-correction step.

Mistral OCR

Type: Dedicated OCR API (cloud)

Endpoint: https://api.mistral.ai/v1/ocr

Model ID: mistral-ocr-latest

Mistral OCR is a specialised optical character recognition model developed by Mistral AI. Unlike general vision-language models, it is optimised for document transcription and returns structured Markdown output with page-level segmentation.

Strengths:

  • High accuracy on printed text including historical fonts

  • Returns a structured pages list with per-page Markdown

  • Handles multi-column layouts better than general vision models

  • No prompt required: document is sent directly

  • Best CER/WER scores observed across all tested models

Limitations:

  • Returns Markdown markup (italics, bold, headers) that must be cleaned before evaluation

  • Cloud API: requires internet access and a valid Mistral API key

  • Paid service
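The Markdown clean-up mentioned in the limitations could look roughly like the sketch below. It is only an illustration: the function name strip_markdown and the set of patterns handled (images, links, headers, bold, italics) are assumptions, not the project's actual cleaning code.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common Markdown markup so only the transcribed text remains."""
    # Drop image references entirely: ![alt](target)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Keep link text, drop the target: [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove header markers at the start of a line: "## Title" -> "Title"
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap bold first (**x** or __x__), then italics (*x* or _x_)
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    text = re.sub(r"(?<!\w)([*_])(.+?)\1(?!\w)", r"\2", text)
    # Collapse runs of spaces or tabs left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(strip_markdown("# Tome V\n\n**Vectigal** est *un* mot"))
```

Line breaks are left untouched here, so any normalisation of whitespace across lines before computing CER/WER would be a separate step.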

Input format:

{
  "model": "mistral-ocr-latest",
  "document": {
    "type": "image_url",
    "image_url": "data:image/png;base64,<base64>"
  }
}

Output format: JSON with a pages list, each page containing a markdown field, or raw text.
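As an illustration of the request and response shapes described above, here is a minimal Python sketch using only the standard library. The helper names (build_ocr_payload, pages_to_text, run_ocr) and the MISTRAL_API_KEY environment variable are assumptions made for this example, not part of the project.

```python
import base64
import json
import os
import urllib.request

OCR_URL = "https://api.mistral.ai/v1/ocr"


def build_ocr_payload(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes in the data-URL payload shown above."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mistral-ocr-latest",
        "document": {
            "type": "image_url",
            "image_url": f"data:{mime};base64,{b64}",
        },
    }


def pages_to_text(response: dict) -> str:
    """Join the markdown field of every page in the response."""
    pages = response.get("pages", [])
    return "\n\n".join(page.get("markdown", "") for page in pages)


def run_ocr(image_path: str) -> str:
    """Send one image to the OCR endpoint (needs network and an API key)."""
    with open(image_path, "rb") as fh:
        payload = build_ocr_payload(fh.read())
    request = urllib.request.Request(
        OCR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the key is stored in MISTRAL_API_KEY
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return pages_to_text(json.load(resp))
```

Keeping payload construction and response parsing separate from the network call makes both easy to check offline.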

Qwen2.5-VL 7B (Ollama)

Type: Vision-language model (local, open-source)

Endpoint: http://localhost:11434/api/chat

Model ID: qwen2.5vl:7b

Qwen2.5-VL is an open-source vision-language model developed by the Qwen team at Alibaba Cloud. The 7B variant is run locally via Ollama and requires no API key or internet connection after the initial model download.

Strengths:

  • Fully local: no data leaves the machine, no API cost

  • Strong multilingual capability including French and Latin

  • Free and open-source (Apache 2.0 licence)

  • Correctly transcribes most running text

Limitations:

  • Requires a machine with sufficient VRAM (minimum 8 GB recommended)

  • Slower inference than cloud APIs on CPU-only setups

  • The post-correction step had no effect on this model's output (ΔCER: 0.000, ΔWER: 0.000), suggesting that the LLM correction prompt needs to be adapted for Qwen outputs

  • Produces hallucinated artefacts on degraded page regions (e.g. "us de F Provence, paye quoral, po Les Enfantras")

  • Includes page headers and column markers in output (e.g. "Tome V. P E A.")

Setup:

# Install Ollama (https://ollama.com)
ollama serve   # starts the local API on port 11434 (skip if Ollama is already running as a service)
ollama pull qwen2.5vl:7b   # run in a second terminal; downloads the model

Input format:

{
  "model": "qwen2.5vl:7b",
  "messages": [
    {
      "role": "user",
      "content": "<prompt>",
      "images": ["<base64>"]
    }
  ],
  "stream": false
}

Output format: message.content string in the Ollama chat response.
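The following Python sketch shows one way to call the local endpoint and read message.content. It assumes Ollama's native /api/chat format, in which the prompt is a plain string in content and images are passed as a list of bare base64 strings in an images field. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a non-streaming chat request with one attached image."""
    return {
        "model": "qwen2.5vl:7b",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # Assumption: native Ollama format, base64 without a data: prefix
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        # Python False is serialised to JSON false by json.dumps
        "stream": False,
    }


def extract_transcription(response: dict) -> str:
    """Return the message.content string from an Ollama chat response."""
    return response["message"]["content"].strip()


def transcribe(image_path: str, prompt: str) -> str:
    """Send one page image to the local model (needs `ollama serve` running)."""
    with open(image_path, "rb") as fh:
        payload = build_chat_payload(fh.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        return extract_transcription(json.load(resp))
```

The generous timeout is a guess meant to accommodate slow CPU-only inference; adjust it to the hardware in use.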

Post-Correction: Mistral Large

Type: LLM text correction (cloud)

Endpoint: https://api.mistral.ai/v1/chat/completions

Model ID: mistral-large-latest

Transcriptions pass through Mistral Large for post-correction. This step fixes:

  • OCR character substitutions (Vestigal → Vectigal)

  • Semantic errors (entre deux foires → entre deux soleils)

  • Erroneous modernisation of archaic spelling (portaient → portoient)

  • Corrupted proper nouns and dates

  • Words split across line breaks
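A correction call along these lines can be sketched in Python as follows. The system prompt is a hypothetical one written for this example (the project's actual prompt is not shown here), and the helper names, the MISTRAL_API_KEY variable, and the temperature of 0 are likewise assumptions.

```python
import json
import os
import urllib.request

CHAT_URL = "https://api.mistral.ai/v1/chat/completions"

# Hypothetical prompt, written for this sketch only.
SYSTEM_PROMPT = (
    "You correct OCR transcriptions of historical printed texts. "
    "Fix character-level OCR errors, rejoin words split across line breaks, "
    "and keep archaic spellings exactly as printed. "
    "Return only the corrected text."
)


def build_correction_payload(transcription: str) -> dict:
    """Build a chat-completions request asking for a corrected transcription."""
    return {
        "model": "mistral-large-latest",
        "temperature": 0,  # assumption: low-variance output is preferred
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcription},
        ],
    }


def extract_correction(response: dict) -> str:
    """Return the text of the first choice in a chat-completions response."""
    return response["choices"][0]["message"]["content"].strip()


def correct(transcription: str) -> str:
    """Send a transcription for post-correction (needs network and an API key)."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(build_correction_payload(transcription)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_correction(json.load(resp))
```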

Note

The post-correction step proved highly effective on Mistral OCR output (ΔWER up to +0.0422).
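For readers unfamiliar with the metrics, the sketch below shows how CER and WER are commonly computed from the Levenshtein edit distance. The sign convention for Δ (error rate before correction minus error rate after, so a positive value means improvement) is an assumption inferred from the figures quoted on this page, and the function names are illustrative.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def delta(before: float, after: float) -> float:
    """Assumed convention: positive means the correction reduced the error."""
    return before - after


if __name__ == "__main__":
    truth = "entre deux soleils"
    raw, corrected = "entre deux foires", "entre deux soleils"
    print(delta(wer(truth, raw), wer(truth, corrected)))
```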

Model Comparison Summary

| Model | Type | Local / Cloud | API key needed | Cost |
|---|---|---|---|---|
| Mistral OCR | Dedicated OCR | Cloud | Yes (Mistral) | Paid |
| Qwen2.5-VL 7B | Vision LLM | Local | No | Free |

Kraken+Ciaconna

Kraken+Ciaconna has shown remarkable performance on Latin and polytonic Greek scripts.

Figure: results from Matteo Romanello, Sven Najem-Meyer, and Bruce Robertson. 2021. Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs.