
Lessons learned building a medical results fax processing system PoC, replacing OCR with a vision-LLM and Markdown conversion
Background
Surprisingly, most healthcare organizations still depend on legacy technology. Fax machines, pagers, and telephones remain in heavy use.
Many external lab results (pathology, radiology, genetic testing, etc.) still arrive via fax rather than through an API. Services like eFax can receive these faxes and deliver them to email inboxes or file shares, where staff manually review them and route them to the correct patient record in the EHR (Electronic Health Record) system.
I have recently been working on a proof of concept (PoC) for an automated intake pipeline to process these faxes and extract the data into a structured format. The output can be stored in a database and paired with a UI application where clinic staff can monitor and view incoming results.
This article covers what I learned about extracting information from PDFs during the project and describes the PoC as a potential stepping stone toward a more fully automated medical document intake system. This includes integrating the system with an AI assistant.
If you would rather skip the details of extracting text from PDFs and are only interested in the PoC system that was created, skip ahead to the “Medical Results Fax Document Processing PoC” section.
The Challenge
The first surprise was realizing that extracting information from electronic documents still is not a solved problem. OCR has been around for a long time, so I assumed I could drop in a solution that would work well enough. I was wrong.
After a few quick attempts to extract data from the medical results documents, I learned that they were not text-based PDFs where the text can be extracted programmatically. Although they arrived as PDFs, each file was essentially a wrapper around embedded images of scanned pages, so direct text extraction did not work. No problem, I thought. I would just use OCR to extract the information.
The first iteration of the PoC used OCR to extract text before converting it into structured data. It technically worked, but the accuracy was not good enough. It missed data elements, jumbled the text, and produced too many character recognition errors to be trusted.
The Solution
The breakthrough came from an unexpected pivot: removing OCR entirely.
Instead of OCR, we switched to a vision-capable LLM that can interpret each page and convert it into Markdown. This preserves layout and formatting, including tables, as an intermediate step before data extraction.
Removing OCR and adding the Markdown conversion step dramatically improved extraction accuracy. The tradeoff is higher compute requirements if you run it locally for data privacy, and higher costs if you run it in the cloud. For an accuracy-first workflow like medical document processing, that is an acceptable compromise.
In the next sections, I will share what I learned about PDFs and extracting data from them, along with an overview of the PoC system we built.
PDF Document Basics
There are three basic types of PDFs that a document processing system needs to handle: text-only PDFs, where all content can be extracted directly; hybrid PDFs, where information is contained in a mix of text and images; and image-only PDFs, where all content is contained in images. The figure below shows these types and their extraction characteristics.

The following table shows the text extraction methods suited to each PDF type and the characteristics of each method:

Direct Text Extraction
For simple, text-only PDFs, extracting the content is straightforward and there is little chance of errors because the text is copied directly. The downside is that this approach works only for text-only PDFs, and formatting such as layout and tables is lost. As a result, the output can become a blob of text, and content may be misordered when columns or tables are present. The example below shows direct text extraction using the PyPDF package in Python:

Full code for extracting text from PDFs using this method can be found below (I’m assuming you are using “uv” for your Python environments):
"""
Extract text from a PDF and print to stdout.
Usage:
uv run example_basic_text_extraction.py <pdf_file_path>
Dependencies:
- pypdf (install with: uv pip install pypdf)
"""
import sys
from pathlib import Path
import pypdf
def main():
if len(sys.argv) < 2:
print("Usage: uv run example_basic_text_extraction.py <pdf_file_path>")
sys.exit(1)
pdf_path = Path(sys.argv[1])
# Open and read the PDF
with open(pdf_path, 'rb') as file:
pdf_reader = pypdf.PdfReader(file)
# Extract text from each page
for page in pdf_reader.pages:
text = page.extract_text()
print(text)
if __name__ == "__main__":
main()
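Because direct extraction only works when a page actually has an embedded text layer, it can be useful to check what kind of PDF you are dealing with before picking an extraction method. The rough heuristic below is a sketch only (the 50-character threshold is arbitrary); it classifies a PDF as text-based, hybrid, or image-only based on how many pages yield meaningful text:

"""
Rough heuristic: classify a PDF as text-based, hybrid, or image-only.
Sketch only; the character threshold is arbitrary.

Usage:
    uv run example_classify_pdf.py <pdf_file_path>

Dependencies:
    - pypdf (install with: uv pip install pypdf)
"""
import sys

import pypdf


def classify_pdf(pdf_path, min_chars=50):
    reader = pypdf.PdfReader(pdf_path)
    # Count pages that yield a meaningful amount of directly extractable text
    pages_with_text = sum(
        1 for page in reader.pages
        if len((page.extract_text() or "").strip()) >= min_chars
    )
    if pages_with_text == len(reader.pages):
        return "text-based"
    if pages_with_text == 0:
        return "image-only"
    return "hybrid"


if __name__ == "__main__":
    print(classify_pdf(sys.argv[1]))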
OCR Extraction
The next approach we will examine is Optical Character Recognition (OCR). For hybrid PDFs (a mix of text and images) and image-only PDFs, OCR can be used to extract the text content. This approach shares some of the same limitations as direct text extraction. Formatting such as layout and tables is lost, and text can be misordered, but OCR also introduces character recognition errors, especially when images or scans are low quality. In this example, we will use the Tesseract and pdf2image Python packages.
Note: this requires installing the “tesseract” and “poppler” utilities on your operating system in addition to the Python packages:
- macOS: “brew install tesseract poppler”
- Linux (Ubuntu): “sudo apt-get install tesseract-ocr poppler-utils”
- Windows: Tesseract: download from UB Mannheim; Poppler: download from GitHub

Full code for extracting text from PDFs using this method can be found below:
"""
Extract text from a PDF using OCR and print to stdout.
Usage:
uv run example_ocr_extraction.py <pdf_file_path>
Dependencies:
- pytesseract (install with: uv pip install pytesseract)
- pdf2image (install with: uv pip install pdf2image)
System Requirements:
- Tesseract OCR: "brew install tesseract" (macOS), "sudo apt-get install tesseract-ocr" (Ubuntu)
- poppler-utils: "brew install poppler" (macOS), "sudo apt-get install poppler-utils" (Ubuntu)
"""
import sys
from pathlib import Path
import pytesseract
from pdf2image import convert_from_path
def main():
if len(sys.argv) < 2:
print("Usage: uv run example_ocr_extraction.py <pdf_file_path>")
sys.exit(1)
pdf_path = Path(sys.argv[1])
# Convert PDF to images
images = convert_from_path(pdf_path, dpi=300)
# Perform OCR on each page
for image in images:
text = pytesseract.image_to_string(image)
print(text)
if __name__ == "__main__":
main()sdf
Vision LLM Markdown Conversion and Extraction
As I noted in the “The Challenge” and “The Solution” sections earlier in this document, many of the files I needed to process were poor-quality scans, and OCR was not accurate enough. It produced too many recognition errors and too much jumbled text to reliably extract the required data. I estimate I was able to correctly extract only about 80% of the required data.
I had worked with LLM-powered pipelines on several past projects, including an image classification pipeline using a vision-capable (multimodal) LLM (LLaVA). I decided to try the best open-source vision model available at the time. After some quick testing, it was much slower than OCR, but the accuracy was dramatically higher.
The next major breakthrough came when I added a Markdown conversion step before structured data extraction. Below is an example of the PDF-to-Markdown process using the standalone Doc2MD utility I created during this project. Most importantly for medical results documents, it preserves tables, which makes extracting the relevant data much easier.

I have open-sourced the vision-LLM-powered PDF/image-to-Markdown utility (Doc2MD) that I created as part of this project; it can be found here: https://github.com/robert-mcdermott/doc2md
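To give a sense of how this kind of vision-LLM conversion works under the hood, here is a minimal sketch of the approach. It is not Doc2MD’s actual implementation: it assumes a locally hosted vision-capable model served via Ollama, and the model name and prompt are placeholders. Each page is rendered to an image with pdf2image and sent to the model for Markdown conversion:

"""
Minimal sketch: convert a scanned PDF to Markdown with a vision LLM.
Not Doc2MD's actual implementation; assumes a local Ollama server with a
vision-capable model pulled (the model name below is a placeholder).

Usage:
    uv run example_vision_llm_markdown.py <pdf_file_path>

Dependencies:
    - ollama (install with: uv pip install ollama)
    - pdf2image (install with: uv pip install pdf2image)
"""
import io
import sys

import ollama
from pdf2image import convert_from_path

PROMPT = (
    "Convert this page to clean Markdown. Preserve headings and tables. "
    "Do not add any content that is not visible on the page."
)


def page_to_markdown(image, model="llava"):
    # Encode the rendered page as PNG bytes for the vision model
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [buf.getvalue()]}],
    )
    return response["message"]["content"]


def main():
    pdf_path = sys.argv[1]
    pages = convert_from_path(pdf_path, dpi=300)
    for page in pages:
        print(page_to_markdown(page))


if __name__ == "__main__":
    main()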
Comparing OCR with the Vision-LLM-to-Markdown Approach
OCR vs. poor-quality, complex scanned documents
Example 1: With OCR, there are many character recognition errors, and some scanning artifacts, such as the edge of the page, are incorrectly detected as characters. It would still be possible to extract some data from this text, but there would be too many errors to trust it.

Example 2: This document is even more complex, and the output contains too many errors and too much jumbled text to be of any use:

Vision-LLM with Markdown Conversion vs. poor-quality, complex scanned documents
Example 1: Compared to the OCR output for this document, the conversion is very good. One thing to be aware of is that the LLM might go beyond extraction and add content to fill in gaps or attempt to “help”. In this example, it added the helpful information indicated below, which did not exist in the original document:

Example 2: This document was selected as a worst-case example. There are no horizontal lines in the table, some rows are nested, and the scan is distorted. While the output is not perfect, the model still did an impressive job of recreating it in Markdown. Fortunately, the medical results documents I need to process are not this messed up:

It’s clear that a vision-LLM, even a small open model running locally, is vastly superior to OCR for accurately extracting information from scanned documents.
Document to Markdown Conversion Tools
Because I had previous experience building LLM-powered data pipelines and working with vision-capable LLMs, I started building my own solution before taking the time to see what already existed. After creating Doc2MD, I found similar existing tools, and a few additional solutions were released afterward.
In this section, I will share information on Doc2MD and two other systems I found later: DeepSeek-OCR and Docling.
Doc2MD
Doc2MD is a standalone utility I created and open-sourced as part of this PoC project. It serves as a public example of a vision-LLM-powered document-to-Markdown approach:

DeepSeek-OCR
The “OCR” in DeepSeek-OCR’s name is a bit of a misnomer. It does not use a traditional OCR approach like Tesseract. Instead, it uses a vision-capable LLM with impressive extraction capabilities:

DeepSeek-OCR can produce different output types depending on the system prompt. It can output:
- A high-level description of the document
- A Markdown version of the document’s content
- Text from the document, with coordinates indicating where each piece of text appears in the document
- An image of the original document with color-coded bounding boxes over each block of detected/extracted text.

Here’s a close-up of the unique bounding-box overlay with coordinates that DeepSeek-OCR can produce:

Docling
Docling is another powerful document text extraction utility with a wide range of options and functionality. I was able to get outputs similar to those I produced with DeepSeek-OCR. Using Docling, I generated the following output formats (a minimal usage sketch follows the list):
- Doctags with coordinates
- HTML
- Markdown
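For reference, Docling’s basic conversion flow looks roughly like the sketch below (based on its documented quickstart; OCR engines, pipeline options, and other settings can be configured beyond these defaults):

# Minimal Docling sketch: convert a document and export it to Markdown.
# Based on Docling's documented quickstart, using default settings.
# Install with: uv pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("scanned_results.pdf")  # local path or URL

markdown = result.document.export_to_markdown()
print(markdown)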
DeepSeek-OCR is primarily a model, while Docling is a full-featured utility with a wide range of functionality. If I had known about Docling earlier, I might not have created Doc2MD and instead might have used Docling as part of our medical results pipeline PoC:

Medical Results Fax Document Processing PoC
The reason I started down this path, and learned everything covered in the first half of this article, was to build an external medical results fax processing pipeline, which is the focus of the rest of the article. As noted earlier, this pipeline initially used OCR to extract information from incoming faxes, but it was not accurate enough.
Note: all of the patient information (PHI) shown in the examples below is fake; this article contains no sensitive information.
High-Level Concept
The image below shows the high-level concept for the vision-LLM-powered pipeline. In short, the document is converted into a stack of images (one image per page). Those images are then passed to a vision-capable LLM, which converts the document into Markdown. The Markdown is then fed into a text-only LLM with task-specific extraction instructions, which returns a JSON document that is stored in a database:

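To make that flow concrete, here is a minimal sketch of the two LLM stages chained together. It is not the PoC’s actual code: the model names, prompts, and storage step are placeholders, and it assumes locally hosted models served via Ollama:

"""
Sketch of the two-stage pipeline: pages -> Markdown (vision LLM) -> JSON (text LLM).
Not the PoC's actual code; model names, prompts, and storage are placeholders.
"""
import io
import json

import ollama
from pdf2image import convert_from_path

EXTRACTION_PROMPT = (
    "You are given a medical results document converted to Markdown. "
    "Return a JSON object with the fields defined in the extraction schema. "
    "Use null for anything not present; do not invent values.\n\n{markdown}"
)


def page_to_markdown(image):
    # Stage 1: a vision model converts a rendered page image to Markdown
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    response = ollama.chat(
        model="llava",  # placeholder vision model
        messages=[{"role": "user",
                   "content": "Convert this page to Markdown, preserving tables.",
                   "images": [buf.getvalue()]}],
    )
    return response["message"]["content"]


def fax_pdf_to_record(pdf_path):
    # Render the fax PDF to one image per page
    pages = convert_from_path(pdf_path, dpi=300)
    markdown = "\n\n".join(page_to_markdown(page) for page in pages)

    # Stage 2: a text-only model extracts structured data from the Markdown
    response = ollama.chat(
        model="gpt-oss:20b",  # placeholder text model
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(markdown=markdown)}],
        format="json",  # request JSON-formatted output
    )
    return json.loads(response["message"]["content"])

# The resulting dict would then be validated against the schema and stored in the database.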
Data Schema
After working with subject matter experts to define what data should be captured from external medical results documents, we created the following schema for the data extraction phase of the pipeline:

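For illustration only, a schema like this can be expressed as Pydantic models so that the LLM’s JSON output can be validated before it is stored. The field names below are hypothetical examples, not the PoC’s actual schema (which is shown above):

# Hypothetical illustration only; the real PoC schema is the one shown above.
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field


class ResultItem(BaseModel):
    test_name: str
    value: Optional[str] = None
    units: Optional[str] = None
    reference_range: Optional[str] = None


class ExternalResultDocument(BaseModel):
    patient_name: Optional[str] = None
    date_of_birth: Optional[date] = None
    specimen_id: Optional[str] = None
    collection_date: Optional[date] = None
    ordering_provider: Optional[str] = None
    performing_lab: Optional[str] = None
    results: list[ResultItem] = Field(default_factory=list)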
Diagram of the External medical results processing system PoC
Below is a high-level diagram of the PoC system built to process external medical results documents. The system’s direct text extraction and OCR extraction capabilities were implemented first and can still be used if needed, but vision-LLM extraction is configured as the default method:

Administration and Monitoring Interface
The PoC system includes a basic admin UI that can be used to monitor metrics, health, and configuration, as well as controls to start, stop, or manually trigger a scan of the document ingestion directory:

Search Interface and Results
The system includes a search interface where users can perform full-text searches across the entire document text or search specific fields. Matching records are shown in a table:

Document Detail, Review, Edit, Export and View
When a user clicks the “View” action for a record in the search results, they are taken to the document details page. There, they can review the extracted data, make edits if needed, export the data, and view the document:

API
The system exposes an API that can be used to monitor, control, search, and retrieve documents. This API also enables integration with other systems:

AI Medical Results Assistant
Using the API shown above, we built an AI assistant that allows users to ask questions about documents, patient-specific results, or anything else the system can answer. The assistant uses the API to retrieve the relevant information and respond to the user.
An MCP (Model Context Protocol) server can also use the same API to provide an AI agent with a clean interface to the document system. The diagram below shows how the IOERD Assistant interfaces with the document system.

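As a sketch of that integration pattern, an MCP server wrapping the document system’s API could look something like the example below, using the official MCP Python SDK’s FastMCP helper. The endpoint paths and tool names here are hypothetical, not the PoC’s actual API:

# Hypothetical sketch of an MCP server wrapping the document system's API.
# Endpoint paths and tool names are illustrative, not the PoC's actual interface.
# Install with: uv pip install mcp httpx
import httpx
from mcp.server.fastmcp import FastMCP

API_BASE = "http://localhost:8000/api"  # placeholder base URL for the document system

mcp = FastMCP("external-results")


@mcp.tool()
def search_documents(query: str) -> str:
    """Full-text search across processed external results documents."""
    resp = httpx.get(f"{API_BASE}/search", params={"q": query})
    resp.raise_for_status()
    return resp.text


@mcp.tool()
def get_document(document_id: str) -> str:
    """Retrieve the extracted data for a single results document."""
    resp = httpx.get(f"{API_BASE}/documents/{document_id}")
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport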
To keep everything open-source and open-weights, we used OpenAI’s open-weight model, GPT-OSS-120B, for the AI assistant. This allows the entire system to run on-premises, preventing sensitive data from leaving the campus network.
AI Assistant Examples
- The example on the left (image below) shows the agent telling the user what it is, what its purpose is, and how it can help them.
- The example on the right (image below) shows the assistant fetching and displaying the last 5 records the system has received:

- The example on the left (image below), shows the assistant finding a results document based on the specimen/sample number.
- The example on the right (image below), shows the assistant finding the results for a specific patient:

Conclusion
OCR is still useful for certain use cases, but for extracting information from complex documents, especially scanned documents of less than optimal quality, it performs far worse than newer vision-LLM-based approaches. In practice, LLMs also handle Markdown more reliably than a messy stream of extracted words, so converting a document to Markdown before running any extraction step can be a game changer.
The main downside of the vision-LLM approach versus OCR is cost and performance. It is more computationally intensive and typically slower when run on local GPUs, and it can be more expensive when run on rented GPUs (for example, AWS EC2) or when billed per token (for example, OpenAI, Anthropic, or AWS Bedrock).
