Machine Learning DEC 2, 2024 6 min read

OCR in Healthcare: Extracting Sanity from Prescription Faxes

The Problem: Welcome to 2024, Where Faxes Still Exist

When I started at the pharmacy, I discovered something shocking: we were processing 2,000+ prescription faxes monthly. Yes, faxes. In 2024. As in, those blurry, barely readable documents that get sent over phone lines like it’s 1995.

The process was manual hell. Pharmacy technicians spent hours squinting at faxed prescriptions, trying to decipher Dr. Johnson’s handwriting (which looks like medical notes written mid-earthquake), dealing with folded papers that obscured critical information, and managing wildly inconsistent form layouts from different healthcare systems. Error rates hovered around 3-5%, which sounds low until you realize that’s potentially wrong medications reaching patients.

We needed automation. Enter AWS Textract and a healthy dose of practical ML engineering.

Why Generic OCR Fails at Healthcare Documents

Before we built our solution, I tested a few off-the-shelf OCR tools. They were… optimistic. A standard OCR engine sees a prescription form and tries to extract every character with equal confidence. But healthcare documents are chaos:

Handwriting variability: Doctors seem to use their prescriptions as artistic canvases. Some write cursively, others print, and a select few write in what can only be described as “pharmaceutical hieroglyphics.”

Form diversity: We receive prescriptions from 50+ different healthcare systems, each with unique layouts, varying field placements, and inconsistent formatting. A patient name field could be anywhere from the top-left to buried in the middle of a signature block.

Image quality nightmares: Faxes are low-resolution by nature. Add aging paper, transmission artifacts, and the occasional coffee stain, and you get documents that make OCR engines weep.

Critical context: It’s not enough to extract text—you need to extract the right text. A random “20” could be a patient ID, medication quantity, or dosage. Context matters enormously, and generic OCR has zero context.

Our Solution: AWS Textract + Smart Post-Processing

We built a two-stage system: raw extraction via AWS Textract, followed by intelligent post-processing to handle the healthcare-specific quirks.

Stage 1: AWS Textract Extraction

Textract is a game-changer for document analysis. Unlike basic OCR, it understands document structure: tables, forms, and the relationships between fields. It also returns a confidence score (on a 0-100 scale) for each block, which is gold for identifying uncertain extractions.

Here’s our basic extraction pipeline:

import boto3
from typing import Dict, List

class PrescriptionOCREngine:
    def __init__(self):
        self.textract_client = boto3.client('textract', region_name='us-east-1')
        # Textract reports Confidence on a 0-100 scale, not 0-1
        self.confidence_threshold = 85.0

    def extract_prescription(self, document_path: str) -> Dict:
        """
        Extract text and form data from prescription document.
        """
        with open(document_path, 'rb') as doc:
            document_bytes = doc.read()

        # Call Textract with form and table detection
        response = self.textract_client.analyze_document(
            Document={'Bytes': document_bytes},
            FeatureTypes=['FORMS', 'TABLES']
        )

        # Extract form fields and their values
        form_fields = {}
        for block in response['Blocks']:
            if block['BlockType'] == 'KEY_VALUE_SET':
                if 'KEY' in block.get('EntityTypes', []):
                    key = self._extract_text_from_block(
                        block, response['Blocks']
                    )
                    # Find corresponding value
                    for relationship in block.get('Relationships', []):
                        if relationship['Type'] == 'VALUE':
                            value_block_id = relationship['Ids'][0]
                            value = self._extract_text_from_block_id(
                                value_block_id, response['Blocks']
                            )
                            form_fields[key.strip()] = {
                                'value': value.strip(),
                                'confidence': self._get_block_confidence(
                                    value_block_id, response['Blocks']
                                )
                            }

        return {
            'raw_fields': form_fields,
            'raw_blocks': response['Blocks'],
            'confidence_scores': self._compute_confidence_metrics(form_fields)
        }

    def _extract_text_from_block(self, block: Dict, all_blocks: List) -> str:
        """Extract text content from a block and its relationships."""
        text = ''
        if 'Relationships' in block:
            for rel in block['Relationships']:
                if rel['Type'] == 'CHILD':
                    for child_id in rel['Ids']:
                        child_block = next(
                            (b for b in all_blocks if b['Id'] == child_id),
                            None
                        )
                        if child_block and child_block['BlockType'] == 'WORD':
                            text += child_block.get('Text', '') + ' '
        return text

    def _extract_text_from_block_id(self, block_id: str, all_blocks: List) -> str:
        """Extract text from a specific block ID."""
        block = next((b for b in all_blocks if b['Id'] == block_id), None)
        if block:
            return self._extract_text_from_block(block, all_blocks)
        return ''

    def _get_block_confidence(self, block_id: str, all_blocks: List) -> float:
        """Get confidence score for a block."""
        block = next((b for b in all_blocks if b['Id'] == block_id), None)
        if block:
            return block.get('Confidence', 0.0)
        return 0.0

    def _compute_confidence_metrics(self, form_fields: Dict) -> Dict:
        """Compute overall confidence metrics."""
        if not form_fields:
            return {'average': 0.0, 'low_confidence_fields': []}

        confidences = [f['confidence'] for f in form_fields.values()]
        low_confidence = [
            k for k, v in form_fields.items()
            if v['confidence'] < self.confidence_threshold
        ]

        return {
            'average': sum(confidences) / len(confidences),
            'low_confidence_fields': low_confidence
        }
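
One thing worth sketching separately: the helpers above rescan the whole block list for every Id lookup, which adds up on dense forms. Below is a minimal variant that indexes blocks by Id once and then walks the KEY -> VALUE relationships. The function names and the hand-built SAMPLE_BLOCKS list are my own stand-ins (not part of the pipeline above), shaped like the Blocks list that analyze_document returns, so the logic can be tried without calling AWS.

```python
from typing import Dict, List


def index_blocks(blocks: List[Dict]) -> Dict[str, Dict]:
    """Map each block Id to its block so lookups are O(1)."""
    return {b['Id']: b for b in blocks}


def block_text(block: Dict, by_id: Dict[str, Dict]) -> str:
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'CHILD':
            continue
        for child_id in rel['Ids']:
            child = by_id.get(child_id)
            if child and child['BlockType'] == 'WORD':
                words.append(child.get('Text', ''))
    return ' '.join(words)


def key_value_pairs(blocks: List[Dict]) -> Dict[str, Dict]:
    """Pair every KEY block with the text and confidence of its VALUE block."""
    by_id = index_blocks(blocks)
    pairs = {}
    for block in blocks:
        if block['BlockType'] != 'KEY_VALUE_SET':
            continue
        if 'KEY' not in block.get('EntityTypes', []):
            continue
        key = block_text(block, by_id)
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'VALUE':
                continue
            for value_id in rel['Ids']:
                value_block = by_id.get(value_id)
                if value_block is None:
                    continue
                pairs[key] = {
                    'value': block_text(value_block, by_id),
                    'confidence': value_block.get('Confidence', 0.0),
                }
    return pairs


# Hand-built stand-in for response['Blocks'] (illustrative values only)
SAMPLE_BLOCKS = [
    {'Id': 'k1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Confidence': 97.1,
     'Relationships': [{'Type': 'VALUE', 'Ids': ['v1']},
                       {'Type': 'CHILD', 'Ids': ['w1', 'w2']}]},
    {'Id': 'v1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Confidence': 91.5,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3', 'w4']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Patient'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Name'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'Jane'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': 'Doe'},
]

fields = key_value_pairs(SAMPLE_BLOCKS)
print(fields)
```

The output has the same shape as raw_fields above, so it could feed the post-processing stage unchanged.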

Stage 2: Healthcare-Specific Post-Processing

Raw Textract output is the start, not the finish. We built a post-processing layer that handles healthcare document quirks:

import re
from datetime import datetime
from typing import Dict, Optional

class PrescriptionParser:
    def __init__(self):
        # Small illustrative medication table; each regex also accepts common truncations
        self.medication_patterns = {
            'amoxicillin': r'amoxicill[a-z]*|amox',
            'lisinopril': r'lisinopril|lisino',
            'metformin': r'metformin|metf'
        }
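        # Date formats: YYYY-MM-DD or YYYY/MM/DD first, then MM/DD/YYYY, MM-DD-YY, etc.
        # Word boundaries keep the month-first pattern from matching the tail
        # of a year-first date (e.g. "12/03/04" inside "2012/03/04").
        self.dob_patterns = [
            r'\b(\d{4})[/-](\d{1,2})[/-](\d{1,2})\b',
            r'\b(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})\b'
        ]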

    def parse_prescription(self, raw_fields: Dict) -> Dict:
        """
        Parse and normalize extracted fields from raw OCR output.
        """
        prescription = {
            'patient_name': self._parse_name(raw_fields),
            'patient_dob': self._parse_dob(raw_fields),
            'medication': self._parse_medication(raw_fields),
            'dosage': self._parse_dosage(raw_fields),
            'quantity': self._parse_quantity(raw_fields),
            'prescriber': self._parse_prescriber(raw_fields),
            'date': self._parse_date(raw_fields),
            'quality_flags': []
        }

        # Add quality flags for manual review if needed
        if not prescription['patient_dob']:
            prescription['quality_flags'].append('missing_dob')
        if not prescription['medication']:
            prescription['quality_flags'].append('unable_to_identify_medication')

        return prescription

    def _parse_name(self, fields: Dict) -> Optional[str]:
        """Extract and normalize patient name."""
        name_candidates = [
            fields.get('Patient Name', {}).get('value', ''),
            fields.get('Name', {}).get('value', ''),
            fields.get('Patient', {}).get('value', '')
        ]

        for candidate in name_candidates:
            if candidate and len(candidate) > 2:
                # Drop special characters, then collapse runs of whitespace
                cleaned = re.sub(r'[^\w\s\'-]', '', candidate)
                cleaned = ' '.join(cleaned.split())
                if len(cleaned.split()) >= 2:  # At least 2 name components
                    return cleaned.title()

        return None

    def _parse_dob(self, fields: Dict) -> Optional[str]:
        """Extract and normalize date of birth."""
        dob_candidates = [
            fields.get('DOB', {}).get('value', ''),
            fields.get('Date of Birth', {}).get('value', ''),
            fields.get('Patient DOB', {}).get('value', '')
        ]

        for candidate in dob_candidates:
            for pattern in self.dob_patterns:
                match = re.search(pattern, candidate)
                if match:
                    try:
                        # Handle both MM/DD/YYYY and YYYY/MM/DD
                        groups = match.groups()
                        if len(groups[0]) == 4:
                            # YYYY format first
                            year, month, day = int(groups[0]), int(groups[1]), int(groups[2])
                        else:
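                            # MM/DD/YYYY or MM/DD/YY format
                            month, day, year = int(groups[0]), int(groups[1]), int(groups[2])
                            if year < 100:
                                # A birth year cannot be in the future, so any
                                # two-digit year past the current one is 19xx
                                current_yy = datetime.now().year % 100
                                year += 2000 if year <= current_yy else 1900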

                        # Validate date
                        datetime(year, month, day)
                        return f"{month:02d}/{day:02d}/{year}"
                    except ValueError:
                        continue

        return None

    def _parse_medication(self, fields: Dict) -> Optional[str]:
        """Identify medication by pattern match, falling back to the raw text."""
        med_candidates = [
            fields.get('Medication', {}).get('value', ''),
            fields.get('Drug', {}).get('value', ''),
            fields.get('Rx', {}).get('value', '')
        ]

        for candidate in med_candidates:
            candidate_lower = candidate.lower()
            for med_name, pattern in self.medication_patterns.items():
                if re.search(pattern, candidate_lower):
                    return med_name

        # If no match, return the raw medication if it's not empty
        for candidate in med_candidates:
            if candidate:
                return candidate

        return None

    def _parse_dosage(self, fields: Dict) -> Optional[str]:
        """Extract medication dosage (e.g., '500mg', '5ml')."""
        dosage_raw = fields.get('Dosage', {}).get('value', '') or \
                     fields.get('Strength', {}).get('value', '')

        # Look for patterns like "500mg", "5ml", etc.
        match = re.search(r'(\d+(?:\.\d+)?)\s*(mg|ml|mcg|units)', dosage_raw, re.I)
        if match:
            return f"{match.group(1)}{match.group(2).lower()}"

        return None if not dosage_raw else dosage_raw

    def _parse_quantity(self, fields: Dict) -> Optional[int]:
        """Extract quantity (number of pills/doses)."""
        qty_raw = fields.get('Quantity', {}).get('value', '') or \
                  fields.get('Qty', {}).get('value', '')

        # Take the first run of digits (e.g. "30" from "#30 tablets")
        match = re.search(r'(\d+)', qty_raw)
        if match:
            return int(match.group(1))

        return None

    def _parse_date(self, fields: Dict) -> Optional[str]:
        """Extract prescription date."""
        date_candidates = [
            fields.get('Date', {}).get('value', ''),
            fields.get('Rx Date', {}).get('value', ''),
            fields.get('Date Written', {}).get('value', '')
        ]

        for candidate in date_candidates:
            for pattern in self.dob_patterns:
                match = re.search(pattern, candidate)
                if match:
                    return candidate

        return None

    def _parse_prescriber(self, fields: Dict) -> Optional[str]:
        """Extract prescriber (doctor) name."""
        prescriber_candidates = [
            fields.get('Prescriber', {}).get('value', ''),
            fields.get('Physician', {}).get('value', ''),
            fields.get('MD', {}).get('value', '')
        ]

        for candidate in prescriber_candidates:
            if candidate and len(candidate) > 2:
                return candidate.title()

        return None
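
The regex table in the parser only catches truncations it was written for. For OCR slips like a dropped letter or "rn" read in place of "m", an edit-distance style match against a drug list is one option. Here is a minimal sketch using difflib from the standard library; KNOWN_MEDICATIONS, fuzzy_medication, and the 0.8 cutoff are assumptions for illustration, not what runs in our pipeline, and a real system would load a full formulary.

```python
import re
from difflib import get_close_matches
from typing import List, Optional

# Hypothetical formulary for illustration only
KNOWN_MEDICATIONS = ['amoxicillin', 'lisinopril', 'metformin']


def fuzzy_medication(raw: str,
                     known: List[str] = KNOWN_MEDICATIONS,
                     cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known medication for any word in raw, else None."""
    # Split into alphabetic tokens so "Amoxicilin 500mg" yields "amoxicilin"
    for token in re.findall(r'[a-z]+', raw.lower()):
        if len(token) < 4:
            continue  # skip units and fragments such as "mg"
        matches = get_close_matches(token, known, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return None


print(fuzzy_medication('Amoxicilin 500mg'))
print(fuzzy_medication('Metforrnin'))
print(fuzzy_medication('Aspirin 81mg'))
```

A miss returns None rather than a guess, so the caller can raise the unable_to_identify_medication flag and hand the fax to a technician. The cutoff needs tuning against real data: too low and similar-looking drug names start to collide, which is exactly the error you cannot afford here.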

Results: From Manual Hell to Automation

After deploying this system:

Processing time: Reduced from 3-5 minutes per prescription (manual) to <500ms (automated)
Accuracy: 96-98% on well-scanned documents, 85-90% on poor-quality faxes
Manual review rate: Only 8-12% of prescriptions now need human verification (down from 100%)
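Cost: Textract is billed per page. The roughly $1.50 per 1,000 pages figure applies to plain text detection; the FORMS and TABLES analysis used above is priced higher per page, so check current AWS pricing. Either way, it is dramatically cheaper than manual labor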

The remaining 8-12% that require review are flagged with specific quality indicators: missing critical fields, low-confidence extractions, or ambiguous handwriting. Technicians can quickly check just the flagged fields rather than reading every document from scratch.
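
For a rough sense of scale, the figures above can be turned into technician hours per month. This is back-of-the-envelope arithmetic, not a measurement: it takes 2,000 faxes as the monthly volume and assumes, pessimistically, that reviewing a flagged prescription takes as long as a full manual pass (3-5 minutes).

```python
FAXES_PER_MONTH = 2000        # "2,000+" in the post, taken as a lower bound
MANUAL_MINUTES = (3, 5)       # manual handling time per prescription
REVIEW_RATE = (0.08, 0.12)    # share still needing human verification


def hours(count: float, minutes_each: float) -> float:
    """Convert a number of documents and minutes per document into hours."""
    return count * minutes_each / 60


# Fully manual workflow
manual_low = hours(FAXES_PER_MONTH, MANUAL_MINUTES[0])
manual_high = hours(FAXES_PER_MONTH, MANUAL_MINUTES[1])

# Automated workflow: only flagged prescriptions reach a technician.
# Assumption: a flagged review costs the same minutes as a manual pass.
review_low = hours(FAXES_PER_MONTH * REVIEW_RATE[0], MANUAL_MINUTES[0])
review_high = hours(FAXES_PER_MONTH * REVIEW_RATE[1], MANUAL_MINUTES[1])

print(f"manual:    {manual_low:.0f}-{manual_high:.0f} hours/month")
print(f"automated: {review_low:.0f}-{review_high:.0f} hours/month")
```

Under those assumptions the workload drops from roughly 100-167 technician hours a month to roughly 8-20.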

Lessons Learned

1. Confidence scores are your friend. Don’t trust any extraction without understanding the confidence behind it. We use multi-tiered thresholds: high confidence (>95%) goes straight to fulfillment, medium (85-95%) triggers secondary validation, low (<85%) demands human review.

2. Context is irreplaceable. A medication name extracted with 99% confidence might still be wrong if it’s misspelled. Maintaining a medication database and doing fuzzy matching catches errors generic OCR misses.

3. Graceful degradation matters. When Textract fails completely (very rare), we need fallbacks. For critical information like dosage, we escalate to manual review rather than guessing.

4. Healthcare regulations aren’t negotiable. We audit every prescription extraction and maintain detailed logs for compliance. This isn’t just engineering; it’s patient safety.
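
The tiered thresholds and the "escalate rather than guess" rule from the lessons above can be sketched as a single routing function. This is a simplified illustration: the queue names, the choice to route on the lowest field confidence, and the extraction_ok flag are assumptions, and confidences are on Textract's 0-100 scale.

```python
from typing import Dict, List

HIGH_CONFIDENCE = 95.0    # above this: straight to fulfillment
REVIEW_THRESHOLD = 85.0   # below this: human review


def route_prescription(fields: Dict[str, Dict],
                       quality_flags: List[str],
                       extraction_ok: bool = True) -> str:
    """Pick a queue for one prescription based on its weakest field."""
    # Failed extraction, no fields, or any quality flag: never guess
    if not extraction_ok or not fields or quality_flags:
        return 'manual_review'

    lowest = min(f['confidence'] for f in fields.values())
    if lowest < REVIEW_THRESHOLD:
        return 'manual_review'
    if lowest > HIGH_CONFIDENCE:
        return 'auto_fulfill'
    return 'secondary_validation'


clean = {'Medication': {'confidence': 99.0}, 'Dosage': {'confidence': 96.2}}
shaky = {'Medication': {'confidence': 99.0}, 'Dosage': {'confidence': 90.0}}
poor = {'Medication': {'confidence': 99.0}, 'Dosage': {'confidence': 70.0}}

print(route_prescription(clean, []))
print(route_prescription(shaky, []))
print(route_prescription(poor, []))
print(route_prescription(clean, ['missing_dob']))
```

Routing on the minimum rather than the average is deliberate in this sketch: one unreadable dosage should not be averaged away by five crisp fields.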

Healthcare OCR isn’t sexy, but it’s incredibly practical. Every optimization here directly impacts patient care and pharmacist sanity. And in 2024, anything that reduces faxes is a win in my book.

Have questions about healthcare ML systems? Let me know on Twitter or check out the code on GitHub.