Build Log
March 4, 2026

Receipt OCR with Claude Vision

How we built a two-pass receipt scanning pipeline for a Telegram bot using Tesseract for detection and Claude Vision for structured extraction — running in production at ~$0.024 per scan.

The Problem

FridgeKit is a Telegram bot for fridge and pantry inventory management. Users photograph their grocery receipts and the bot adds every item to their inventory automatically. The challenge: receipts are one of the worst inputs for OCR. Thermal paper fades. Store printers compress product names into cryptic abbreviations. Users photograph receipts at angles, in bad lighting, sometimes crumpled. And the text is multilingual — a Polish receipt says MasloEkstra200g where you need Masło ekstra 200g.

Traditional OCR gives you raw text with confidence scores. That is necessary but not sufficient. What the bot actually needs is structured data: item names expanded into readable form, quantities, unit prices, the store name, a purchase date in ISO format, and a currency code. Regex will not get you there. The gap between OCR text and structured product data is where Claude Vision comes in.

Architecture: Isolated OCR Microservice

The receipt scanner runs as a separate Docker container from the main bot. User-uploaded images are untrusted input — processing them in an isolated service limits the blast radius of any image-parsing exploit. The bot and the OCR service communicate over an internal Docker network. Port 3001 is never exposed to the internet.

fridgekit-bot → POST /detect   (quick receipt detection, local OCR only)
             → POST /process  (full AI parsing pipeline)
             ← structured JSON response

The bot calls two endpoints. /detect runs Tesseract locally — no AI, no cost — and returns a confidence score for whether the image is a receipt at all. /process runs the full pipeline: download the photo from Telegram, preprocess it, extract structured data through Claude Vision, and return clean JSON. The bot handles all user interaction, inventory writes, and payment logic. The OCR service is stateless.

Docker Networking

Both containers share a fridgekit-internal bridge network. The OCR service creates it; the bot joins it as external. The bot reaches the service at http://glacierphonk-fridgekit-ocr:3001 via Docker DNS. No host port mapping required.
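As a sketch, the compose files for this topology might look like the following. The container and network names come from the article; the compose layout itself is an assumption, not the project's actual files.

```yaml
# OCR service compose file — creates the internal network
services:
  fridgekit-ocr:
    container_name: glacierphonk-fridgekit-ocr
    build: .
    networks:
      - fridgekit-internal
    # note: no `ports:` section — 3001 never leaves the Docker network

networks:
  fridgekit-internal:
    name: fridgekit-internal
    driver: bridge
```

```yaml
# bot compose file — joins the existing network as external
services:
  fridgekit-bot:
    build: .
    environment:
      OCR_URL: http://glacierphonk-fridgekit-ocr:3001
    networks:
      - fridgekit-internal

networks:
  fridgekit-internal:
    external: true
```

Because the network is marked `external` on the bot side, either container can be redeployed independently without tearing down the shared network.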

Image Preprocessing with Sharp

Raw photos from Telegram are often 3000+ pixels on a side. Sending them untouched wastes CPU time in Tesseract and tokens in Claude Vision. Two separate preprocessing pipelines handle the two consumers.

For Tesseract (used in detection), the image is resized to fit within 2048px, converted to grayscale, normalized for contrast, sharpened, and output as PNG:

import sharp from "sharp";

export async function preprocessReceipt(buffer: Buffer): Promise<Buffer> {
  return sharp(buffer)
    .resize(2048, 2048, { fit: "inside", withoutEnlargement: true })
    .grayscale()
    .normalize()
    .sharpen()
    .png()
    .toBuffer();
}

For Claude Vision, the image gets a lighter touch. Resize to 1536px max, optional grayscale (enabled by default to reduce payload size), and JPEG at 85% quality. Vision models handle color and imperfect contrast well enough — aggressive normalization can actually strip useful visual context like colored price tags or highlighted discounts.

export async function prepareForVision(buffer: Buffer): Promise<Buffer> {
  let pipeline = sharp(buffer).resize(1536, 1536, {
    fit: "inside",
    withoutEnlargement: true,
  });
  if (config.VISION_IMAGE_GRAYSCALE) {
    pipeline = pipeline.grayscale();
  }
  return pipeline.jpeg({ quality: 85 }).toBuffer();
}

The distinction matters for cost. A 3000px color JPEG burns through more Vision API tokens than a 1536px grayscale one. At production volume, those savings compound.
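Anthropic's documentation gives a rule of thumb of roughly (width × height) / 750 tokens per image, and the API downscales anything whose long edge exceeds about 1568px — so resizing to 1536px client-side keeps the image just under that cap while also shrinking the upload. A quick sketch of the estimate (the helper below is illustrative, not part of the service):

```typescript
// Approximate Claude vision token cost for an image, per Anthropic's
// published rule of thumb: tokens ≈ (width * height) / 750.
function estimateImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

// Tokens if each size were processed as-is:
const fullSize = estimateImageTokens(3000, 3000); // ~12,000 tokens
const resized = estimateImageTokens(1536, 1536);  // ~3,146 tokens

console.log(`3000px: ${fullSize} tokens, 1536px: ${resized} tokens`);
```

Grayscale does not change the token estimate — that depends only on dimensions — but it does shrink the JPEG payload, which cuts upload time on mobile connections.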

Tesseract Pass: Detection & Confidence

Before spending $0.024 on AI parsing, the service checks whether the image is actually a receipt. Tesseract.js runs in-process via WASM — no external binary, no subprocess. A singleton worker is lazily initialized and reused across requests, with automatic re-creation when the OCR language changes.

import { createWorker, type Worker } from "tesseract.js";

let worker: Worker | null = null;
let currentLangs: string | null = null;

async function getWorker(langs: string): Promise<Worker> {
  if (worker && currentLangs === langs) return worker;
  if (worker) await worker.terminate();

  worker = await createWorker(langs);
  currentLangs = langs;
  return worker;
}

export async function ocrExtract(
  imageBuffer: Buffer,
  langs = "eng",
): Promise<{ text: string; confidence: number }> {
  const w = await getWorker(langs);
  const { data } = await w.recognize(imageBuffer);
  return { text: data.text, confidence: data.confidence };
}

The raw OCR text feeds into a scoring function — pure regex, no AI. It checks for price patterns (\d+[.,]\d{2}), currency symbols, total/sum keywords in 12 languages, tax keywords, date formats, and itemized line structures. Each signal contributes to a 0–100 confidence score. Anything above 50 is treated as a receipt.

const PRICE_PATTERN = /\d+[.,]\d{2}/g;
const TOTAL_KEYWORDS =
  /\b(total|suma|razem|subtotal|summe|gesamt|somme|totale|összesen|celkem|totalt)\b/gi;
const ITEMIZED_LINE = /^.{3,}\s+\d+[.,]\d{2}\s*$/gm;

export function detectReceipt(ocrText: string): DetectionResult {
  let score = 0;

  const prices = ocrText.match(PRICE_PATTERN);
  if (prices && prices.length >= 3) score += 25;

  const totals = ocrText.match(TOTAL_KEYWORDS);
  if (totals && totals.length >= 1) score += 20;

  // ... currency, tax, date, itemized line checks

  return { isReceipt: score >= 50, confidence: Math.min(score, 100) };
}

This gate is critical. Without it, every photo message would trigger an AI call — selfies, memes, screenshots, everything. The regex detector costs nothing and catches the obvious non-receipts before any API spend.
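To make the gate concrete, here is a self-contained version of the detector you can run against sample text. The price and total checks mirror the article; the currency, date, and itemized-line patterns and their weights are assumptions for the sketch, since only the first two checks are specified above.

```typescript
// Illustrative receipt detector. Weights for the currency, date, and
// itemized-line signals are assumed, not taken from the production code.
const PRICE = /\d+[.,]\d{2}/g;
const TOTALS = /\b(total|suma|razem|subtotal|summe|gesamt|somme|totale)\b/gi;
const CURRENCY = /(\$|€|£|zł|PLN|EUR|USD|GBP)/g;
const DATE = /\b\d{2,4}[-./]\d{1,2}[-./]\d{2,4}\b/;
const ITEM_LINE = /^.{3,}\s+\d+[.,]\d{2}\s*$/gm;

function detectReceipt(text: string): { isReceipt: boolean; confidence: number } {
  let score = 0;
  if ((text.match(PRICE) ?? []).length >= 3) score += 25;   // several prices
  if ((text.match(TOTALS) ?? []).length >= 1) score += 20;  // a total keyword
  if ((text.match(CURRENCY) ?? []).length >= 1) score += 15;
  if (DATE.test(text)) score += 10;
  if ((text.match(ITEM_LINE) ?? []).length >= 3) score += 30; // itemized lines
  return { isReceipt: score >= 50, confidence: Math.min(score, 100) };
}

const receipt = [
  "BIEDRONKA 2026-02-11",
  "MasloEkstra200g 7,49",
  "ML.SW 2% 1L 3,99",
  "Chleb zytni 5,50",
  "SUMA PLN 16,98",
].join("\n");

console.log(detectReceipt(receipt));          // every signal fires → isReceipt: true
console.log(detectReceipt("lol nice meme"));  // no signals → isReceipt: false
```

The scoring stays additive on purpose: no single signal can pass the gate alone, so a screenshot that happens to contain one price still gets rejected.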

Claude Vision Pass: Structured Extraction

The actual parsing happens in two AI passes. This is the core design decision that makes the system work at production quality without production cost.

Pass 1: Raw Extraction with Haiku Vision

Claude Haiku 4.5 receives the preprocessed image and extracts everything on the receipt as-is. No cleanup, no intelligence about product names, no category assignment. Just faithful transcription into structured JSON.

The extraction uses the Vercel AI SDK’s generateObject() with a Zod schema, which guarantees the response conforms to the expected shape. No JSON parsing, no schema validation code, no “please respond in JSON format” prompt hacking.

import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

const receiptItemSchema = z.object({
  name: z.string().describe("Full product name"),
  quantity: z.number().describe("Quantity purchased"),
  unit: z.string().describe("Unit of measurement (kg, L, pcs, szt)"),
  price: z.number().describe("Total line price after discounts"),
  category: z.string().nullable().describe("Best matching category"),
  confidence: z.enum(["high", "medium", "low"]),
});

const parsedReceiptSchema = z.object({
  storeName: z.string().nullable(),
  storeAddress: z.string().nullable(),
  purchaseDate: z.string().nullable().describe("ISO YYYY-MM-DD"),
  items: z.array(receiptItemSchema),
  totalAmount: z.number().nullable(),
  currency: z.string().nullable().describe("3-letter code"),
});

The system prompt is terse and directive. Tell the model what the receipt language likely is (derived from the user’s locale), tell it to extract everything including discount lines, and tell it what to ignore (tax summary rows, tax category letters). No examples needed for this pass — Haiku Vision handles receipt images well when you are explicit about the output shape.

const EXTRACT_PROMPT = `You are a receipt OCR parser.
Extract raw data from this grocery receipt image.
The receipt is likely in {locale} language.
Extract ALL line items exactly as printed.
Include quantities, units, and per-item total prices.
Include discount lines as separate items.
Tax summary sections (PTU, PODATEK, MwSt, TVA) are NOT items.
Dates: ISO YYYY-MM-DD. Currency: 3-letter codes.`;

export async function parseReceiptImage(
  imageBase64: string,
  userLocale: string,
): Promise<ParsedReceipt> {
  const { object } = await generateObject({
    model: anthropic("claude-haiku-4-5-20251001"),
    schema: parsedReceiptSchema,
    system: EXTRACT_PROMPT.replace("{locale}", userLocale),
    messages: [
      {
        role: "user",
        content: [
          { type: "image", image: imageBase64 },
          { type: "text", text: "Extract all line items from this receipt." },
        ],
      },
    ],
  });
  return object;
}

Cost per scan for Pass 1: approximately $0.015. Haiku is fast — typical response time is 2–4 seconds including image upload.

Pass 2: Refinement with Sonnet

Pass 1 gives you raw data. MasloEkstra200g is still MasloEkstra200g. Non-food items (bags, cosmetics, cleaning products) are still in the list. Discounts appear as separate line items rather than being applied to their products.

Pass 2 sends the extracted JSON (text only, no image) to Claude Sonnet for cleanup. This is where domain knowledge matters. The model expands abbreviations (ML.SW 2% 1L becomes Mleko świeże 2% 1L), removes non-food items, applies discount lines to their products, and assigns each item to a category from the user’s inventory.

const REFINE_PROMPT = `You are a grocery receipt post-processor.
Item names are likely in {locale} language.

TASKS:
1. REMOVE non-food items: bags, cosmetics, toiletries,
   household items, cleaning products, pet supplies.
2. APPLY discounts: subtract discount lines from their
   product's price, then remove the discount line.
3. EXPAND abbreviated names into full, readable names
   in their original language.
4. MAP each item to the best category from: {categories}
5. Set confidence: high/medium/low based on
   abbreviation severity.`;

export async function refineReceiptItems(
  raw: ParsedReceipt,
  userLocale: string,
  categories: string[],
): Promise<ParsedReceipt> {
  const system = REFINE_PROMPT
    .replace("{locale}", userLocale)
    .replace("{categories}", categories.join(", "));

  const { object } = await generateObject({
    model: anthropic(config.REFINE_MODEL),
    schema: parsedReceiptSchema,
    system,
    prompt: JSON.stringify(raw),
  });
  return object;
}

The category list is passed from the bot at request time. The OCR service never touches the database — it remains entirely stateless. Categories like Dairy, Milk, Cheese, Yogurt, Meat, Bread come from the bot’s product_categories table. This means category taxonomy can evolve without redeploying the OCR service.

Cost per scan for Pass 2: approximately $0.009 (text-only, no image tokens).

Why Two Passes Instead of One

The obvious question: why not send the image to Sonnet directly and do everything in one call?

Three reasons. First, cost. Sonnet Vision costs significantly more per image token than Haiku Vision. Running Haiku for raw extraction and Sonnet for text-only refinement is cheaper than running Sonnet Vision for both. At $0.024 combined versus $0.04+ for a single Sonnet Vision call, the savings matter at scale.
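The arithmetic is simple enough to write down. The $0.04 single-call figure is the article's estimate; everything else follows from the per-pass costs quoted above.

```typescript
// Per-scan cost of the two-pass design vs. a single Sonnet Vision call.
const haikuVisionPass = 0.015; // Pass 1: image → raw JSON
const sonnetTextPass = 0.009;  // Pass 2: JSON → refined JSON
const twoPass = haikuVisionPass + sonnetTextPass;

const singleSonnetVision = 0.04; // article's estimate for one-pass
const savingsFor = (scans: number) =>
  scans * (singleSonnetVision - twoPass);

console.log(twoPass.toFixed(3));          // "0.024" per scan
console.log(savingsFor(1000).toFixed(2)); // "16.00" saved per 1,000 scans
```

At bot scale, a thousand scans a month is not a large user base, so the difference is real money rather than a rounding error.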

Second, debuggability. When a receipt parses incorrectly, you can inspect the Pass 1 output and immediately determine whether the problem is extraction (the model misread the image) or refinement (the model misinterpreted the data). Single-pass errors are harder to attribute.

Third, independent optimization. You can swap the refinement model without touching the extraction logic. The REFINE_MODEL is configurable via environment variable — testing a new Sonnet release is a one-line config change with zero code deployment.

Handling Edge Cases

Multilingual Receipts

Tesseract needs language hints to perform well. A mapping of 30+ countries to Tesseract language codes ensures Polish receipts use pol+eng, German receipts use deu+eng, Japanese receipts use jpn+eng. English is always included as a fallback because many receipts mix languages (brand names, product codes).

const COUNTRY_LANGS: Record<string, string> = {
  poland: "pol+eng",
  germany: "deu+eng",
  japan: "jpn+eng",
  china: "chi_sim+eng",
  // ... 30+ mappings
};

function getLangs(country: string): string {
  return COUNTRY_LANGS[country.toLowerCase().trim()] ?? "eng";
}

Claude Vision handles multilingual receipts natively — no language configuration needed on the AI side. The locale hint in the system prompt improves accuracy but is not strictly required.

Blurry Photos & Partial Receipts

The confidence field on each item serves as a quality signal. When Haiku cannot clearly read a product name, it marks it low confidence. The bot can surface this to the user: “Some items could not be read clearly. Please review.” This is better than silently adding wrong data to someone’s inventory.

Partial receipts — where the photo cuts off mid-list — are handled gracefully by the schema. All top-level fields (storeName, totalAmount, purchaseDate) are nullable. If the bottom of the receipt is missing, you still get the items that were visible. The total will be null, which the bot handles as “total not available.”
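On the bot side, handling the nullable total and the per-item confidence flags might look like the sketch below. The helper and message wording are hypothetical — the article does not show the bot's actual code.

```typescript
interface RefinedItem {
  name: string;
  price: number;
  confidence: "high" | "medium" | "low";
}

// Hypothetical bot-side summary: tolerates a missing total (cut-off
// photo) and surfaces low-confidence items for user review.
function summarize(items: RefinedItem[], totalAmount: number | null): string {
  const lines = [`Added ${items.length} items.`];
  lines.push(
    totalAmount === null
      ? "Total not available (receipt may be cut off)."
      : `Receipt total: ${totalAmount.toFixed(2)}`,
  );
  const unclear = items.filter((i) => i.confidence === "low");
  if (unclear.length > 0) {
    lines.push(`${unclear.length} item(s) could not be read clearly. Please review.`);
  }
  return lines.join("\n");
}

const items: RefinedItem[] = [
  { name: "Masło ekstra 200g", price: 7.49, confidence: "high" },
  { name: "Mleko świeże 2% 1L", price: 3.99, confidence: "low" },
];
console.log(summarize(items, null));
```

The point of the pattern: nulls and low confidence degrade into a user prompt, never into silently wrong inventory rows.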

Duplicate Items & Discount Lines

Polish receipts frequently print a product, then an OPUST (discount) line immediately after. Pass 2 is explicitly instructed to detect these patterns, subtract the discount from the product price, and remove the discount line from the output. The same logic handles RABAT, DISCOUNT, REMISE, and other regional discount keywords.

Actual duplicate items (someone bought two cartons of milk as separate line items) are preserved. The model distinguishes between “same product purchased twice” and “discount applied to previous product” through positional context on the receipt.
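As a pure-function sketch of what Pass 2 is asked to do: fold each discount line into the product directly above it, while leaving genuine duplicates alone. The keyword list and the strictly positional rule are simplifications — the model works from receipt context, not a fixed list.

```typescript
interface Line {
  name: string;
  price: number;
}

const DISCOUNT_WORDS = /\b(opust|rabat|discount|remise)\b/i;

// Fold discount lines (keyword-matched or negative-priced) into the
// immediately preceding product, mirroring receipt layout.
function applyDiscounts(lines: Line[]): Line[] {
  const out: Line[] = [];
  for (const line of lines) {
    const isDiscount = DISCOUNT_WORDS.test(line.name) || line.price < 0;
    if (isDiscount && out.length > 0) {
      // Subtract the discount magnitude from the previous product.
      out[out.length - 1].price += line.price < 0 ? line.price : -line.price;
    } else {
      out.push({ ...line });
    }
  }
  return out;
}

const parsed: Line[] = [
  { name: "Masło ekstra 200g", price: 7.49 },
  { name: "OPUST", price: -1.5 },
  { name: "Mleko świeże 2% 1L", price: 3.99 },
  { name: "Mleko świeże 2% 1L", price: 3.99 }, // genuine duplicate, kept
];
console.log(applyDiscounts(parsed));
```

Delegating this to the model rather than a function like the one above is deliberate: real receipts break the positional rule often enough (multi-line discounts, percentage rebates) that the regex version is only a baseline.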

The Full Pipeline

End to end, a single receipt scan follows this flow:

export async function processReceipt(req: ProcessRequest): Promise<ProcessResult> {
  // 1. Download from Telegram
  const raw = await downloadFile(req.fileId);

  // 2. Preprocess for Tesseract (detection metrics)
  const processed = await preprocessReceipt(raw);
  const langs = getLangs(req.userCountry);
  const { confidence: ocrConfidence } = await ocrExtract(processed, langs);

  // 3. Preprocess for Vision (lighter touch)
  const visionReady = await prepareForVision(raw);
  const base64 = imageToBase64(visionReady);

  // 4. Pass 1: Haiku Vision extraction
  const extracted = await parseReceiptImage(base64, req.userLocale);

  // 5. Pass 2: Sonnet refinement
  const refined = await refineReceiptItems(
    extracted, req.userLocale, req.categories
  );

  return {
    storeName: refined.storeName,
    storeAddress: refined.storeAddress,
    purchaseDate: refined.purchaseDate,
    items: refined.items,
    totalAmount: refined.totalAmount,
    currency: refined.currency,
    tier: 3,
    ocrConfidence,
  };
}

The Tesseract OCR still runs on every /process call, but only for metrics and logging. The actual parsing comes from Vision. The ocrConfidence field in the response tells the bot how readable the image was — useful for analytics and for deciding whether to prompt the user to retake the photo.

Performance & Cost

Production numbers from FridgeKit’s deployment on a t3.medium EC2 instance in eu-north-1:

  • Total response time: 4–8 seconds (dominated by two sequential AI calls)
  • Tesseract detection: 200–600ms (depends on image size and language model)
  • Image preprocessing: 50–150ms (sharp is fast)
  • Pass 1 (Haiku Vision): 2–4 seconds, ~$0.015
  • Pass 2 (Sonnet text): 1–3 seconds, ~$0.009
  • Total cost per scan: ~$0.024
  • Memory footprint: 180–250MB (Tesseract WASM + language data)

At $0.024 per scan, a user scanning 30 receipts per month costs $0.72 in API fees. That leaves healthy margin on a subscription model. The detection endpoint (/detect) costs nothing — it runs only local Tesseract and regex, so false-positive photos never hit the AI budget.

What We Would Change

The system works. But two things would improve it. First, caching: if a user photographs the same receipt twice (common when the first scan partially fails), the service processes it from scratch. Hashing the preprocessed image and caching results for 24 hours would eliminate duplicate API spend.
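The caching idea is straightforward to sketch with Node's built-in crypto module. The in-memory TTL map below is an illustration; a production version would more likely sit in Redis or similar so it survives container restarts.

```typescript
import { createHash } from "node:crypto";

interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours, per the article
const cache = new Map<string, CacheEntry<unknown>>();

// Key a scan by the SHA-256 of the preprocessed image bytes, so a
// re-photographed identical receipt skips both AI passes entirely.
function imageKey(buffer: Buffer): string {
  return createHash("sha256").update(buffer).digest("hex");
}

function getCached<T>(key: string, now = Date.now()): T | undefined {
  const entry = cache.get(key);
  if (!entry || entry.expiresAt < now) {
    cache.delete(key); // expired or absent
    return undefined;
  }
  return entry.value as T;
}

function setCached<T>(key: string, value: T, now = Date.now()): void {
  cache.set(key, { value, expiresAt: now + TTL_MS });
}

const img = Buffer.from("fake-preprocessed-image-bytes");
setCached(imageKey(img), { items: [] });
console.log(getCached(imageKey(img)) !== undefined); // same bytes → cache hit
```

One caveat: hashing the *preprocessed* image rather than the raw upload means the cache only hits when the user resends the same file, not when they retake the photo — two photos of the same receipt never hash identically.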

Second, streaming. The current implementation waits for Pass 1 to complete entirely before starting Pass 2. The Vercel AI SDK's streamObject (the streaming counterpart of generateObject) could emit extracted items as they arrive, letting refinement begin before extraction fully completes. In practice the latency savings would be small (1–2 seconds), but for a Telegram bot where users watch a typing indicator, every second counts.

The receipt scanner is one component of FridgeKit — a Telegram bot for managing your fridge, pantry, and grocery budget. If you are building something similar or want to discuss the architecture, reach out through the GlacierPhonk™ inquiry bot.