
OCR / Extraction Module

The OCR / Extraction Module in DocView Capture is responsible for converting scanned images into machine-readable text and extracting structured data.

This is where documents are transformed from raw page images into searchable text and field values that feed downstream indexing and workflows.

OCR / Extraction Workflow

1. Batch Initialization

  • The system identifies the batch using its Batch ID.
  • Example: Batch 27781, DocView Demo Batch 26-02-2022 13:48:42.

2. OCR Process Start

  • Each document in the batch is processed sequentially.
  • Output files (e.g., 1.xml, 3.xml) are created to store the OCR results.

3. Text Recognition

  • The OCR engine analyses page images.
  • It converts scanned letters, numbers, and symbols into machine-readable text.
  • Produces XML output containing both recognised text and positional metadata.
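To make "recognised text plus positional metadata" concrete, here is a minimal Python sketch of reading such a file. The XML layout (a `word` element carrying `x`, `y`, `w`, `h` attributes) is a hypothetical illustration, not DocView Capture's actual output schema, and the helper names `read_words` and `page_text` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output; element and attribute names are assumptions,
# not DocView Capture's real schema.
SAMPLE_XML = """
<document>
  <page number="1">
    <word x="72" y="90" w="58" h="14">Invoice</word>
    <word x="136" y="90" w="64" h="14">INV-1001</word>
  </page>
</document>
"""

def read_words(xml_text):
    """Return a list of (text, (x, y, w, h)) tuples in document order."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for word in root.iter("word"):
        # Bounding box of the word on the page image, as integers.
        box = tuple(int(word.get(key)) for key in ("x", "y", "w", "h"))
        words.append((word.text, box))
    return words

def page_text(xml_text):
    """Join the recognised words into a plain, searchable string."""
    return " ".join(text for text, _ in read_words(xml_text))
```

Keeping the bounding box next to each word is what would let a later step locate a field by its position on the page rather than by text alone.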

4. Data Extraction

  • After OCR is complete, the extraction engine identifies fields based on defined templates and rules.
  • Example fields: Invoice Number, Date, Vendor Name, Total Amount.
  • The recognition phase validates extracted values against regular expressions or lookup tables.
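The validation step described above can be sketched as follows. This is an illustration only: the patterns, the vendor list, and the function names are assumptions made for the example, and real rules would come from the templates configured in the product.

```python
import re

# Hypothetical rules: field name -> regular expression the value must fully match.
FIELD_PATTERNS = {
    "Invoice Number": re.compile(r"INV-\d{4,}"),
    "Date": re.compile(r"\d{2}-\d{2}-\d{4}"),
    "Total Amount": re.compile(r"\d+(\.\d{2})?"),
}

# Hypothetical lookup table of known vendors.
KNOWN_VENDORS = {"Acme Ltd", "Globex Corp"}

def validate_field(name, value):
    """Return True if the value passes the regex or lookup rule for its field."""
    if name == "Vendor Name":
        return value in KNOWN_VENDORS
    pattern = FIELD_PATTERNS.get(name)
    if pattern is None:
        # Fields without a rule are accepted as-is in this sketch.
        return True
    return pattern.fullmatch(value) is not None

def validate_all(fields):
    """Map each field name to its validation result."""
    return {name: validate_field(name, value) for name, value in fields.items()}
```

Using `fullmatch` rather than `search` means a value with stray characters around an otherwise valid number is rejected, which is usually what you want when the input comes from OCR.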

5. Results Storage

  • Extracted data is linked to the document’s XML structure.
  • Both the raw OCR text and extracted fields are stored for downstream modules (Review, Index, Export).
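One way to picture extracted fields being linked to a document's XML is to append them to the same tree that holds the OCR text. The sketch below assumes a hypothetical `fields`/`field` element layout and a hypothetical helper `attach_fields`. It does not reflect how DocView Capture actually stores its results.

```python
import xml.etree.ElementTree as ET

def attach_fields(ocr_xml, fields):
    """Return a new XML string with a <fields> element appended to the root.

    The <fields>/<field name="..."> layout is an assumption for illustration.
    """
    root = ET.fromstring(ocr_xml)
    container = ET.SubElement(root, "fields")
    for name, value in fields.items():
        field = ET.SubElement(container, "field", {"name": name})
        field.text = value
    # The original OCR content stays in place, so raw text and extracted
    # fields travel together to later modules.
    return ET.tostring(root, encoding="unicode")
```

Because the OCR elements are left untouched, a downstream step could read either the raw text or the field values from the same file.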

Key Operations Seen in Log

  • Start OCR Documents (1.xml, 3.xml) – OCR process begins, generating text output for each page.
  • OCR Documents Generated – OCR completed successfully, output saved in structured XML.
  • Start Recognition (1.xml, 3.xml) – Data extraction begins, pulling field values from OCR results.

Toolbar Overview

Home Tab

  • Start OCR – Run OCR on selected batch.
  • Stop Process – Halt an ongoing OCR session if needed.
  • Reprocess – Rerun OCR on selected documents for better accuracy.

View Tab

  • OCR Output Viewer – Preview recognised text.
  • Error Log – Shows OCR/recognition errors.

Settings Tab

  • OCR Engine Selection – Choose engine (Tesseract, ABBYY, etc., depending on environment).
  • Language Packs – Configure language-specific OCR.
  • Extraction Rules – Define field rules, regex, and lookup references.

Status Indicators

  • Processing – OCR in progress.
  • Completed – OCR/extraction finished for the document.
  • Failed – OCR could not process the document (e.g., low-quality scan or unsupported format).
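The three statuses above map naturally onto a small state model. The following sketch shows sequential processing in which each document ends as Completed or Failed. The `run_ocr` callable and its use of `ValueError` to signal an unprocessable document are assumptions for the example, not part of the product's API.

```python
from enum import Enum

class Status(Enum):
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"

def process_batch(documents, run_ocr):
    """Process documents one at a time and record a final status for each.

    `run_ocr` is a hypothetical callable that raises ValueError when a
    document cannot be processed (for example, an unsupported format).
    """
    statuses = {}
    for doc in documents:
        statuses[doc] = Status.PROCESSING
        try:
            run_ocr(doc)
            statuses[doc] = Status.COMPLETED
        except ValueError:
            statuses[doc] = Status.FAILED
    return statuses
```

A failure in one document does not stop the loop, so the rest of the batch can still complete and the failed items can be reprocessed afterwards.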

OCR / Extraction Module Screenshot