OCR / Extraction Module

The OCR / Extraction Module in DocView Capture is responsible for converting scanned images into machine-readable text and extracting structured data.

It turns raw page images into searchable text and field values that feed downstream indexing and workflows.

OCR / Extraction Workflow

1. Batch Initialization

The system identifies the batch using its Batch ID.
Example: Batch 27781, DocView Demo Batch 26-02-2022 13:48:42.

2. OCR Process Start

Each document in the batch is processed sequentially.
File names (e.g., 1.xml, 3.xml) are generated to store OCR output.
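
The sequential pass can be pictured roughly as follows. This is a minimal sketch, not DocView Capture's actual code: the OcrJob class, the plan_ocr_jobs and run_sequentially helpers, and the rule that the output name is the document number plus ".xml" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class OcrJob:
    document_id: int   # hypothetical numeric document identifier
    output_name: str   # file that will hold the OCR output


def plan_ocr_jobs(document_ids: Iterable[int]) -> List[OcrJob]:
    """Build one job per document, preserving the batch order."""
    # Assumption: the output file is named "<document id>.xml".
    return [OcrJob(doc_id, f"{doc_id}.xml") for doc_id in document_ids]


def run_sequentially(jobs: List[OcrJob], ocr: Callable[[OcrJob], None]) -> List[str]:
    """Process jobs one at a time and return the output names in order."""
    finished = []
    for job in jobs:
        ocr(job)  # a real engine call would go here
        finished.append(job.output_name)
    return finished


jobs = plan_ocr_jobs([1, 3])
print(run_sequentially(jobs, ocr=lambda job: None))  # ['1.xml', '3.xml']
```

The ocr callable is a stand-in; here it does nothing, so the example only demonstrates ordering and naming.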

3. Text Recognition

The OCR engine analyses the page images.
It converts scanned text, numbers, and other characters into machine-readable text.
It produces XML output containing both the recognised text and positional metadata.
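
To illustrate what "recognised text plus positional metadata" can look like to a consumer, here is a hedged sketch that reads such a file with Python's standard library. The XML schema is not documented in this section, so the element and attribute names used below (page, word, x, y, width, height) are hypothetical and will likely differ from the real output.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output; the real DocView schema may differ.
SAMPLE = """<page number="1">
  <word x="120" y="80" width="64" height="18">Invoice</word>
  <word x="190" y="80" width="72" height="18">INV-1001</word>
</page>"""


def read_words(xml_text):
    """Return each recognised word with its bounding box (x, y, width, height)."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("word"):
        box = tuple(int(node.get(key)) for key in ("x", "y", "width", "height"))
        words.append({"text": node.text or "", "box": box})
    return words


for word in read_words(SAMPLE):
    print(word["text"], word["box"])
```

Keeping the bounding box next to each word is what lets later steps locate a field by its position on the page as well as by its text.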

4. Data Extraction

After OCR is complete, the extraction engine identifies fields based on defined templates and rules.
Example fields: Invoice Number, Date, Vendor Name, Total Amount.
The recognition phase validates the extracted values against regular expressions or lookup tables.
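
The validation step described above might be sketched like this. The patterns, the vendor list, and the validate_field function are illustrative assumptions, not the product's actual rules; real rules are defined under Extraction Rules in the Settings Tab.

```python
import re

# Hypothetical regular-expression rules, one per field.
FIELD_PATTERNS = {
    "Invoice Number": re.compile(r"INV-\d{4,}"),
    "Date": re.compile(r"\d{2}-\d{2}-\d{4}"),
    "Total Amount": re.compile(r"\d+\.\d{2}"),
}

# Hypothetical lookup table of approved vendors (lower-cased).
KNOWN_VENDORS = {"acme supplies", "northwind traders"}


def validate_field(name, value):
    """Return True if the value passes the rule defined for the field."""
    value = value.strip()
    if name == "Vendor Name":
        # Lookup-table check, case-insensitive.
        return value.lower() in KNOWN_VENDORS
    pattern = FIELD_PATTERNS.get(name)
    if pattern is None:
        return False  # no rule defined: treat as not validated
    # fullmatch requires the whole value to match, not just a prefix.
    return pattern.fullmatch(value) is not None


print(validate_field("Invoice Number", "INV-1001"))  # True
print(validate_field("Total Amount", "12.5"))        # False
```

Using fullmatch rather than search is a deliberate choice in this sketch: a value such as "INV-1001x" is rejected instead of being accepted on a partial match.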

5. Results Storage

Extracted data is linked to the document’s XML structure.
Both the raw OCR text and extracted fields are stored for downstream modules (Review, Index, Export).
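
One way to keep the raw OCR text and the extracted fields together is to append the fields to the same XML tree. The sketch below assumes a hypothetical layout (a document root with a text element, and an added fields element); the attach_fields helper and these element names are not taken from DocView Capture.

```python
import xml.etree.ElementTree as ET


def attach_fields(ocr_xml, fields):
    """Append extracted fields to the OCR XML, leaving the raw text untouched."""
    root = ET.fromstring(ocr_xml)
    container = ET.SubElement(root, "fields")  # hypothetical element name
    for name, value in fields.items():
        field = ET.SubElement(container, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


raw = "<document><text>Invoice INV-1001</text></document>"
print(attach_fields(raw, {"Invoice Number": "INV-1001"}))
```

Because the original text element is left unchanged, downstream modules can read either the raw text or the field values from the same file.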

Key Operations Seen in Log

Start OCR Documents – 1.xml, 3.xml: OCR process begins, generating text output for each page.
OCR Documents Generated – OCR completed successfully, output saved in structured XML.
Start Recognition – 1.xml, 3.xml: Data extraction begins, pulling field values from OCR results.

Home Tab

Start OCR – Run OCR on selected batch.
Stop Process – Halt ongoing OCR session if needed.
Reprocess – Rerun OCR on selected documents for better accuracy.

View Tab

OCR Output Viewer – Preview recognised text.
Error Log – Shows OCR/recognition errors.

Settings Tab

OCR Engine Selection – Choose the OCR engine (e.g., Tesseract or ABBYY, depending on the environment).
Language Packs – Configure language-specific OCR.
Extraction Rules – Define field rules, regex, and lookup references.

Status Indicators

Processing – OCR in progress.
Completed – OCR/extraction finished for the document.
Failed – OCR was unable to process the document (for example, a low-quality scan or an unsupported format).
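
For scripts that consume these statuses, the three states could be modelled as an enumeration. This is a sketch under assumptions: the DocumentStatus class and both helper functions are hypothetical, and only the three status labels come from this section.

```python
from enum import Enum


class DocumentStatus(Enum):
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"


def summarize(statuses):
    """Count how many documents are in each status."""
    counts = {status: 0 for status in DocumentStatus}
    for status in statuses.values():
        counts[status] += 1
    return counts


def needs_reprocess(statuses):
    """Return the failed documents, i.e. candidates for Reprocess."""
    return sorted(doc for doc, status in statuses.items()
                  if status is DocumentStatus.FAILED)


batch = {
    "1.xml": DocumentStatus.COMPLETED,
    "3.xml": DocumentStatus.FAILED,
}
print(needs_reprocess(batch))  # ['3.xml']
```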

[Screenshot: OCR / Extraction Module]