
How to Write Regular Expressions: A Practical Guide

Regular expressions (often shortened to regex) are powerful patterns used to search, validate, extract, or manipulate text. They are widely used in document processing, OCR workflows, barcode validation, metadata extraction, and automation tools such as OptimiDoc, where regex patterns often appear in OCR zones, output filenames, and redaction rules.

This article introduces the essentials of writing regex patterns, with examples you can reuse.


1. What Is a Regular Expression?

A regular expression is a pattern that defines a set of matching strings.
You can use regex to:

  • validate content (e.g., check if a barcode follows a structure)

  • extract key information from a text or OCR zone

  • manipulate filenames or paths

  • define redaction rules

  • split and route documents based on patterns (e.g., starting a new document when the barcode value changes).


2. Essential Building Blocks

2.1. Literal Characters

Anything typed normally is matched literally.

ABC

Matches the sequence “ABC” anywhere in the text.
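As a minimal sketch in Python, using the standard re module and a made-up sample text, a literal pattern can be checked like this. Literal matching is case-sensitive by default.

```python
import re

# Hypothetical sample text, not taken from a real document.
text = "Invoice ABC-2024 received"

# re.search scans the whole string, so the literal may match anywhere in it.
match = re.search(r"ABC", text)
found = match is not None
position = match.start() if match else -1

# Literal matching is case-sensitive unless a flag such as re.IGNORECASE is set.
lowercase_found = re.search(r"abc", text) is not None

print(found, position, lowercase_found)
```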


2.2. Character Classes

Character classes let you match one out of many possible characters.

Pattern         Meaning
[A-Z]           Any uppercase letter
[a-z]           Any lowercase letter
[0-9] or \d     Any digit
[A-Za-z0-9]     Any alphanumeric character
[^0-9]          Any character except a digit

Example:

[A-Za-z]{4,12}

Matches a run of 4 to 12 letters. Patterns like this are common in OCR and output processing.
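The following Python sketch (sample strings are invented for illustration) shows what this class-plus-quantifier pattern picks up. It also shows a side effect worth knowing: without a boundary, a longer word is matched in pieces rather than rejected.

```python
import re

LETTERS = re.compile(r"[A-Za-z]{4,12}")

# Hypothetical sample; "Due" and "at" are too short to match.
sample = "Due 22 April 2023 at noon"
words = LETTERS.findall(sample)

# A 20-letter word is split into a 12-letter match and an 8-letter match.
pieces = LETTERS.findall("Internationalization")

print(words)
print(pieces)
```

If longer words should be rejected instead of split, wrapping the pattern in word boundaries (\b) is one option.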


2.3. Quantifiers

Quantifiers define how many times something must occur.

Pattern     Meaning
?           0 or 1 occurrence
*           0 or more occurrences
+           1 or more occurrences
{n}         Exactly n times
{n,}        n or more times
{n,m}       Between n and m times

Example:

\d{2}\s[A-Za-z]{4,12}\s20\d\d

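Matches a date such as “22 April 2023”: two digits, a whitespace character, a word of 4 to 12 letters, another whitespace character, and a year from 2000 to 2099. Note that a three-letter month such as “May” does not match; lower the minimum to 3 if such dates can occur.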
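To see how the quantifiers constrain the match, here is a small Python sketch with invented sample strings. It is only an illustration of the pattern above, not a complete date validator.

```python
import re

DATE = re.compile(r"\d{2}\s[A-Za-z]{4,12}\s20\d\d")

# Hypothetical samples: one well-formed date and two that break a quantifier or literal.
samples = ["22 April 2023", "3 May 2023", "22 April 1999"]
results = {s: bool(DATE.search(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

The second sample fails because it has a one-digit day and a three-letter month; the third fails because the year does not start with 20.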


2.4. Anchors

Anchors match positions, not characters.

Anchor      Meaning
^           Start of string
$           End of string
\b          Word boundary

Example:

^[A-Z]{3}\d{4}$

Matches a whole string such as ABC1234.
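The effect of anchors is easiest to see by comparing the same pattern with and without them. The Python sketch below uses an invented sample line; the names ANCHORED, UNANCHORED, and BOUNDED are chosen here for illustration.

```python
import re

ANCHORED = re.compile(r"^[A-Z]{3}\d{4}$")
UNANCHORED = re.compile(r"[A-Z]{3}\d{4}")
BOUNDED = re.compile(r"\b[A-Z]{3}\d{4}\b")

# Hypothetical sample line containing one over-long token and one clean token.
text = "Ref XABC12345 and DEF5678"

# With ^ and $, the whole string must fit the pattern.
anchored_whole = ANCHORED.search("ABC1234") is not None
anchored_line = ANCHORED.search(text) is not None

# Without anchors, the pattern also matches inside a longer token.
unanchored_hits = UNANCHORED.findall(text)

# Word boundaries keep only tokens that stand on their own.
bounded_hits = BOUNDED.findall(text)

print(anchored_whole, anchored_line, unanchored_hits, bounded_hits)
```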


3. Writing Useful Regex Patterns

3.1. Validating a Barcode Format

Let’s say your barcode must contain:

  • 3 letters

  • a dash

  • 6 digits

Pattern:

^[A-Z]{3}-\d{6}$

This ensures that workflows accept only barcodes that follow the expected structure, which is important for separation and routing features.
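A minimal validation sketch in Python follows. The function name is_valid_barcode and the sample values are assumptions made for this example. In Python, $ also matches just before a trailing newline, so the sketch uses fullmatch, which requires the entire string to fit the pattern.

```python
import re

# Same structure as the pattern above; fullmatch makes ^ and $ unnecessary.
BARCODE = re.compile(r"[A-Z]{3}-\d{6}")


def is_valid_barcode(value: str) -> bool:
    """Return True only if the whole value is 3 uppercase letters, a dash, and 6 digits."""
    return BARCODE.fullmatch(value) is not None


# Hypothetical barcode values.
for candidate in ["INV-004217", "inv-004217", "INV-4217", "INV-004217\n"]:
    print(repr(candidate), is_valid_barcode(candidate))
```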


3.2. Extracting a Year from an OCR Zone

Given text like:

22 April 2024

Pattern to capture the year:

\d{2}\s[A-Za-z]{4,12}\s(20\d\d)

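Return group 1 = the year. The pattern contains only one pair of parentheses, so the year is the first (and only) capture group.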
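In Python, the capture can be read back as sketched below. The zone text is an invented example. Because the pattern has a single pair of parentheses, the year is in group 1, and group 0 is the whole match.

```python
import re

DATE_WITH_YEAR = re.compile(r"\d{2}\s[A-Za-z]{4,12}\s(20\d\d)")

# Hypothetical OCR zone text.
zone_text = "Date of issue: 22 April 2024"

match = DATE_WITH_YEAR.search(zone_text)
whole_date = match.group(0) if match else None  # the full matched date
year = match.group(1) if match else None        # the first capture group

print(whole_date, year)
```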


3.3. Splitting Documents When Barcode Changes

Even though some systems may not support barcode-change splitting directly, you can still use regex to capture the barcode value itself.

Example to capture a full barcode value:

([A-Z0-9]+)

Combine this with workflow logic to compare previous vs. new values.
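The comparison logic might look like the following Python sketch. It is an assumption about how such a workflow could be written, not a description of any product feature: the function split_on_barcode_change and the per-page input list are hypothetical, and pages without a readable barcode are simply kept with the current document.

```python
import re
from typing import Iterable, List, Optional

BARCODE = re.compile(r"([A-Z0-9]+)")


def split_on_barcode_change(pages: Iterable[Optional[str]]) -> List[List[int]]:
    """Group page indexes into documents, starting a new one when the barcode value changes."""
    documents: List[List[int]] = []
    previous: Optional[str] = None
    for index, raw in enumerate(pages):
        match = BARCODE.fullmatch(raw.strip()) if raw else None
        value = match.group(1) if match else None
        # A new, valid barcode value opens a new document.
        if value is not None and value != previous:
            documents.append([])
            previous = value
        # Pages before the first barcode still need a document to live in.
        if not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


# Hypothetical barcode readings, one entry per page (None or "" = nothing read).
groups = split_on_barcode_change(["A100", None, "A100", "B200", "", "C300"])
print(groups)
```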


3.4. Creating Dictionaries with Regex

Some OCR engines (e.g., ABBYY) allow custom dictionaries driven by regex patterns to limit the acceptable character set, which can improve OCR accuracy.

Example dictionary regex:

[A-Z0-9\-]{8,12}
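
The exact dictionary syntax depends on the OCR engine and may differ from standard regex. As a rough illustration only, the Python sketch below applies the same pattern to invented candidate values to show which ones the character set and length limits would accept. It does not use any OCR engine API.

```python
import re

ALLOWED = re.compile(r"[A-Z0-9\-]{8,12}")

# Hypothetical recognition candidates.
candidates = ["AB-1234-CD", "ab-1234-cd", "AB12", "AB-1234-CD-5678"]

# Keep only values made of uppercase letters, digits, and dashes, 8 to 12 characters long.
accepted = [c for c in candidates if ALLOWED.fullmatch(c)]

print(accepted)
```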

4. Practical Tips for Writing Regex

✔ Start simple

Begin with literal text and add complexity incrementally.

✔ Test continuously

Use tools like http://regex101.com or integrated validators.

✔ Escape properly

Many systems require double escaping, i.e. \\d instead of \d, because the configuration format itself treats the backslash as an escape character. OptimiDoc examples show doubled backslashes inside configuration fields.
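
The Python sketch below illustrates why the doubled backslash appears. The JSON-style configuration text and the key year_pattern are assumptions made for this example and are not taken from any product's configuration format.

```python
import json
import re

# In Python source, a raw string and a doubled backslash produce the same pattern.
raw_pattern = r"\d{4}"
escaped_pattern = "\\d{4}"
same_pattern = raw_pattern == escaped_pattern

# Hypothetical config text: the file contains \\d, which the JSON parser turns into \d.
config_text = r'{"year_pattern": "\\d{4}"}'
loaded_pattern = json.loads(config_text)["year_pattern"]

year = re.search(loaded_pattern, "Issued 2024").group(0)

print(same_pattern, loaded_pattern, year)
```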

✔ Document your patterns

This matters most for patterns used in OCR zones, redactions, or multi-step workflows, which OptimiDoc documentation highlights as areas where regex is used.

✔ Keep performance in mind

Some engines slow down with overly complex patterns or with multiple barcode types active at once.
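
One common source of slow patterns on backtracking engines is a quantifier nested inside another quantifier. The Python sketch below, using short invented samples, compares a nested pattern with a flat pattern that accepts the same strings. Both are compiled once so they can be reused across many values.

```python
import re

# Nested quantifiers can backtrack heavily on long inputs that fail to match.
NESTED = re.compile(r"(?:[A-Z0-9]+)+")
# The flat form accepts the same strings with far less backtracking.
FLAT = re.compile(r"[A-Z0-9]+")

# Hypothetical samples, kept short so the nested pattern stays fast here.
samples = ["ABC123", "ABC123!", "", "A-1"]
agree = all(
    bool(NESTED.fullmatch(s)) == bool(FLAT.fullmatch(s)) for s in samples
)

print(agree)
```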


5. Summary

Regular expressions are an essential tool for:

  • document automation

  • barcode validation

  • OCR post‑processing

  • metadata extraction

  • file naming and routing

  • text redaction