Back

How Do Regular Expressions Work? (Backtracking and Performance)

Every developer has likely used Regular Expressions (Regex) at some point for email validation or string parsing.

While it is a powerful and convenient tool, did you know that a single poorly written regex can bring down an entire server?

In this article, we will explore the internal mechanics of regex engines and discuss Backtracking, the primary culprit behind performance issues.

1. Types of Regex Engines

Regex engines generally fall into two categories:

  1. DFA (Deterministic Finite Automaton): Text-directed and fast, but lacks support for advanced features like backreferences. (e.g., awk, lex)
  2. NFA (Nondeterministic Finite Automaton): Pattern-directed and feature-rich, but can be extremely slow in worst-case scenarios. (e.g., Java, Python, Ruby, JavaScript, and most modern languages)

Most languages we use employ NFA engines. And the core mechanism of NFA is Backtracking.

2. What is Backtracking?

Backtracking is the process where the regex engine, upon failing to find a match, "goes back to try a different path."

For example, let's match the pattern /A.*B/ against the string A...BC.

  1. Matches A. (Success)
  2. .* is greedy, so it consumes the rest of the string (...BC) until the end.
  3. Now it tries to match B, but it has reached the end of the string, so it fails.
  4. Backtracking occurs: The engine makes .* give up (backtrack) the last character (C) and tries again.
  5. Still not B, so it backtracks again.
  6. This repeats until it finds B.

3. The Catastrophe: ReDoS

If the pattern is complex and the string is long, the number of backtracking steps can increase exponentially.

This phenomenon, where processing time explodes for specific inputs due to patterns like (a+)+ (Nested Quantifiers), is called ReDoS (Regular Expression Denial of Service).

An attacker can exploit this by sending malicious strings to your server, causing CPU usage to spike to 100% and causing a denial of service.

Tips to Prevent ReDoS

  1. Avoid Nested Quantifiers: Use a+ instead of (a+)+.
  2. Use Lazy Quantifiers: Use .*? instead of .* to match only as much as needed.
  3. Be Specific: Instead of .* (any character), use restricted ranges like [^"]* (any character except a quote).

Conclusion

Regular Expressions are a double-edged sword.

Used correctly, they are magic that turns dozens of lines of code into one. Used blindly, they can be a ticking time bomb.

When writing complex regex, always consider the potential for backtracking and make it a habit to perform performance testing.

TechRegexPerformanceAlgorithm

Explore Related Tools

Try these free developer tools from Pockit