Blame third_party/rust/regex/HACKING.md

Packit f0b94e
Your friendly guide to hacking and navigating the regex library.
Packit f0b94e
Packit f0b94e
This guide assumes familiarity with Rust and Cargo, and at least a perusal of
Packit f0b94e
the user facing documentation for this crate.
Packit f0b94e
Packit f0b94e
If you're looking for background on the implementation in this library, then
Packit f0b94e
you can do no better than Russ Cox's article series on implementing regular
Packit f0b94e
expressions using finite automata: https://swtch.com/~rsc/regexp/
Packit f0b94e
Packit f0b94e
Packit f0b94e
## Architecture overview
Packit f0b94e
Packit f0b94e
As you probably already know, this library executes regular expressions using
Packit f0b94e
finite automata. In particular, a design goal is to make searching linear
Packit f0b94e
with respect to both the regular expression and the text being searched.
Packit f0b94e
Meeting that design goal on its own is not so hard and can be done with an
Packit f0b94e
implementation of the Pike VM (similar to Thompson's construction, but supports
Packit f0b94e
capturing groups), as described in: https://swtch.com/~rsc/regexp/regexp2.html
Packit f0b94e
--- This library contains such an implementation in src/pikevm.rs.
Packit f0b94e
Packit f0b94e
Making it fast is harder. One of the key problems with the Pike VM is that it
Packit f0b94e
can be in more than one state at any point in time, and must shuffle capture
Packit f0b94e
positions between them. The Pike VM also spends a lot of time following the
Packit f0b94e
same epsilon transitions over and over again. We can employ one trick to
Packit f0b94e
speed up the Pike VM: extract one or more literal prefixes from the regular
Packit f0b94e
expression and execute specialized code to quickly find matches of those
Packit f0b94e
prefixes in the search text. The Pike VM can then be avoided for most the
Packit f0b94e
search, and instead only executed when a prefix is found. The code to find
Packit f0b94e
prefixes is in the regex-syntax crate (in this repository). The code to search
Packit f0b94e
for literals is in src/literals.rs. When more than one literal prefix is found,
Packit f0b94e
we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one
Packit f0b94e
literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and
Packit f0b94e
Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
Packit f0b94e
library also uses elementary frequency analysis to choose the right byte to run
Packit f0b94e
`memchr` with.
Packit f0b94e
Packit f0b94e
Of course, detecting prefix literals can only take us so far. Not all regular
Packit f0b94e
expressions have literal prefixes. To remedy this, we try another approach
Packit f0b94e
to executing the Pike VM: backtracking, whose implementation can be found in
Packit f0b94e
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
Packit f0b94e
excessive shuffling of capture groups. Of course, backtracking is susceptible
Packit f0b94e
to exponential runtimes, so we keep track of every state we've visited to make
Packit f0b94e
sure we never visit it again. This guarantees linear time execution, but we
Packit f0b94e
pay for it with the memory required to track visited states. Because of the
Packit f0b94e
memory requirement, we only use this engine on small search strings *and* small
Packit f0b94e
regular expressions.
Packit f0b94e
Packit f0b94e
Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
Packit f0b94e
It is distinct from the Pike VM in that the DFA is explicitly represented in
Packit f0b94e
memory and is only ever in one state at a time. It is said to be "lazy" because
Packit f0b94e
the DFA is computed as text is searched, where each byte in the search text
Packit f0b94e
results in at most one new DFA state. It is made fast by caching states. DFAs
Packit f0b94e
are susceptible to exponential state blow up (where the worst case is computing
Packit f0b94e
a new state for every input byte, regardless of what's in the state cache). To
Packit f0b94e
avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache
Packit f0b94e
is full, it is wiped and state computation starts over again. If the cache is
Packit f0b94e
wiped too frequently, then the DFA gives up and searching falls back to one of
Packit f0b94e
the aforementioned algorithms.
Packit f0b94e
Packit f0b94e
All of the above matching engines expose precisely the same matching semantics.
Packit f0b94e
This is indeed tested. (See the section below about testing.)
Packit f0b94e
Packit f0b94e
The following sub-sections describe the rest of the library and how each of the
Packit f0b94e
matching engines are actually used.
Packit f0b94e
Packit f0b94e
### Parsing
Packit f0b94e
Packit f0b94e
Regular expressions are parsed using the regex-syntax crate, which is
Packit f0b94e
maintained in this repository. The regex-syntax crate defines an abstract
Packit f0b94e
syntax and provides very detailed error messages when a parse error is
Packit f0b94e
encountered. Parsing is done in a separate crate so that others may benefit
Packit f0b94e
from its existence, and because it is relatively divorced from the rest of the
Packit f0b94e
regex library.
Packit f0b94e
Packit f0b94e
The regex-syntax crate also provides sophisticated support for extracting
Packit f0b94e
prefix and suffix literals from regular expressions.
Packit f0b94e
Packit f0b94e
### Compilation
Packit f0b94e
Packit f0b94e
The compiler is in src/compile.rs. The input to the compiler is some abstract
Packit f0b94e
syntax for a regular expression and the output is a sequence of opcodes that
Packit f0b94e
matching engines use to execute a search. (One can think of matching engines as
Packit f0b94e
mini virtual machines.) The sequence of opcodes is a particular encoding of a
Packit f0b94e
non-deterministic finite automaton. In particular, the opcodes explicitly rely
Packit f0b94e
on epsilon transitions.
Packit f0b94e
Packit f0b94e
Consider a simple regular expression like `a|b`. Its compiled form looks like
Packit f0b94e
this:
Packit f0b94e
Packit f0b94e
    000 Save(0)
Packit f0b94e
    001 Split(2, 3)
Packit f0b94e
    002 'a' (goto: 4)
Packit f0b94e
    003 'b'
Packit f0b94e
    004 Save(1)
Packit f0b94e
    005 Match
Packit f0b94e
Packit f0b94e
The first column is the instruction pointer and the second column is the
Packit f0b94e
instruction. Save instructions indicate that the current position in the input
Packit f0b94e
should be stored in a captured location. Split instructions represent a binary
Packit f0b94e
branch in the program (i.e., epsilon transitions). The instructions `'a'` and
Packit f0b94e
`'b'` indicate that the literal bytes `'a'` or `'b'` should match.
Packit f0b94e
Packit f0b94e
In older versions of this library, the compilation looked like this:
Packit f0b94e
Packit f0b94e
    000 Save(0)
Packit f0b94e
    001 Split(2, 3)
Packit f0b94e
    002 'a'
Packit f0b94e
    003 Jump(5)
Packit f0b94e
    004 'b'
Packit f0b94e
    005 Save(1)
Packit f0b94e
    006 Match
Packit f0b94e
Packit f0b94e
In particular, empty instructions that merely served to move execution from one
Packit f0b94e
point in the program to another were removed. Instead, every instruction has a
Packit f0b94e
`goto` pointer embedded into it. This resulted in a small performance boost for
Packit f0b94e
the Pike VM, because it was one fewer epsilon transition that it had to follow.
Packit f0b94e
Packit f0b94e
There exist more instructions and they are defined and documented in
Packit f0b94e
src/prog.rs.
Packit f0b94e
Packit f0b94e
Compilation has several knobs and a few unfortunately complicated invariants.
Packit f0b94e
Namely, the output of compilation can be one of two types of programs: a
Packit f0b94e
program that executes on Unicode scalar values or a program that executes
Packit f0b94e
on raw bytes. In the former case, the matching engine is responsible for
Packit f0b94e
performing UTF-8 decoding and executing instructions using Unicode codepoints.
Packit f0b94e
In the latter case, the program handles UTF-8 decoding implicitly, so that the
Packit f0b94e
matching engine can execute on raw bytes. All matching engines can execute
Packit f0b94e
either Unicode or byte based programs except for the lazy DFA, which requires
Packit f0b94e
byte based programs. In general, both representations were kept because (1) the
Packit f0b94e
lazy DFA requires byte based programs so that states can be encoded in a memory
Packit f0b94e
efficient manner and (2) the Pike VM benefits greatly from inlining Unicode
Packit f0b94e
character classes into fewer instructions as it results in fewer epsilon
Packit f0b94e
transitions.
Packit f0b94e
Packit f0b94e
N.B. UTF-8 decoding is built into the compiled program by making use of the
Packit f0b94e
utf8-ranges crate. The compiler in this library factors out common suffixes to
Packit f0b94e
reduce the size of huge character classes (e.g., `\pL`).
Packit f0b94e
Packit f0b94e
A regrettable consequence of this split in instruction sets is we generally
Packit f0b94e
need to compile two programs; one for NFA execution and one for the lazy DFA.
Packit f0b94e
Packit f0b94e
In fact, it is worse than that: the lazy DFA is not capable of finding the
Packit f0b94e
starting location of a match in a single scan, and must instead execute a
Packit f0b94e
backwards search after finding the end location. To execute a backwards search,
Packit f0b94e
we must have compiled the regular expression *in reverse*.
Packit f0b94e
Packit f0b94e
This means that every compilation of a regular expression generally results in
Packit f0b94e
three distinct programs. It would be possible to lazily compile the Unicode
Packit f0b94e
program, since it is never needed if (1) the regular expression uses no word
Packit f0b94e
boundary assertions and (2) the caller never asks for sub-capture locations.
Packit f0b94e
Packit f0b94e
### Execution
Packit f0b94e
Packit f0b94e
At the time of writing, there are four matching engines in this library:
Packit f0b94e
Packit f0b94e
1. The Pike VM (supports captures).
Packit f0b94e
2. Bounded backtracking (supports captures).
Packit f0b94e
3. Literal substring or multi-substring search.
Packit f0b94e
4. Lazy DFA (no support for Unicode word boundary assertions).
Packit f0b94e
Packit f0b94e
Only the first two matching engines are capable of executing every regular
Packit f0b94e
expression program. They also happen to be the slowest, which means we need
Packit f0b94e
some logic that (1) knows various facts about the regular expression and (2)
Packit f0b94e
knows what the caller wants. Using this information, we can determine which
Packit f0b94e
engine (or engines) to use.
Packit f0b94e
Packit f0b94e
The logic for choosing which engine to execute is in src/exec.rs and is
Packit f0b94e
documented on the Exec type. Exec values contain regular expression Programs
Packit f0b94e
(defined in src/prog.rs), which contain all the necessary tidbits for actually
Packit f0b94e
executing a regular expression on search text.
Packit f0b94e
Packit f0b94e
For the most part, the execution logic is straight-forward and follows the
Packit f0b94e
limitations of each engine described above pretty faithfully. The hairiest
Packit f0b94e
part of src/exec.rs by far is the execution of the lazy DFA, since it requires
Packit f0b94e
a forwards and backwards search, and then falls back to either the Pike VM or
Packit f0b94e
backtracking if the caller requested capture locations.
Packit f0b94e
Packit f0b94e
The Exec type also contains mutable scratch space for each type of matching
Packit f0b94e
engine. This scratch space is used during search (for example, for the lazy
Packit f0b94e
DFA, it contains compiled states that are reused on subsequent searches).
Packit f0b94e
Packit f0b94e
### Programs
Packit f0b94e
Packit f0b94e
A regular expression program is essentially a sequence of opcodes produced by
Packit f0b94e
the compiler plus various facts about the regular expression (such as whether
Packit f0b94e
it is anchored, its capture names, etc.).
Packit f0b94e
Packit f0b94e
### The regex! macro
Packit f0b94e
Packit f0b94e
The `regex!` macro no longer exists. It was developed in a bygone era as a
Packit f0b94e
compiler plugin during the infancy of the regex crate. Back then, then only
Packit f0b94e
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
Packit f0b94e
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
Packit f0b94e
was built at runtime were the following:
Packit f0b94e
Packit f0b94e
  1. Syntax checking was done at compile time. Your Rust program wouldn't
Packit f0b94e
     compile if your regex didn't compile.
Packit f0b94e
  2. Reduction of overhead that was proportional to the size of the regex.
Packit f0b94e
     For the most part, this overhead consisted of heap allocation, which
Packit f0b94e
     was nearly eliminated in the compiler plugin.
Packit f0b94e
Packit f0b94e
The main takeaway here is that the compiler plugin was a marginally faster
Packit f0b94e
version of a slow regex engine. As the regex crate evolved, it grew other regex
Packit f0b94e
engines (DFA, bounded backtracker) and sophisticated literal optimizations.
Packit f0b94e
The regex macro didn't keep pace, and it therefore became (dramatically) slower
Packit f0b94e
than the dynamic engines. The only reason left to use it was for the compile
Packit f0b94e
time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
Packit f0b94e
tool) has a lint that checks your regular expression validity, which mostly
Packit f0b94e
replaces that use case.
Packit f0b94e
Packit f0b94e
Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
Packit f0b94e
complained. At that point, it seemed prudent to just remove it.
Packit f0b94e
Packit f0b94e
Will a compiler plugin be brought back? The future is murky, but there is
Packit f0b94e
definitely an opportunity there to build something that is faster than the
Packit f0b94e
dynamic engines in some cases. But it will be challenging! As of now, there
Packit f0b94e
are no plans to work on this.
Packit f0b94e
Packit f0b94e
Packit f0b94e
## Testing
Packit f0b94e
Packit f0b94e
A key aspect of any mature regex library is its test suite. A subset of the
Packit f0b94e
tests in this library come from Glenn Fowler's AT&T test suite (its online
Packit f0b94e
presence seems gone at the time of writing). The source of the test suite is
Packit f0b94e
located in src/testdata. The scripts/regex-match-tests.py takes the test suite
Packit f0b94e
in src/testdata and generates tests/matches.rs.
Packit f0b94e
Packit f0b94e
There are also many other manually crafted tests and regression tests in
Packit f0b94e
tests/tests.rs. Some of these tests were taken from RE2.
Packit f0b94e
Packit f0b94e
The biggest source of complexity in the tests is related to answering this
Packit f0b94e
question: how can we reuse the tests to check all of our matching engines? One
Packit f0b94e
approach would have been to encode every test into some kind of format (like
Packit f0b94e
the AT&T test suite) and code generate tests for each matching engine. The
Packit f0b94e
approach we use in this library is to create a Cargo.toml entry point for each
Packit f0b94e
matching engine we want to test. The entry points are:
Packit f0b94e
Packit f0b94e
* `tests/test_default.rs` - tests `Regex::new`
Packit f0b94e
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
Packit f0b94e
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
Packit f0b94e
  algorithm on every regex.
Packit f0b94e
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
Packit f0b94e
  algorithm on every regex and use *arbitrary* byte based programs.
Packit f0b94e
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
Packit f0b94e
  algorithm on every regex and use *UTF-8* byte based programs.
Packit f0b94e
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
Packit f0b94e
  backtracking on every regex.
Packit f0b94e
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
Packit f0b94e
  backtracking on every regex and use *arbitrary* byte based programs.
Packit f0b94e
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
Packit f0b94e
  backtracking on every regex and use *UTF-8* byte based programs.
Packit f0b94e
* `tests/test_crates_regex.rs` - tests to make sure that all of the
Packit f0b94e
  backends behave in the same way against a number of quickcheck
Packit f0b94e
  generated random inputs. These tests need to be enabled through
Packit f0b94e
  the `RUST_REGEX_RANDOM_TEST` environment variable (see
Packit f0b94e
  below).
Packit f0b94e
Packit f0b94e
The lazy DFA and pure literal engines are absent from this list because
Packit f0b94e
they cannot be used on every regular expression. Instead, we rely on
Packit f0b94e
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.
Packit f0b94e
Packit f0b94e
Since the tests are repeated several times, and because `cargo test` runs all
Packit f0b94e
entry points, it can take a while to compile everything. To reduce compile
Packit f0b94e
times slightly, try using `cargo test --test default`, which will only use the
Packit f0b94e
`tests/test_default.rs` entry point.
Packit f0b94e
Packit f0b94e
The random testing takes quite a while, so it is not enabled by default.
Packit f0b94e
In order to run the random testing you can set the
Packit f0b94e
`RUST_REGEX_RANDOM_TEST` environment variable to anything before
Packit f0b94e
invoking `cargo test`. Note that this variable is inspected at compile
Packit f0b94e
time, so if the tests don't seem to be running, you may need to run
Packit f0b94e
`cargo clean`.
Packit f0b94e
Packit f0b94e
## Benchmarking
Packit f0b94e
Packit f0b94e
The benchmarking in this crate is made up of many micro-benchmarks. Currently,
Packit f0b94e
there are two primary sets of benchmarks: the benchmarks that were adopted
Packit f0b94e
at this library's inception (in `bench/src/misc.rs`) and a newer set of
Packit f0b94e
benchmarks meant to test various optimizations. Specifically, the latter set
Packit f0b94e
contain some analysis and are in `bench/src/sherlock.rs`. Also, the latter
Packit f0b94e
set are all executed on the same lengthy input whereas the former benchmarks
Packit f0b94e
are executed on strings of varying length.
Packit f0b94e
Packit f0b94e
There is also a smattering of benchmarks for parsing and compilation.
Packit f0b94e
Packit f0b94e
Benchmarks are in a separate crate so that its dependencies can be managed
Packit f0b94e
separately from the main regex crate.
Packit f0b94e
Packit f0b94e
Benchmarking follows a similarly wonky setup as tests. There are multiple entry
Packit f0b94e
points:
Packit f0b94e
Packit f0b94e
* `bench_rust.rs` - benchmarks `Regex::new`
Packit f0b94e
* `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
Packit f0b94e
* `bench_pcre.rs` - benchmarks PCRE
Packit f0b94e
* `bench_onig.rs` - benchmarks Oniguruma
Packit f0b94e
Packit f0b94e
The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
Packit f0b94e
regular expression library. In general, this regex library compares favorably
Packit f0b94e
(there are even a few benchmarks that PCRE simply runs too slowly on or
Packit f0b94e
outright can't execute at all). I would love to add other regular expression
Packit f0b94e
library benchmarks (especially RE2).
Packit f0b94e
Packit f0b94e
If you're hacking on one of the matching engines and just want to see
Packit f0b94e
benchmarks, then all you need to run is:
Packit f0b94e
Packit f0b94e
    $ ./bench/run rust
Packit f0b94e
Packit f0b94e
If you want to compare your results with older benchmarks, then try:
Packit f0b94e
Packit f0b94e
    $ ./bench/run rust | tee old
Packit f0b94e
    $ ... make it faster
Packit f0b94e
    $ ./bench/run rust | tee new
Packit f0b94e
    $ cargo benchcmp old new --improvements
Packit f0b94e
Packit f0b94e
The `cargo-benchcmp` utility is available here:
Packit f0b94e
https://github.com/BurntSushi/cargo-benchcmp
Packit f0b94e
Packit f0b94e
The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
Packit f0b94e
`./bench/bench --help`.
Packit f0b94e
Packit f0b94e
## Dev Docs
Packit f0b94e
Packit f0b94e
When digging your teeth into the codebase for the first time, the
Packit f0b94e
crate documentation can be a great resource. By default `rustdoc`
Packit f0b94e
will strip out all documentation of private crate members in an
Packit f0b94e
effort to help consumers of the crate focus on the *interface*
Packit f0b94e
without having to concern themselves with the *implementation*.
Packit f0b94e
Normally this is a great thing, but if you want to start hacking
Packit f0b94e
on regex internals it is not what you want. Many of the private members
Packit f0b94e
of this crate are well documented with rustdoc style comments, and
Packit f0b94e
it would be a shame to miss out on the opportunity that presents.
Packit f0b94e
You can generate the private docs with:
Packit f0b94e
Packit f0b94e
```
Packit f0b94e
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
Packit f0b94e
```
Packit f0b94e
Packit f0b94e
Then just point your browser at `target/doc/regex/index.html`.
Packit f0b94e
Packit f0b94e
See https://github.com/rust-lang/rust/issues/15347 for more info
Packit f0b94e
about generating developer docs for internal use.