Tree - source-git/mozjs60 - CentOS Git server

source-git / mozjs60

Blame third_party/rust/regex/HACKING.md

Blob History Raw

Packit	f0b94e	`Your friendly guide to hacking and navigating the regex library.`
Packit	f0b94e
Packit	f0b94e	`This guide assumes familiarity with Rust and Cargo, and at least a perusal of`
Packit	f0b94e	`the user facing documentation for this crate.`
Packit	f0b94e
Packit	f0b94e	`If you're looking for background on the implementation in this library, then`
Packit	f0b94e	`you can do no better than Russ Cox's article series on implementing regular`
Packit	f0b94e	`expressions using finite automata: https://swtch.com/~rsc/regexp/`
Packit	f0b94e
Packit	f0b94e
Packit	f0b94e	`## Architecture overview`
Packit	f0b94e
Packit	f0b94e	`As you probably already know, this library executes regular expressions using`
Packit	f0b94e	`finite automata. In particular, a design goal is to make searching linear`
Packit	f0b94e	`with respect to both the regular expression and the text being searched.`
Packit	f0b94e	`Meeting that design goal on its own is not so hard and can be done with an`
Packit	f0b94e	`implementation of the Pike VM (similar to Thompson's construction, but supports`
Packit	f0b94e	`capturing groups), as described in: https://swtch.com/~rsc/regexp/regexp2.html`
Packit	f0b94e	`--- This library contains such an implementation in src/pikevm.rs.`
Packit	f0b94e
Packit	f0b94e	`Making it fast is harder. One of the key problems with the Pike VM is that it`
Packit	f0b94e	`can be in more than one state at any point in time, and must shuffle capture`
Packit	f0b94e	`positions between them. The Pike VM also spends a lot of time following the`
Packit	f0b94e	`same epsilon transitions over and over again. We can employ one trick to`
Packit	f0b94e	`speed up the Pike VM: extract one or more literal prefixes from the regular`
Packit	f0b94e	`expression and execute specialized code to quickly find matches of those`
Packit	f0b94e	`prefixes in the search text. The Pike VM can then be avoided for most the`
Packit	f0b94e	`search, and instead only executed when a prefix is found. The code to find`
Packit	f0b94e	`prefixes is in the regex-syntax crate (in this repository). The code to search`
Packit	f0b94e	`for literals is in src/literals.rs. When more than one literal prefix is found,`
Packit	f0b94e	`we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one`
Packit	f0b94e	`literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and`
Packit	f0b94e	Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
Packit	f0b94e	`library also uses elementary frequency analysis to choose the right byte to run`
Packit	f0b94e	`memchr` with.
Packit	f0b94e
Packit	f0b94e	`Of course, detecting prefix literals can only take us so far. Not all regular`
Packit	f0b94e	`expressions have literal prefixes. To remedy this, we try another approach`
Packit	f0b94e	`to executing the Pike VM: backtracking, whose implementation can be found in`
Packit	f0b94e	`src/backtrack.rs. One reason why backtracking can be faster is that it avoids`
Packit	f0b94e	`excessive shuffling of capture groups. Of course, backtracking is susceptible`
Packit	f0b94e	`to exponential runtimes, so we keep track of every state we've visited to make`
Packit	f0b94e	`sure we never visit it again. This guarantees linear time execution, but we`
Packit	f0b94e	`pay for it with the memory required to track visited states. Because of the`
Packit	f0b94e	`memory requirement, we only use this engine on small search strings and small`
Packit	f0b94e	`regular expressions.`
Packit	f0b94e
Packit	f0b94e	`Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.`
Packit	f0b94e	`It is distinct from the Pike VM in that the DFA is explicitly represented in`
Packit	f0b94e	`memory and is only ever in one state at a time. It is said to be "lazy" because`
Packit	f0b94e	`the DFA is computed as text is searched, where each byte in the search text`
Packit	f0b94e	`results in at most one new DFA state. It is made fast by caching states. DFAs`
Packit	f0b94e	`are susceptible to exponential state blow up (where the worst case is computing`
Packit	f0b94e	`a new state for every input byte, regardless of what's in the state cache). To`
Packit	f0b94e	`avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache`
Packit	f0b94e	`is full, it is wiped and state computation starts over again. If the cache is`
Packit	f0b94e	`wiped too frequently, then the DFA gives up and searching falls back to one of`
Packit	f0b94e	`the aforementioned algorithms.`
Packit	f0b94e
Packit	f0b94e	`All of the above matching engines expose precisely the same matching semantics.`
Packit	f0b94e	`This is indeed tested. (See the section below about testing.)`
Packit	f0b94e
Packit	f0b94e	`The following sub-sections describe the rest of the library and how each of the`
Packit	f0b94e	`matching engines are actually used.`
Packit	f0b94e
Packit	f0b94e	`### Parsing`
Packit	f0b94e
Packit	f0b94e	`Regular expressions are parsed using the regex-syntax crate, which is`
Packit	f0b94e	`maintained in this repository. The regex-syntax crate defines an abstract`
Packit	f0b94e	`syntax and provides very detailed error messages when a parse error is`
Packit	f0b94e	`encountered. Parsing is done in a separate crate so that others may benefit`
Packit	f0b94e	`from its existence, and because it is relatively divorced from the rest of the`
Packit	f0b94e	`regex library.`
Packit	f0b94e
Packit	f0b94e	`The regex-syntax crate also provides sophisticated support for extracting`
Packit	f0b94e	`prefix and suffix literals from regular expressions.`
Packit	f0b94e
Packit	f0b94e	`### Compilation`
Packit	f0b94e
Packit	f0b94e	`The compiler is in src/compile.rs. The input to the compiler is some abstract`
Packit	f0b94e	`syntax for a regular expression and the output is a sequence of opcodes that`
Packit	f0b94e	`matching engines use to execute a search. (One can think of matching engines as`
Packit	f0b94e	`mini virtual machines.) The sequence of opcodes is a particular encoding of a`
Packit	f0b94e	`non-deterministic finite automaton. In particular, the opcodes explicitly rely`
Packit	f0b94e	`on epsilon transitions.`
Packit	f0b94e
Packit	f0b94e	Consider a simple regular expression like `a\|b`. Its compiled form looks like
Packit	f0b94e	`this:`
Packit	f0b94e
Packit	f0b94e	`000 Save(0)`
Packit	f0b94e	`001 Split(2, 3)`
Packit	f0b94e	`002 'a' (goto: 4)`
Packit	f0b94e	`003 'b'`
Packit	f0b94e	`004 Save(1)`
Packit	f0b94e	`005 Match`
Packit	f0b94e
Packit	f0b94e	`The first column is the instruction pointer and the second column is the`
Packit	f0b94e	`instruction. Save instructions indicate that the current position in the input`
Packit	f0b94e	`should be stored in a captured location. Split instructions represent a binary`
Packit	f0b94e	branch in the program (i.e., epsilon transitions). The instructions `'a'` and
Packit	f0b94e	`'b'` indicate that the literal bytes `'a'` or `'b'` should match.
Packit	f0b94e
Packit	f0b94e	`In older versions of this library, the compilation looked like this:`
Packit	f0b94e
Packit	f0b94e	`000 Save(0)`
Packit	f0b94e	`001 Split(2, 3)`
Packit	f0b94e	`002 'a'`
Packit	f0b94e	`003 Jump(5)`
Packit	f0b94e	`004 'b'`
Packit	f0b94e	`005 Save(1)`
Packit	f0b94e	`006 Match`
Packit	f0b94e
Packit	f0b94e	`In particular, empty instructions that merely served to move execution from one`
Packit	f0b94e	`point in the program to another were removed. Instead, every instruction has a`
Packit	f0b94e	`goto` pointer embedded into it. This resulted in a small performance boost for
Packit	f0b94e	`the Pike VM, because it was one fewer epsilon transition that it had to follow.`
Packit	f0b94e
Packit	f0b94e	`There exist more instructions and they are defined and documented in`
Packit	f0b94e	`src/prog.rs.`
Packit	f0b94e
Packit	f0b94e	`Compilation has several knobs and a few unfortunately complicated invariants.`
Packit	f0b94e	`Namely, the output of compilation can be one of two types of programs: a`
Packit	f0b94e	`program that executes on Unicode scalar values or a program that executes`
Packit	f0b94e	`on raw bytes. In the former case, the matching engine is responsible for`
Packit	f0b94e	`performing UTF-8 decoding and executing instructions using Unicode codepoints.`
Packit	f0b94e	`In the latter case, the program handles UTF-8 decoding implicitly, so that the`
Packit	f0b94e	`matching engine can execute on raw bytes. All matching engines can execute`
Packit	f0b94e	`either Unicode or byte based programs except for the lazy DFA, which requires`
Packit	f0b94e	`byte based programs. In general, both representations were kept because (1) the`
Packit	f0b94e	`lazy DFA requires byte based programs so that states can be encoded in a memory`
Packit	f0b94e	`efficient manner and (2) the Pike VM benefits greatly from inlining Unicode`
Packit	f0b94e	`character classes into fewer instructions as it results in fewer epsilon`
Packit	f0b94e	`transitions.`
Packit	f0b94e
Packit	f0b94e	`N.B. UTF-8 decoding is built into the compiled program by making use of the`
Packit	f0b94e	`utf8-ranges crate. The compiler in this library factors out common suffixes to`
Packit	f0b94e	reduce the size of huge character classes (e.g., `\pL`).
Packit	f0b94e
Packit	f0b94e	`A regrettable consequence of this split in instruction sets is we generally`
Packit	f0b94e	`need to compile two programs; one for NFA execution and one for the lazy DFA.`
Packit	f0b94e
Packit	f0b94e	`In fact, it is worse than that: the lazy DFA is not capable of finding the`
Packit	f0b94e	`starting location of a match in a single scan, and must instead execute a`
Packit	f0b94e	`backwards search after finding the end location. To execute a backwards search,`
Packit	f0b94e	`we must have compiled the regular expression in reverse.`
Packit	f0b94e
Packit	f0b94e	`This means that every compilation of a regular expression generally results in`
Packit	f0b94e	`three distinct programs. It would be possible to lazily compile the Unicode`
Packit	f0b94e	`program, since it is never needed if (1) the regular expression uses no word`
Packit	f0b94e	`boundary assertions and (2) the caller never asks for sub-capture locations.`
Packit	f0b94e
Packit	f0b94e	`### Execution`
Packit	f0b94e
Packit	f0b94e	`At the time of writing, there are four matching engines in this library:`
Packit	f0b94e
Packit	f0b94e	`1. The Pike VM (supports captures).`
Packit	f0b94e	`2. Bounded backtracking (supports captures).`
Packit	f0b94e	`3. Literal substring or multi-substring search.`
Packit	f0b94e	`4. Lazy DFA (no support for Unicode word boundary assertions).`
Packit	f0b94e
Packit	f0b94e	`Only the first two matching engines are capable of executing every regular`
Packit	f0b94e	`expression program. They also happen to be the slowest, which means we need`
Packit	f0b94e	`some logic that (1) knows various facts about the regular expression and (2)`
Packit	f0b94e	`knows what the caller wants. Using this information, we can determine which`
Packit	f0b94e	`engine (or engines) to use.`
Packit	f0b94e
Packit	f0b94e	`The logic for choosing which engine to execute is in src/exec.rs and is`
Packit	f0b94e	`documented on the Exec type. Exec values contain regular expression Programs`
Packit	f0b94e	`(defined in src/prog.rs), which contain all the necessary tidbits for actually`
Packit	f0b94e	`executing a regular expression on search text.`
Packit	f0b94e
Packit	f0b94e	`For the most part, the execution logic is straight-forward and follows the`
Packit	f0b94e	`limitations of each engine described above pretty faithfully. The hairiest`
Packit	f0b94e	`part of src/exec.rs by far is the execution of the lazy DFA, since it requires`
Packit	f0b94e	`a forwards and backwards search, and then falls back to either the Pike VM or`
Packit	f0b94e	`backtracking if the caller requested capture locations.`
Packit	f0b94e
Packit	f0b94e	`The Exec type also contains mutable scratch space for each type of matching`
Packit	f0b94e	`engine. This scratch space is used during search (for example, for the lazy`
Packit	f0b94e	`DFA, it contains compiled states that are reused on subsequent searches).`
Packit	f0b94e
Packit	f0b94e	`### Programs`
Packit	f0b94e
Packit	f0b94e	`A regular expression program is essentially a sequence of opcodes produced by`
Packit	f0b94e	`the compiler plus various facts about the regular expression (such as whether`
Packit	f0b94e	`it is anchored, its capture names, etc.).`
Packit	f0b94e
Packit	f0b94e	`### The regex! macro`
Packit	f0b94e
Packit	f0b94e	The `regex!` macro no longer exists. It was developed in a bygone era as a
Packit	f0b94e	`compiler plugin during the infancy of the regex crate. Back then, then only`
Packit	f0b94e	matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
Packit	f0b94e	`also a Pike VM. The only advantages it offered over the dynamic Pike VM that`
Packit	f0b94e	`was built at runtime were the following:`
Packit	f0b94e
Packit	f0b94e	`1. Syntax checking was done at compile time. Your Rust program wouldn't`
Packit	f0b94e	`compile if your regex didn't compile.`
Packit	f0b94e	`2. Reduction of overhead that was proportional to the size of the regex.`
Packit	f0b94e	`For the most part, this overhead consisted of heap allocation, which`
Packit	f0b94e	`was nearly eliminated in the compiler plugin.`
Packit	f0b94e
Packit	f0b94e	`The main takeaway here is that the compiler plugin was a marginally faster`
Packit	f0b94e	`version of a slow regex engine. As the regex crate evolved, it grew other regex`
Packit	f0b94e	`engines (DFA, bounded backtracker) and sophisticated literal optimizations.`
Packit	f0b94e	`The regex macro didn't keep pace, and it therefore became (dramatically) slower`
Packit	f0b94e	`than the dynamic engines. The only reason left to use it was for the compile`
Packit	f0b94e	`time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint`
Packit	f0b94e	`tool) has a lint that checks your regular expression validity, which mostly`
Packit	f0b94e	`replaces that use case.`
Packit	f0b94e
Packit	f0b94e	`Additionally, the regex compiler plugin stopped receiving maintenance. Nobody`
Packit	f0b94e	`complained. At that point, it seemed prudent to just remove it.`
Packit	f0b94e
Packit	f0b94e	`Will a compiler plugin be brought back? The future is murky, but there is`
Packit	f0b94e	`definitely an opportunity there to build something that is faster than the`
Packit	f0b94e	`dynamic engines in some cases. But it will be challenging! As of now, there`
Packit	f0b94e	`are no plans to work on this.`
Packit	f0b94e
Packit	f0b94e
Packit	f0b94e	`## Testing`
Packit	f0b94e
Packit	f0b94e	`A key aspect of any mature regex library is its test suite. A subset of the`
Packit	f0b94e	`tests in this library come from Glenn Fowler's AT&T test suite (its online`
Packit	f0b94e	`presence seems gone at the time of writing). The source of the test suite is`
Packit	f0b94e	`located in src/testdata. The scripts/regex-match-tests.py takes the test suite`
Packit	f0b94e	`in src/testdata and generates tests/matches.rs.`
Packit	f0b94e
Packit	f0b94e	`There are also many other manually crafted tests and regression tests in`
Packit	f0b94e	`tests/tests.rs. Some of these tests were taken from RE2.`
Packit	f0b94e
Packit	f0b94e	`The biggest source of complexity in the tests is related to answering this`
Packit	f0b94e	`question: how can we reuse the tests to check all of our matching engines? One`
Packit	f0b94e	`approach would have been to encode every test into some kind of format (like`
Packit	f0b94e	`the AT&T test suite) and code generate tests for each matching engine. The`
Packit	f0b94e	`approach we use in this library is to create a Cargo.toml entry point for each`
Packit	f0b94e	`matching engine we want to test. The entry points are:`
Packit	f0b94e
Packit	f0b94e	* `tests/test_default.rs` - tests `Regex::new`
Packit	f0b94e	* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
Packit	f0b94e	* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
Packit	f0b94e	`algorithm on every regex.`
Packit	f0b94e	* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
Packit	f0b94e	`algorithm on every regex and use arbitrary byte based programs.`
Packit	f0b94e	* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
Packit	f0b94e	`algorithm on every regex and use UTF-8 byte based programs.`
Packit	f0b94e	* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
Packit	f0b94e	`backtracking on every regex.`
Packit	f0b94e	* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
Packit	f0b94e	`backtracking on every regex and use arbitrary byte based programs.`
Packit	f0b94e	* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
Packit	f0b94e	`backtracking on every regex and use UTF-8 byte based programs.`
Packit	f0b94e	* `tests/test_crates_regex.rs` - tests to make sure that all of the
Packit	f0b94e	`backends behave in the same way against a number of quickcheck`
Packit	f0b94e	`generated random inputs. These tests need to be enabled through`
Packit	f0b94e	the `RUST_REGEX_RANDOM_TEST` environment variable (see
Packit	f0b94e	`below).`
Packit	f0b94e
Packit	f0b94e	`The lazy DFA and pure literal engines are absent from this list because`
Packit	f0b94e	`they cannot be used on every regular expression. Instead, we rely on`
Packit	f0b94e	`tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.
Packit	f0b94e
Packit	f0b94e	Since the tests are repeated several times, and because `cargo test` runs all
Packit	f0b94e	`entry points, it can take a while to compile everything. To reduce compile`
Packit	f0b94e	times slightly, try using `cargo test --test default`, which will only use the
Packit	f0b94e	`tests/test_default.rs` entry point.
Packit	f0b94e
Packit	f0b94e	`The random testing takes quite a while, so it is not enabled by default.`
Packit	f0b94e	`In order to run the random testing you can set the`
Packit	f0b94e	`RUST_REGEX_RANDOM_TEST` environment variable to anything before
Packit	f0b94e	invoking `cargo test`. Note that this variable is inspected at compile
Packit	f0b94e	`time, so if the tests don't seem to be running, you may need to run`
Packit	f0b94e	`cargo clean`.
Packit	f0b94e
Packit	f0b94e	`## Benchmarking`
Packit	f0b94e
Packit	f0b94e	`The benchmarking in this crate is made up of many micro-benchmarks. Currently,`
Packit	f0b94e	`there are two primary sets of benchmarks: the benchmarks that were adopted`
Packit	f0b94e	at this library's inception (in `bench/src/misc.rs`) and a newer set of
Packit	f0b94e	`benchmarks meant to test various optimizations. Specifically, the latter set`
Packit	f0b94e	contain some analysis and are in `bench/src/sherlock.rs`. Also, the latter
Packit	f0b94e	`set are all executed on the same lengthy input whereas the former benchmarks`
Packit	f0b94e	`are executed on strings of varying length.`
Packit	f0b94e
Packit	f0b94e	`There is also a smattering of benchmarks for parsing and compilation.`
Packit	f0b94e
Packit	f0b94e	`Benchmarks are in a separate crate so that its dependencies can be managed`
Packit	f0b94e	`separately from the main regex crate.`
Packit	f0b94e
Packit	f0b94e	`Benchmarking follows a similarly wonky setup as tests. There are multiple entry`
Packit	f0b94e	`points:`
Packit	f0b94e
Packit	f0b94e	* `bench_rust.rs` - benchmarks `Regex::new`
Packit	f0b94e	* `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
Packit	f0b94e	* `bench_pcre.rs` - benchmarks PCRE
Packit	f0b94e	* `bench_onig.rs` - benchmarks Oniguruma
Packit	f0b94e
Packit	f0b94e	`The PCRE and Oniguruma benchmarks exist as a comparison point to a mature`
Packit	f0b94e	`regular expression library. In general, this regex library compares favorably`
Packit	f0b94e	`(there are even a few benchmarks that PCRE simply runs too slowly on or`
Packit	f0b94e	`outright can't execute at all). I would love to add other regular expression`
Packit	f0b94e	`library benchmarks (especially RE2).`
Packit	f0b94e
Packit	f0b94e	`If you're hacking on one of the matching engines and just want to see`
Packit	f0b94e	`benchmarks, then all you need to run is:`
Packit	f0b94e
Packit	f0b94e	`$ ./bench/run rust`
Packit	f0b94e
Packit	f0b94e	`If you want to compare your results with older benchmarks, then try:`
Packit	f0b94e
Packit	f0b94e	`$ ./bench/run rust \| tee old`
Packit	f0b94e	`$ ... make it faster`
Packit	f0b94e	`$ ./bench/run rust \| tee new`
Packit	f0b94e	`$ cargo benchcmp old new --improvements`
Packit	f0b94e
Packit	f0b94e	The `cargo-benchcmp` utility is available here:
Packit	f0b94e	`https://github.com/BurntSushi/cargo-benchcmp`
Packit	f0b94e
Packit	f0b94e	The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
Packit	f0b94e	`./bench/bench --help`.
Packit	f0b94e
Packit	f0b94e	`## Dev Docs`
Packit	f0b94e
Packit	f0b94e	`When digging your teeth into the codebase for the first time, the`
Packit	f0b94e	crate documentation can be a great resource. By default `rustdoc`
Packit	f0b94e	`will strip out all documentation of private crate members in an`
Packit	f0b94e	`effort to help consumers of the crate focus on the interface`
Packit	f0b94e	`without having to concern themselves with the implementation.`
Packit	f0b94e	`Normally this is a great thing, but if you want to start hacking`
Packit	f0b94e	`on regex internals it is not what you want. Many of the private members`
Packit	f0b94e	`of this crate are well documented with rustdoc style comments, and`
Packit	f0b94e	`it would be a shame to miss out on the opportunity that presents.`
Packit	f0b94e	`You can generate the private docs with:`
Packit	f0b94e
Packit	f0b94e	```
Packit	f0b94e	`$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments`
Packit	f0b94e	```
Packit	f0b94e
Packit	f0b94e	Then just point your browser at `target/doc/regex/index.html`.
Packit	f0b94e
Packit	f0b94e	`See https://github.com/rust-lang/rust/issues/15347 for more info`
Packit	f0b94e	`about generating developer docs for internal use.`

source-git / mozjs60

Source Code

Blame third_party/rust/regex/HACKING.md