<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
               "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
]>
<refentry id="glib-regex-syntax" revision="11 Jul 2006">
 <refmeta>
  <refentrytitle>Regular expression syntax</refentrytitle>
 </refmeta>

<!--
Based on the man page for pcrepattern.

Remember to sync this document with the file docs/pcrepattern.3 in the
pcre package when upgrading to a newer version of pcre.

In sync with PCRE 7.0
-->

<refnamediv>
<refname>Regular expression syntax</refname>
<refpurpose>
syntax and semantics of regular expressions supported by GRegex
</refpurpose>
</refnamediv>

<refsect1>
<title>GRegex regular expression details</title>
<para>
A regular expression is a pattern that is matched against a
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the string. As a
trivial example, the pattern
</para>

<programlisting>
The quick brown fox
</programlisting>

<para>
matches a portion of a string that is identical to itself. When
caseless matching is specified (the <varname>G_REGEX_CASELESS</varname> flag), letters are
matched independently of case.
</para>

<para>
The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.
</para>

<para>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those
that are recognized in square brackets. Outside square brackets, the
metacharacters are as follows:
</para>

<table>
<title>Metacharacters outside square brackets</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Character</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\</entry>
    <entry>general escape character with several uses</entry>
  </row>
  <row>
    <entry>^</entry>
    <entry>assert start of string (or line, in multiline mode)</entry>
  </row>
  <row>
    <entry>$</entry>
    <entry>assert end of string (or line, in multiline mode)</entry>
  </row>
  <row>
    <entry>.</entry>
    <entry>match any character except newline (by default)</entry>
  </row>
  <row>
    <entry>[</entry>
    <entry>start character class definition</entry>
  </row>
  <row>
    <entry>|</entry>
    <entry>start of alternative branch</entry>
  </row>
  <row>
    <entry>(</entry>
    <entry>start subpattern</entry>
  </row>
  <row>
    <entry>)</entry>
    <entry>end subpattern</entry>
  </row>
  <row>
    <entry>?</entry>
    <entry>extends the meaning of (, or 0/1 quantifier, or quantifier minimizer</entry>
  </row>
  <row>
    <entry>*</entry>
    <entry>0 or more quantifier</entry>
  </row>
  <row>
    <entry>+</entry>
    <entry>1 or more quantifier, also "possessive quantifier"</entry>
  </row>
  <row>
    <entry>{</entry>
    <entry>start min/max quantifier</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:
</para>

<table>
<title>Metacharacters inside square brackets</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Character</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\</entry>
    <entry>general escape character</entry>
  </row>
  <row>
    <entry>^</entry>
    <entry>negate the class, but only if the first character</entry>
  </row>
  <row>
    <entry>-</entry>
    <entry>indicates character range</entry>
  </row>
  <row>
    <entry>[</entry>
    <entry>POSIX character class (only if followed by POSIX syntax)</entry>
  </row>
  <row>
    <entry>]</entry>
    <entry>terminates the character class</entry>
  </row>
</tbody>
</tgroup>
</table>

</refsect1>

<refsect1>
<title>Backslash</title>
<para>
The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.
</para>

<para>
For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.
</para>
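
<para>
The escaping rule above can be sketched in plain C. The helper below is
hypothetical (quote_literal is a name made up for this illustration, not a
GLib function) and assumes a NUL-terminated string and a caller-provided
buffer. It puts a backslash before every ASCII character that is not a letter
or digit and copies all other bytes unchanged. Real code would normally call
g_regex_escape_string() for this purpose.
</para>

<programlisting><![CDATA[
```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of GLib: copy src into dst, putting a
 * backslash before every ASCII character that is not a letter or digit.
 * Bytes >= 0x80 (UTF-8 multibyte sequences) are copied unchanged.
 * Returns the length written, or (size_t)-1 if dst is too small. */
size_t quote_literal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int needs_escape = c < 0x80 && !isalnum(c);

        /* Room for the optional backslash, the character itself and
         * the terminating NUL. */
        if (out + (needs_escape ? 2 : 1) + 1 > dst_size)
            return (size_t)-1;
        if (needs_escape)
            dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    if (out + 1 > dst_size)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```
]]></programlisting>

<para>
With this sketch the input a*b becomes a\*b, and a single backslash becomes
two backslashes, as described above.
</para>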

<para>
If a pattern is compiled with the <varname>G_REGEX_EXTENDED</varname>
option, whitespace in the pattern (other than in a character class) and
characters between a # outside a character class and the next newline
are ignored.
An escaping backslash can be used to include a whitespace or # character
as part of the pattern.
</para>

<para>
Note that the C compiler interprets backslashes in string literals itself,
so you need to double every \ character when you write a regular expression
as a C string literal, for example "\\d{3}".
</para>
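
<para>
As a small illustration that relies only on standard C, the arrays below hold
the patterns \d{3} and \\ as they have to be written in C source. The
identifiers are made up for this example.
</para>

<programlisting><![CDATA[
```c
#include <string.h>

/* Pattern \d{3}: one backslash in the pattern, two in the C source. */
static const char three_digits[] = "\\d{3}";

/* Pattern \\ (matches one backslash): two backslashes in the pattern,
 * four in the C source. */
static const char one_backslash[] = "\\\\";

/* Length of a pattern as the regex compiler sees it, that is, after
 * the C compiler has processed the string literal. */
size_t pattern_length(const char *pattern)
{
    return strlen(pattern);
}
```
]]></programlisting>

<para>
Here pattern_length(three_digits) is 5 and pattern_length(one_backslash) is 2.
Either array could be passed as the pattern argument of g_regex_new().
</para>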

<para>
If you want to remove the special meaning from a sequence of characters,
you can do so by putting them between \Q and \E.
The \Q...\E sequence is recognized both inside and outside character
classes.
</para>

<refsect2>
<title>Non-printing characters</title>
<para>
A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:
</para>

<table>
<title>Non-printing characters</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\a</entry>
    <entry>alarm, that is, the BEL character (hex 07)</entry>
  </row>
  <row>
    <entry>\cx</entry>
    <entry>"control-x", where x is any character</entry>
  </row>
  <row>
    <entry>\e</entry>
    <entry>escape (hex 1B)</entry>
  </row>
  <row>
    <entry>\f</entry>
    <entry>formfeed (hex 0C)</entry>
  </row>
  <row>
    <entry>\n</entry>
    <entry>newline (hex 0A)</entry>
  </row>
  <row>
    <entry>\r</entry>
    <entry>carriage return (hex 0D)</entry>
  </row>
  <row>
    <entry>\t</entry>
    <entry>tab (hex 09)</entry>
  </row>
  <row>
    <entry>\ddd</entry>
    <entry>character with octal code ddd, or backreference</entry>
  </row>
  <row>
    <entry>\xhh</entry>
    <entry>character with hex code hh</entry>
  </row>
  <row>
    <entry>\x{hhh..}</entry>
    <entry>character with hex code hhh..</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
</para>
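
<para>
The \cx rule can be written out as a short C function. This is only a model
of the description above, assuming ASCII input; control_escape_value is a
hypothetical name, not part of GRegex.
</para>

<programlisting><![CDATA[
```c
#include <ctype.h>

/* Model of the \cx rule: fold a lower case letter to upper case, then
 * invert bit 6 (hex 40). Assumes x is an ASCII character. */
unsigned char control_escape_value(unsigned char x)
{
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```
]]></programlisting>

<para>
For the three examples in the text, the function returns hex 1A for z,
hex 3B for { and hex 7B for ;.
</para>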

<para>
After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). Any number of hexadecimal digits may appear
between \x{ and }, but the value of the character code
must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic hexadecimal
escape, with no following digits, giving a character whose
value is zero.
</para>
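
<para>
The following C sketch models how the text after \x is read under the rules
above. It is an illustration, not GRegex's parser: parse_hex_escape is a
hypothetical name, and treating empty braces as malformed is an assumption
that the text does not settle.
</para>

<programlisting><![CDATA[
```c
#include <ctype.h>
#include <stddef.h>

/* Numeric value of one hexadecimal digit (caller guarantees isxdigit). */
static unsigned long hex_value(unsigned char c)
{
    if (isdigit(c))
        return (unsigned long)(c - '0');
    return (unsigned long)(tolower(c) - 'a' + 10);
}

/* Hypothetical model. p points just past "\x". Stores the character
 * code in *code and returns the number of pattern characters consumed
 * after "\x", or (size_t)-1 if a braced value is larger than 7FFFFFFF. */
size_t parse_hex_escape(const char *p, unsigned long *code)
{
    unsigned long v = 0;
    size_t i;

    if (p[0] == '{') {
        for (i = 1; isxdigit((unsigned char)p[i]); i++) {
            if (v > 0x07FFFFFFUL)   /* next digit would pass 7FFFFFFF */
                return (size_t)-1;
            v = v * 16 + hex_value((unsigned char)p[i]);
        }
        if (i > 1 && p[i] == '}') {
            *code = v;
            return i + 1;
        }
        /* Malformed braces (empty braces included, by assumption):
         * treated as \x with no digits, value zero. */
        *code = 0;
        return 0;
    }

    /* Basic form: zero, one or two hexadecimal digits. */
    for (i = 0; i < 2 && isxdigit((unsigned char)p[i]); i++)
        v = v * 16 + hex_value((unsigned char)p[i]);
    *code = v;
    return i;
}
```
]]></programlisting>

<para>
Given the text dc or {dc} the function stores hex DC in both cases, and for
{zz} it consumes nothing and stores zero, so the brace and the following
characters are left to be read as ordinary pattern text.
</para>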

<para>
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.
</para>

<para>
After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the pattern character that follows is itself an octal
digit.
</para>

<para>
The handling of a backslash followed by a digit other than 0 is complicated.
Outside a character class, GRegex reads it and any following digits as a
decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.
</para>

<para>
Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, GRegex re-reads
up to three octal digits following the backslash, and uses them to generate
a data character. Any subsequent digits stand for themselves. For example:
</para>

<table>
<title>Backslash followed by digits</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\040</entry>
    <entry>is another way of writing a space</entry>
  </row>
  <row>
    <entry>\40</entry>
    <entry>is the same, provided there are fewer than 40 previous capturing subpatterns</entry>
  </row>
  <row>
    <entry>\7</entry>
    <entry>is always a back reference</entry>
  </row>
  <row>
    <entry>\11</entry>
    <entry>might be a back reference, or another way of writing a tab</entry>
  </row>
  <row>
    <entry>\011</entry>
    <entry>is always a tab</entry>
  </row>
  <row>
    <entry>\0113</entry>
    <entry>is a tab followed by the character "3"</entry>
  </row>
  <row>
    <entry>\113</entry>
    <entry>might be a back reference, otherwise the character with octal code 113</entry>
  </row>
  <row>
    <entry>\377</entry>
    <entry>might be a back reference, otherwise the byte consisting entirely of 1 bits</entry>
  </row>
  <row>
    <entry>\81</entry>
    <entry>is either a back reference, or a binary zero followed by the two characters "8" and "1"</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
</para>
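
<para>
The decision between a back reference and an octal character code can be
modelled in C as follows. This is a sketch of the rules stated above for
text outside a character class, not GRegex's actual parser. The names
esc_result and interpret_digits are hypothetical, and the run of digits is
assumed to be short enough to fit in an int.
</para>

<programlisting><![CDATA[
```c
#include <ctype.h>
#include <stddef.h>

typedef enum { ESC_BACK_REFERENCE, ESC_CHARACTER } esc_kind;

typedef struct {
    esc_kind kind;    /* how the sequence is interpreted */
    int value;        /* group number, or character code */
    size_t consumed;  /* digits used after the backslash */
} esc_result;

/* Hypothetical model, outside a character class. digits points just
 * past the backslash and must start with a decimal digit; groups is
 * the number of capturing left parentheses seen so far. */
esc_result interpret_digits(const char *digits, int groups)
{
    esc_result r;
    size_t i = 0;
    int n = 0;

    if (digits[0] != '0') {
        /* Read the whole run of digits as a decimal number. */
        while (isdigit((unsigned char)digits[i])) {
            n = n * 10 + (digits[i] - '0');
            i++;
        }
        if (n < 10 || n <= groups) {
            r.kind = ESC_BACK_REFERENCE;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits as a character code. */
    n = 0;
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        n = n * 8 + (digits[i] - '0');
    r.kind = ESC_CHARACTER;
    r.value = n;
    r.consumed = i;
    return r;
}
```
]]></programlisting>

<para>
With no capturing subpatterns, the digits 40 yield character code 32 (a
space) and the digits 81 yield a binary zero with no digits consumed, so
that "8" and "1" remain as literal characters, matching the table above.
</para>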

<para>
All the sequences that define a single character can be used both inside
and outside character classes. In addition, inside a character class, the
sequence \b is interpreted as the backspace character (hex 08), and the
sequences \R and \X are interpreted as the characters "R" and "X", respectively.
Outside a character class, these sequences have different meanings (see below).
</para>
</refsect2>

<refsect2>
<title>Absolute and relative back references</title>
<para>
The sequence \g followed by a positive or negative number, optionally enclosed
in braces, is an absolute or relative back reference. Back references are
discussed later, following the discussion of parenthesized subpatterns.
</para>
</refsect2>

<refsect2>
<title>Generic character types</title>

<para>
Another use of backslash is for specifying generic character types.
The following are always recognized:
</para>

<table>
<title>Generic characters</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\d</entry>
    <entry>any decimal digit</entry>
  </row>
  <row>
    <entry>\D</entry>
    <entry>any character that is not a decimal digit</entry>
  </row>
  <row>
    <entry>\s</entry>
    <entry>any whitespace character</entry>
  </row>
  <row>
    <entry>\S</entry>
    <entry>any character that is not a whitespace character</entry>
  </row>
  <row>
    <entry>\w</entry>
    <entry>any "word" character</entry>
  </row>
  <row>
    <entry>\W</entry>
    <entry>any "non-word" character</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.
</para>

<para>
These character type sequences can appear both inside and outside character
classes. They each match one character of the appropriate type.
If the current matching point is at the end of the passed string, all
of them fail, since there is no character to match.
</para>

<para>
For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32).
</para>
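
<para>
The \s set listed above is small enough to spell out as a C predicate. This
is a model for ASCII input only, and matches_backslash_s is a hypothetical
name.
</para>

<programlisting><![CDATA[
```c
/* Model of the \s set: HT, LF, FF, CR and space. VT (11) is excluded. */
int matches_backslash_s(unsigned char c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```
]]></programlisting>

<para>
matches_backslash_s('\v') is 0, whereas the C library function isspace('\v')
is non-zero in the "C" locale, which mirrors the difference from the POSIX
"space" class.
</para>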

<para>
A "word" character is an underscore or any character less than 256 that
is a letter or digit.</para>

<para>
Characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W.
</para>
</refsect2>

<refsect2>
<title>Newline sequences</title>
<para>Outside a character class, the escape sequence \R matches any Unicode
newline sequence.
It matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), NEL (next
line, U+0085), LS (line separator, U+2028), or PS (paragraph separator, U+2029).
The two-character sequence is treated as a single unit that
cannot be split. Inside a character class, \R matches the letter "R".</para>
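
<para>
As a sketch of what \R accepts, the C function below returns how many bytes
of a UTF-8 string form a newline sequence at its start. It is a model under
the assumption of NUL-terminated UTF-8 input; newline_sequence_length is a
hypothetical name.
</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

/* Model of \R for NUL-terminated UTF-8 text: returns the byte length of
 * the newline sequence at the start of s, or 0 if there is none. */
size_t newline_sequence_length(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] == '\r' && p[1] == '\n')
        return 2;                       /* CR LF, taken as one unit */
    if (p[0] >= 0x0A && p[0] <= 0x0D)
        return 1;                       /* LF, VT, FF or CR */
    if (p[0] == 0xC2 && p[1] == 0x85)
        return 2;                       /* NEL, U+0085 */
    if (p[0] == 0xE2 && p[1] == 0x80 && (p[2] == 0xA8 || p[2] == 0xA9))
        return 3;                       /* LS U+2028 or PS U+2029 */
    return 0;
}
```
]]></programlisting>

<para>
Because the CR LF test comes first, a CR that is followed by LF is never
reported as a one-byte sequence, which corresponds to the two-character
sequence being treated as a single unit.
</para>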
</refsect2>

<refsect2>
<title>Unicode character properties</title>
<para>
Three additional escape sequences match characters by their Unicode
properties:
</para>

<table>
<title>Unicode property escapes</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\p{xx}</entry>
    <entry>a character with the xx property</entry>
  </row>
  <row>
    <entry>\P{xx}</entry>
    <entry>a character without the xx property</entry>
  </row>
  <row>
    <entry>\X</entry>
    <entry>an extended Unicode sequence</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
The property names represented by xx above are limited to the Unicode
script names, the general category properties, and "Any", which matches
any character (including newline). Other properties such as "InMusicalSymbols"
are not currently supported. Note that \P{Any} does not match any characters,
so always causes a match failure.
</para>

<para>
Sets of Unicode characters are defined as belonging to certain scripts. A
character from one of these sets can be matched using a script name. For
example, \p{Greek} or \P{Han}.
</para>

<para>
Characters that are not part of an identified script are lumped together as
"Common". The current list of scripts can be found in the documentation for
the #GUnicodeScript enumeration. Script names for use with \p{} can be
found by replacing all spaces with underscores, e.g. for Linear B use
\p{Linear_B}.
</para>

<para>
Each character has exactly one general category property, specified by a
two-letter abbreviation. For compatibility with Perl, negation can be specified
by including a circumflex between the opening brace and the property name. For
example, \p{^Lu} is the same as \P{Lu}.
</para>

<para>
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these two
examples have the same effect:
</para>

<programlisting>
\p{L}
\pL
</programlisting>

<para>
In addition to the two-letter category codes listed in the
documentation for the #GUnicodeType enumeration, the following
general category property codes are supported:
</para>

<table>
<title>Property codes</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Code</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>C</entry>
    <entry>Other</entry>
  </row>
  <row>
    <entry>L</entry>
    <entry>Letter</entry>
  </row>
  <row>
    <entry>M</entry>
    <entry>Mark</entry>
  </row>
  <row>
    <entry>N</entry>
    <entry>Number</entry>
  </row>
  <row>
    <entry>P</entry>
    <entry>Punctuation</entry>
  </row>
  <row>
    <entry>S</entry>
    <entry>Symbol</entry>
  </row>
  <row>
    <entry>Z</entry>
    <entry>Separator</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
The special property L&amp; is also supported: it matches a character that has
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
a modifier or "other".
</para>

<para>
The long synonyms for these properties that Perl supports (such as \p{Letter})
are not supported by GRegex, nor is it permitted to prefix any of these
properties with "Is".
</para>

<para>
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
</para>

<para>
Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
</para>

<para>
The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
</para>

<programlisting>
(?>\PM\pM*)
</programlisting>

<para>
That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
</para>

<para>
Matching characters by Unicode property is not fast, because GRegex has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties.
</para>
</refsect2>

<refsect2>
<title>Simple assertions</title>
<para>
The final use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular point in
a match, without consuming any characters from the string. The
use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
</para>

<table>
<title>Simple assertions</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\b</entry>
    <entry>matches at a word boundary</entry>
  </row>
  <row>
    <entry>\B</entry>
    <entry>matches when not at a word boundary</entry>
  </row>
  <row>
    <entry>\A</entry>
    <entry>matches at the start of the string</entry>
  </row>
  <row>
    <entry>\Z</entry>
    <entry>matches at the end of the string or before a newline at the end of the string</entry>
  </row>
  <row>
    <entry>\z</entry>
    <entry>matches only at the end of the string</entry>
  </row>
  <row>
    <entry>\G</entry>
    <entry>matches at the first matching position in the string</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).
</para>

<para>
A word boundary is a position in the string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.
</para>
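
<para>
The word boundary definition translates directly into C. The sketch below
is a model rather than GRegex's implementation: it uses an ASCII-only
stand-in for \w, and the names is_word_char and is_word_boundary are
hypothetical.
</para>

<programlisting><![CDATA[
```c
#include <ctype.h>
#include <stddef.h>

/* ASCII-only stand-in for \w: letters, digits and underscore. */
static int is_word_char(unsigned char c)
{
    return c == '_' || (c < 0x80 && isalnum(c));
}

/* Hypothetical model of \b: non-zero if pos (0 <= pos <= len) is a
 * word boundary in the len-byte string s. Positions outside the string
 * count as non-word, which gives the start and end rules from the text. */
int is_word_boundary(const char *s, size_t len, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    int after = pos < len && is_word_char((unsigned char)s[pos]);

    return before != after;
}
```
]]></programlisting>

<para>
For the seven-character string "foo bar" the boundaries are at positions
0, 3, 4 and 7. The \B assertion corresponds to the negation of this
function.
</para>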

<para>
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the string, whatever options are
set. Thus, they are independent of multiline mode. These three assertions
are not affected by the <varname>G_REGEX_MATCH_NOTBOL</varname> or <varname>G_REGEX_MATCH_NOTEOL</varname> options,
which affect only the behaviour of the circumflex and dollar metacharacters.
However, if the start_position argument of a matching function is non-zero,
indicating that matching is to start at a point other than the beginning of
the string, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at the
very end, whereas \z matches only at the end.
</para>

<para>
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the start_position argument
to the matching functions. It differs from \A when the value of start_position is
non-zero.
</para>
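
<para>
The four position assertions \A, \G, \z and \Z can be summarised as C
predicates. This is a model of the descriptions above, not GRegex code: the
function names are hypothetical, and the newline is assumed to be a single
LF character.
</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

/* Hypothetical model. s is the subject of length len, start is the
 * start_position passed to the matching function, and pos is the
 * current matching position (start <= pos <= len). */

/* \A: only at the very start of the subject. */
int at_subject_start(size_t pos)
{
    return pos == 0;
}

/* \G: only at the position where matching was asked to start. */
int at_match_start(size_t start, size_t pos)
{
    return pos == start;
}

/* \z: only at the very end of the subject. */
int at_subject_end(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: at the very end, or just before a final newline (assumed LF). */
int at_subject_end_or_final_newline(const char *s, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```
]]></programlisting>

<para>
With a non-zero start, pos is never 0, so at_subject_start() cannot hold
while at_match_start() still holds at pos equal to start. This is the
difference between \A and \G noted above.
</para>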
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Note, however, that the interpretation of \G, as the start of the
Packit ae235b
current match, is subtly different from Perl’s, which defines it as the
Packit ae235b
end of the previous match. In Perl, these can be different when the
Packit ae235b
previously matched string was empty.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If all the alternatives of a pattern begin with \G, the expression is
Packit ae235b
anchored to the starting match position, and the "anchored" flag is set
Packit ae235b
in the compiled regular expression.
Packit ae235b
</para>
Packit ae235b
</refsect2>
Packit ae235b
</refsect1>

<refsect1>
<title>Circumflex and dollar</title>
<para>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the string. If the start_position argument to
the matching functions is non-zero, circumflex can never match if the
<varname>G_REGEX_MULTILINE</varname> option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
</para>

<para>
Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the string,
it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
</para>

<para>
A dollar character is an assertion that is true only if the current
matching point is at the end of the string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
</para>

<para>
The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the <varname>G_REGEX_DOLLAR_ENDONLY</varname> option at
compile time. This does not affect the \Z assertion.
</para>

<para>
The meanings of the circumflex and dollar characters are changed if the
<varname>G_REGEX_MULTILINE</varname> option is set. When this is the case,
a circumflex matches immediately after internal newlines as well as at the
start of the string. It does not match after a newline that ends the string.
A dollar matches before any newlines in the string, as well as at the very
end, when <varname>G_REGEX_MULTILINE</varname> is set. When newline is
specified as the two-character sequence CRLF, isolated CR and LF characters
do not indicate newlines.
</para>

<para>
For example, the pattern /^abc$/ matches the string "def\nabc" (where
\n represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the <varname>start_position</varname> argument of a matching function
is non-zero. The <varname>G_REGEX_DOLLAR_ENDONLY</varname> option is ignored
if <varname>G_REGEX_MULTILINE</varname> is set.
</para>
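
<para>
The /^abc$/ example can be tried out with a short script. The sketch below
is not GRegex code: it assumes that Python's <literal>re</literal> module is an
acceptable stand-in, with <literal>re.MULTILINE</literal> playing the role of
<varname>G_REGEX_MULTILINE</varname>. The helper name
<literal>anchored_hit</literal> is illustrative only.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex in this sketch.
SUBJECT = "def\nabc"  # "\n" is a real newline inside the subject


def anchored_hit(pattern, subject, multiline=False):
    """Return the matched text, or None when the pattern does not match."""
    flags = re.MULTILINE if multiline else 0
    found = re.search(pattern, subject, flags)
    return found.group(0) if found else None


# Default mode: circumflex is true only at the very start of the subject.
default_result = anchored_hit(r"^abc$", SUBJECT)

# Multiline mode: circumflex is also true right after the internal newline.
multiline_result = anchored_hit(r"^abc$", SUBJECT, multiline=True)

# Default mode: dollar is still true just before a newline that ends the subject.
before_final_newline = anchored_hit(r"abc$", "abc\n")
```

<para>
Only the multiline call finds "abc" on the second line; the last call shows
the default behaviour of dollar before a trailing newline.
</para>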

<para>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the string in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not <varname>G_REGEX_MULTILINE</varname>
is set.
</para>
</refsect1>

<refsect1>
<title>Full stop (period, dot)</title>
<para>
Outside a character class, a dot in the pattern matches any one character
in the string, including a non-printing character, but not (by
default) newline. In UTF-8 a character might be more than one byte long.
</para>

<para>
When a line ending is defined as a single character, dot never matches that
character; when the two-character sequence CRLF is used, dot does not match CR
if it is immediately followed by LF, but otherwise it matches all characters
(including isolated CRs and LFs). When any Unicode line endings are being
recognized, dot does not match CR or LF or any of the other line ending
characters.
</para>

<para>
The behaviour of dot with regard to newlines can be changed. If the
<varname>G_REGEX_DOTALL</varname> option is set, a dot matches any one
character, without exception. If newline is defined as the two-character
sequence CRLF, it takes two dots to match it.
</para>

<para>
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</para>
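
<para>
A minimal sketch of the dot rules follows. It assumes Python's
<literal>re</literal> module as a stand-in for GRegex, with
<literal>re.DOTALL</literal> corresponding to <varname>G_REGEX_DOTALL</varname>.
Python recognizes only LF as a newline, so the CRLF and Unicode line-ending
cases described above are deliberately not covered. The helper
<literal>dot_matches</literal> is a hypothetical name.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex; it knows only "\n"
# as a newline, so the CRLF and Unicode line-ending rules are not shown.


def dot_matches(subject, dotall=False):
    """Report whether the pattern a.b matches somewhere in subject."""
    flags = re.DOTALL if dotall else 0
    return re.search(r"a.b", subject, flags) is not None


ordinary_char = dot_matches("a-b")                      # dot consumes "-"
newline_by_default = dot_matches("a\nb")                # dot refuses "\n"
newline_with_dotall = dot_matches("a\nb", dotall=True)  # dot accepts "\n"

# Inside a character class the dot is an ordinary character.
class_dot_matches_dot = re.fullmatch(r"[.]", ".") is not None
class_dot_matches_x = re.fullmatch(r"[.]", "x") is not None
```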
</refsect1>

<refsect1>
<title>Matching a single byte</title>
<para>
Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches any line
ending characters.
The feature is provided in Perl in order to match individual bytes in
UTF-8 mode. Because it breaks up UTF-8 characters into individual
bytes, what remains in the string may be a malformed UTF-8 string. For
this reason, the \C escape sequence is best avoided.
</para>

<para>
GRegex does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to calculate
the length of the lookbehind.
</para>
</refsect1>

<refsect1>
<title>Square brackets and character classes</title>
<para>
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not special.
If a closing square bracket is required as a member of the class,
it should be the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.
</para>

<para>
A character class matches a single character in the string. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
string character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.
</para>

<para>
For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still consumes
a character from the string, and therefore it fails if the current pointer
is at the end of the string.
</para>
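
<para>
The point that a negated class still consumes a character can be seen in a
small experiment. This is a sketch under the assumption that Python's
<literal>re</literal> module, which shares this class syntax, is used in place
of GRegex; <literal>first_match</literal> is an illustrative helper.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex in this sketch.
VOWEL = re.compile(r"[aeiou]")
NON_VOWEL = re.compile(r"[^aeiou]")


def first_match(regex, subject):
    """Return the first matched text, or None if nothing matches."""
    found = regex.search(subject)
    return found.group(0) if found else None


found_vowel = first_match(VOWEL, "rhythm and blues")  # first lower case vowel
found_other = first_match(NON_VOWEL, "aeiox")         # first non-vowel

# "q" followed by a negated class: at the end of the subject there is no
# character left to consume, so the class fails even though "nothing" is
# certainly not a vowel.
at_end = re.search(r"q[^aeiou]", "iraq")

# A circumflex that is not first in the class is an ordinary member.
caret_member = re.fullmatch(r"[a^]", "^") is not None
```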

<para>
In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
</para>

<para>
When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
</para>

<para>
Characters that might indicate line breaks are never treated
in any special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the <varname>G_REGEX_DOTALL</varname>
and <varname>G_REGEX_MULTILINE</varname> options is used. A class such as [^a]
always matches one of these characters.
</para>

<para>
The minus (hyphen) character can be used to specify a range of characters in
a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
</para>

<para>
It is not possible to have the literal character "]" as the end character
of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it
would match "W46]" or "-46]". However, if the "]" is escaped with a
backslash it is interpreted as the end of range, so [W-\]46] is interpreted
as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end
a range.
</para>
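
<para>
The difference between [W-]46] and [W-\]46] is easy to misread, so a short
check may help. The sketch assumes Python's <literal>re</literal> module as a
stand-in for GRegex; it parses these two classes the same way as described
above. <literal>full</literal> is a hypothetical helper.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex in this sketch.
UNESCAPED = re.compile(r"[W-]46]")   # class {W, -} followed by literal "46]"
ESCAPED = re.compile(r"[W-\]46]")    # one class: range W..] plus "4" and "6"


def full(regex, subject):
    """True when the whole subject is matched by regex."""
    return regex.fullmatch(subject) is not None


unescaped_hits = [s for s in ("W46]", "-46]", "X46]") if full(UNESCAPED, s)]
escaped_hits = [s for s in ("W", "X", "]", "4", "6", "-", "5") if full(ESCAPED, s)]
```

<para>
In the escaped form the hyphen no longer matches itself, because it has
become the range operator between "W" and "]".
</para>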

<para>
Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].
</para>

<para>
The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A
circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
</para>
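
<para>
The two classes just mentioned can be exercised as follows. This is a sketch
that assumes Python's <literal>re</literal> module in place of GRegex; the
<literal>re.ASCII</literal> flag is an addition made here so that \d and \W
are limited to ASCII and the result does not depend on Unicode tables.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex; re.ASCII keeps
# \d and \W restricted to ASCII characters for this illustration.
HEX_UPPER = re.compile(r"[\dABCDEF]+", re.ASCII)
ALNUM_ONLY = re.compile(r"[^\W_]+", re.ASCII)


def is_all(regex, subject):
    """True when every character of subject belongs to the class."""
    return regex.fullmatch(subject) is not None


upper_hex_ok = is_all(HEX_UPPER, "7F3A")
lower_hex_ok = is_all(HEX_UPPER, "7f3a")      # lower case a-f is not listed
letters_digits_ok = is_all(ALNUM_ONLY, "abc123")
underscore_ok = is_all(ALNUM_ONLY, "abc_123")  # "_" is excluded explicitly
```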

<para>
The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
</para>
</refsect1>

<refsect1>
<title>POSIX character classes</title>
<para>
GRegex supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. For example,
</para>

<programlisting>
[01[:alpha:]%]
</programlisting>

<para>
matches "0", "1", any alphabetic character, or "%". The supported class
names are
</para>

<table>
<title>POSIX classes</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Name</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>alnum</entry>
    <entry>letters and digits</entry>
  </row>
  <row>
    <entry>alpha</entry>
    <entry>letters</entry>
  </row>
  <row>
    <entry>ascii</entry>
    <entry>character codes 0 - 127</entry>
  </row>
  <row>
    <entry>blank</entry>
    <entry>space or tab only</entry>
  </row>
  <row>
    <entry>cntrl</entry>
    <entry>control characters</entry>
  </row>
  <row>
    <entry>digit</entry>
    <entry>decimal digits (same as \d)</entry>
  </row>
  <row>
    <entry>graph</entry>
    <entry>printing characters, excluding space</entry>
  </row>
  <row>
    <entry>lower</entry>
    <entry>lower case letters</entry>
  </row>
  <row>
    <entry>print</entry>
    <entry>printing characters, including space</entry>
  </row>
  <row>
    <entry>punct</entry>
    <entry>printing characters, excluding letters and digits</entry>
  </row>
  <row>
    <entry>space</entry>
    <entry>white space (not quite the same as \s)</entry>
  </row>
  <row>
    <entry>upper</entry>
    <entry>upper case letters</entry>
  </row>
  <row>
    <entry>word</entry>
    <entry>"word" characters (same as \w)</entry>
  </row>
  <row>
    <entry>xdigit</entry>
    <entry>hexadecimal digits</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
</para>

<para>
The name "word" is a Perl extension, and "blank" is a GNU extension.
Another Perl extension is negation, which is indicated by a ^ character
after the colon. For example,
</para>

<programlisting>
[12[:^digit:]]
</programlisting>

<para>
matches "1", "2", or any non-digit. GRegex also recognizes the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
</para>

<para>
In UTF-8 mode, characters with values greater than 128 do not match any
of the POSIX character classes.
</para>
</refsect1>

<refsect1>
<title>Vertical bar</title>
<para>
Vertical bar characters are used to separate alternative patterns. For
example, the pattern
</para>

<programlisting>
 gilbert|sullivan
</programlisting>

<para>
matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used. If the
alternatives are within a subpattern (defined below), "succeeds" means
matching the rest of the main pattern as well as the alternative in the
subpattern.
</para>
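
<para>
The left-to-right rule means that the order of alternatives matters. The
sketch below illustrates this, assuming Python's <literal>re</literal> module
as a stand-in for GRegex (both try alternatives in order rather than looking
for the longest one). <literal>first_alternative</literal> is an illustrative
name.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex in this sketch.


def first_alternative(pattern, subject):
    """Return the text matched by pattern in subject, or None."""
    found = re.search(pattern, subject)
    return found.group(0) if found else None


either = [first_alternative(r"gilbert|sullivan", s)
          for s in ("gilbert", "sullivan", "savoy")]

# The first alternative that succeeds wins, even if a later one is longer.
shorter_first = first_alternative(r"a|ab", "ab")
longer_first = first_alternative(r"ab|a", "ab")

# Inside a subpattern, "succeeds" includes the rest of the pattern: "a"
# alone cannot be followed by "c" here, so the second alternative is used.
with_tail = first_alternative(r"(a|ab)c", "abc")
```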
</refsect1>

<refsect1>
<title>Internal option setting</title>
<para>
The settings of the <varname>G_REGEX_CASELESS</varname>, <varname>G_REGEX_MULTILINE</varname>, <varname>G_REGEX_DOTALL</varname>,
and <varname>G_REGEX_EXTENDED</varname> options can be changed from within the pattern by a
sequence of Perl-style option letters enclosed between "(?" and ")". The
option letters are
</para>

<table>
<title>Option settings</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Option</entry>
    <entry>Flag</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>i</entry>
    <entry><varname>G_REGEX_CASELESS</varname></entry>
  </row>
  <row>
    <entry>m</entry>
    <entry><varname>G_REGEX_MULTILINE</varname></entry>
  </row>
  <row>
    <entry>s</entry>
    <entry><varname>G_REGEX_DOTALL</varname></entry>
  </row>
  <row>
    <entry>x</entry>
    <entry><varname>G_REGEX_EXTENDED</varname></entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets <varname>G_REGEX_CASELESS</varname>
and <varname>G_REGEX_MULTILINE</varname> while unsetting <varname>G_REGEX_DOTALL</varname> and <varname>G_REGEX_EXTENDED</varname>,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
</para>

<para>
When an option change occurs at top level (that is, not inside subpattern
parentheses), the change applies to the remainder of the pattern
that follows.
</para>

<para>
An option change within a subpattern (see below for a description of subpatterns)
affects only that part of the current pattern that follows it, so
</para>

<programlisting>
(a(?i)b)c
</programlisting>

<para>
matches abc and aBc and no other strings (assuming <varname>G_REGEX_CASELESS</varname> is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,
</para>

<programlisting>
(a(?i)b|c)
</programlisting>

<para>
matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
</para>

<para>
The options <varname>G_REGEX_UNGREEDY</varname>,
<varname>G_REGEX_EXTRA</varname>, and <varname>G_REGEX_DUPNAMES</varname>
can be changed in the same way as the Perl-compatible options by using
the characters U, X and J respectively.
</para>
</refsect1>

<refsect1>
<title>Subpatterns</title>
<para>
Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things:
</para>

<itemizedlist>
<listitem><para>
It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|) matches one of the words "cat", "cataract", or
"caterpillar". Without the parentheses, it would match "cataract",
"erpillar" or an empty string.
</para></listitem>
<listitem><para>
It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the
string that matched the subpattern can be obtained using <function>g_match_info_fetch()</function>.
Opening parentheses are counted from left to right (starting from 1, as
subpattern 0 is the whole matched string) to obtain numbers for the
capturing subpatterns.
</para></listitem>
</itemizedlist>

<para>
For example, if the string "the red king" is matched against the pattern
</para>

<programlisting>
the ((red|white) (king|queen))
</programlisting>

<para>
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
</para>

<para>
The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any capturing,
and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
</para>

<programlisting>
the ((?:red|white) (king|queen))
</programlisting>

<para>
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
</para>
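
<para>
The numbering of capturing and non-capturing subpatterns can be checked with
the sketch below. It assumes Python's <literal>re</literal> module in place of
GRegex; with GRegex the substrings would be fetched through
<function>g_match_info_fetch()</function> instead. The helper
<literal>captures</literal> is a hypothetical name.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex in this sketch.


def captures(pattern, subject):
    """Return the captured substrings 1..n as a tuple, or None on no match."""
    found = re.fullmatch(pattern, subject)
    return None if found is None else found.groups()


capturing = captures(r"the ((red|white) (king|queen))", "the red king")
non_capturing = captures(r"the ((?:red|white) (king|queen))", "the white queen")

# Subpattern 0 is the whole matched string.
whole = re.fullmatch(r"the ((red|white) (king|queen))", "the red king").group(0)

# Parentheses also localize alternatives, including an empty one.
cat_words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", w)]
```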

<para>
As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
</para>

<programlisting>
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
</programlisting>

<para>
match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
</para>
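
<para>
The shorthand form can be tried with the sketch below, which assumes Python's
<literal>re</literal> module as a stand-in for GRegex. Only the first spelling
is used, because recent Python versions reject option letters that appear in
the middle of a pattern, as the second spelling requires. The helper
<literal>accepts</literal> is illustrative.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex; only the
# (?i:...) shorthand is shown because Python does not accept the
# (?:(?i)...) spelling in current versions.
SCOPED = re.compile(r"(?i:saturday|sunday)")
PLAIN = re.compile(r"saturday|sunday")


def accepts(regex, subject):
    """True when regex matches the whole subject."""
    return regex.fullmatch(subject) is not None


scoped_upper = accepts(SCOPED, "SUNDAY")
scoped_mixed = accepts(SCOPED, "Saturday")
plain_upper = accepts(PLAIN, "SUNDAY")

# The option stops applying at the closing parenthesis.
inside_only = (re.fullmatch(r"(?i:sun)day", "SUNday") is not None,
               re.fullmatch(r"(?i:sun)day", "SUNDAY") is not None)
```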
</refsect1>

<refsect1>
<title>Named subpatterns</title>
<para>
Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular expressions.
Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, GRegex supports the naming of
subpatterns. A subpattern can be named in one of three ways: (?&lt;name&gt;...) or
(?'name'...) as in Perl, or (?P&lt;name&gt;...) as in Python.
References to capturing parentheses from other
parts of the pattern, such as backreferences, recursion, and conditions,
can be made by name as well as by number.
</para>
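
<para>
A named subpattern can be read back either by name or by its number, as the
following sketch shows. It assumes Python's <literal>re</literal> module as a
stand-in for GRegex, so only the (?P&lt;name&gt;...) spelling is used; the two
Perl spellings are not accepted by that module. The pattern names
<literal>DATE</literal> and <literal>DOUBLED</literal> and the helper
<literal>parse_date</literal> are hypothetical.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex; it accepts only
# the (?P<name>...) spelling of a named subpattern.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# A back reference made by name rather than by number.
DOUBLED = re.compile(r"(?P<word>\w+) (?P=word)")


def parse_date(text):
    """Return the year by name, the year by number, and all named parts."""
    found = DATE.fullmatch(text)
    if found is None:
        return None
    return {
        "by_name": found.group("year"),
        "by_number": found.group(1),   # named groups are numbered as well
        "all": found.groupdict(),
    }


parsed = parse_date("2006-07-11")
rejected = parse_date("11 Jul 2006")
repeated = DOUBLED.search("it was the the best").group("word")
```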

<para>
Names consist of up to 32 alphanumeric characters and underscores. Named
capturing parentheses are still allocated numbers as well as names, exactly as
if the names were not present.
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the <varname>G_REGEX_DUPNAMES</varname> option at
compile time. This can be useful for patterns where only one instance of the
named parentheses can match. Suppose you want to match the name of a weekday,
either as a 3-letter abbreviation or as the full name, and in both cases you
want to extract the abbreviation. This pattern (ignoring the line breaks) does
the job:
</para>

<programlisting>
(?&lt;DN&gt;Mon|Fri|Sun)(?:day)?|
(?&lt;DN&gt;Tue)(?:sday)?|
(?&lt;DN&gt;Wed)(?:nesday)?|
(?&lt;DN&gt;Thu)(?:rsday)?|
(?&lt;DN&gt;Sat)(?:urday)?
</programlisting>

<para>
There are five capturing substrings, but only one is ever set after a match.
The function for extracting the data by name,
<function>g_match_info_fetch_named()</function>, returns the substring
for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was. If you
make a reference to a non-unique named subpattern from elsewhere in the
pattern, the one that corresponds to the lowest number is used.
</para>
</refsect1>

<refsect1>
<title>Repetition</title>
<para>
Repetition is specified by quantifiers, which can follow any of the
following items:
</para>

<itemizedlist>
<listitem><para>a literal data character</para></listitem>
<listitem><para>the dot metacharacter</para></listitem>
<listitem><para>the \C escape sequence</para></listitem>
<listitem><para>the \X escape sequence (in UTF-8 mode)</para></listitem>
<listitem><para>the \R escape sequence</para></listitem>
<listitem><para>an escape such as \d that matches a single character</para></listitem>
<listitem><para>a character class</para></listitem>
<listitem><para>a back reference (see next section)</para></listitem>
<listitem><para>a parenthesized subpattern (unless it is an assertion)</para></listitem>
</itemizedlist>

<para>
The general repetition quantifier specifies a minimum and maximum number
of permitted matches, by giving the two numbers in curly brackets
(braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:
</para>

<programlisting>
z{2,4}
</programlisting>

<para>
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus
</para>

<programlisting>
[aeiou]{3,}
</programlisting>

<para>
matches at least 3 successive vowels, but may match many more, while
</para>

<programlisting>
\d{8}
</programlisting>

<para>
matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For example,
{,6} is not a quantifier, but a literal string of four characters.
</para>
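
<para>
The three quantifier forms above can be compared side by side. The sketch
assumes Python's <literal>re</literal> module as a stand-in for GRegex. The
{,6} case is deliberately left out, because Python does treat that form as a
quantifier and so does not follow the rule just described. The helper
<literal>full</literal> is illustrative.
</para>

```python
import re

# Assumption: Python's re module stands in for GRegex. The {,6} case is
# omitted on purpose: Python reads it as a quantifier, unlike GRegex.


def full(pattern, subject):
    """True when pattern matches the whole subject."""
    return re.fullmatch(pattern, subject) is not None


# {2,4}: between two and four repetitions.
z_counts = [n for n in range(0, 7) if full(r"z{2,4}", "z" * n)]

# {3,}: at least three, and greedy, so the whole run of vowels is taken.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
eight_digits = full(r"\d{8}", "20060711")
seven_digits = full(r"\d{8}", "2006071")
```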

<para>
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8
characters, each of which is represented by a two-byte sequence. Similarly,
\X{3} matches three Unicode extended sequences, each of which may be
several bytes long (and they may be of different lengths).
</para>

<para>
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present.
</para>

<para>
For convenience, the three most common quantifiers have single-character
abbreviations:
</para>

<table>
<title>Abbreviations for quantifiers</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Abbreviation</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>*</entry>
    <entry>is equivalent to {0,}</entry>
  </row>
  <row>
    <entry>+</entry>
    <entry>is equivalent to {1,}</entry>
  </row>
  <row>
    <entry>?</entry>
    <entry>is equivalent to {0,1}</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
</para>

<programlisting>
(a?)*
</programlisting>

<para>
Because there are cases where this can be useful, such patterns are
accepted, but if any repetition of the subpattern does in fact match
no characters, the loop is forcibly broken.
</para>

<para>
By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
appear between /* and */ and within the comment, individual * and /
characters may appear. An attempt to match C comments by applying the
pattern
</para>

<programlisting>
/\*.*\*/
</programlisting>

<para>
to the string
</para>

<programlisting>
/* first comment */  not comment  /* second comment */
</programlisting>

<para>
fails, because it matches the entire string owing to the greediness of
the .* item.
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
However, if a quantifier is followed by a question mark, it ceases to
Packit ae235b
be greedy, and instead matches the minimum number of times possible, so
Packit ae235b
the pattern
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
/\*.*?\*/
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
does the right thing with the C comments. The meaning of the various
Packit ae235b
quantifiers is not otherwise changed, just the preferred number of
Packit ae235b
matches. Do not confuse this use of question mark with its use as a
Packit ae235b
quantifier in its own right. Because it has two uses, it can sometimes
Packit ae235b
appear doubled, as in
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
\d??\d
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
which matches one digit by preference, but can match two if that is the
Packit ae235b
only way the rest of the pattern matches.
Packit ae235b
</para>
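<para>
The difference between greedy and ungreedy repetition can be seen by running both
comment patterns over the sample string. This is a sketch in Python's re module,
which is not the engine used by GRegex but is assumed to share the same greedy
and lazy semantics; the variable names are illustrative.
</para>

```python
import re

text = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the end and backs off only as far as the last "*/".
greedy = re.search(r"/\*.*\*/", text).group(0)

# Ungreedy: .*? stops at the first "*/" it can reach.
lazy = re.search(r"/\*.*?\*/", text).group(0)

# Doubled question mark: one digit by preference, two only when required.
preferred = re.match(r"\d??\d", "12").group(0)
forced = re.fullmatch(r"\d??\d", "12").group(0)
```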
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If the <varname>G_REGEX_UNGREEDY</varname> flag is set, the quantifiers are not greedy
Packit ae235b
by default, but individual ones can be made greedy by following them with
Packit ae235b
a question mark. In other words, it inverts the default behaviour.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
When a parenthesized subpattern is quantified with a minimum repeat
Packit ae235b
count that is greater than 1 or with a limited maximum, more memory is
Packit ae235b
required for the compiled pattern, in proportion to the size of the
Packit ae235b
minimum or maximum.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If a pattern starts with .* or .{0,} and the <varname>G_REGEX_DOTALL</varname> flag
Packit ae235b
is set, thus allowing the dot to match newlines, the
Packit ae235b
pattern is implicitly anchored, because whatever follows will be tried
Packit ae235b
against every character position in the string, so there is no
Packit ae235b
point in retrying the overall match at any position after the first.
Packit ae235b
GRegex normally treats such a pattern as though it were preceded by \A.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
In cases where it is known that the string contains no newlines, it
Packit ae235b
is worth setting <varname>G_REGEX_DOTALL</varname> in order to obtain this optimization,
Packit ae235b
or alternatively using ^ to indicate anchoring explicitly.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
However, there is one situation where the optimization cannot be used.
Packit ae235b
When .* is inside capturing parentheses that are the subject of a
Packit ae235b
backreference elsewhere in the pattern, a match at the start may fail
Packit ae235b
where a later one succeeds. Consider, for example:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(.*)abc\1
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If the string is "xyz123abc123" the match point is the fourth character.
Packit ae235b
For this reason, such a pattern is not implicitly anchored.
Packit ae235b
</para>
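<para>
The match position in this example can be confirmed with a short sketch. It uses
Python's re module as a stand-in for GRegex, on the assumption that both engines
resolve the back reference in the same way; note that Python reports zero-based
offsets, so the fourth character is offset 3.
</para>

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()       # zero-based offset of the overall match
captured = m.group(1)   # text that the back reference had to repeat
whole = m.group(0)
```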
Packit ae235b
Packit ae235b
<para>
Packit ae235b
When a capturing subpattern is repeated, the value captured is the
Packit ae235b
substring that matched the final iteration. For example, after
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(tweedle[dume]{3}\s*)+
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
has matched "tweedledum tweedledee" the value of the captured substring
Packit ae235b
is "tweedledee". However, if there are nested capturing subpatterns,
Packit ae235b
the corresponding captured values may have been set in previous iterations.
Packit ae235b
For example, after
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
/(a|(b))+/
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches "aba" the value of the second captured substring is "b".
Packit ae235b
</para>
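<para>
Both statements about repeated capturing subpatterns can be checked with the
following sketch. It is written for Python's re module, a different engine from
the one behind GRegex, which is assumed to keep captured values across iterations
in the same way; the variable names are illustrative.
</para>

```python
import re

# The outer group is overwritten on every iteration, so only the last one survives.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group is not touched by the final iteration (which matched "a"),
# so it still holds the value set during the previous iteration.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)
```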
Packit ae235b
</refsect1>
Packit ae235b
Packit ae235b
<refsect1>
Packit ae235b
<title>Atomic grouping and possessive quantifiers</title>
Packit ae235b
<para>
Packit ae235b
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
Packit ae235b
repetition, failure of what follows normally causes the repeated
Packit ae235b
item to be re-evaluated to see if a different number
Packit ae235b
of repeats allows the rest of the pattern to match. Sometimes it
Packit ae235b
is useful to prevent this, either to change the nature of the
Packit ae235b
match, or to cause it to fail earlier than it otherwise might, when the
Packit ae235b
author of the pattern knows there is no point in carrying on.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Consider, for example, the pattern \d+foo when applied to the string
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
123456bar
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
After matching all 6 digits and then failing to match "foo", the normal
Packit ae235b
action of the matcher is to try again with only 5 digits matching the
Packit ae235b
\d+ item, and then with 4, and so on, before ultimately failing.
Packit ae235b
"Atomic grouping" (a term taken from Jeffrey Friedl’s book) provides
Packit ae235b
the means for specifying that once a subpattern has matched, it is not
Packit ae235b
to be re-evaluated in this way.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If we use atomic grouping for the previous example, the matcher
Packit ae235b
gives up immediately on failing to match "foo" the first time. The notation
Packit ae235b
is a kind of special parenthesis, starting with (?> as in this
Packit ae235b
example:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?>\d+)foo
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
This kind of parenthesis "locks up" the part of the pattern it contains
Packit ae235b
once it has matched, and a failure further into the pattern is
Packit ae235b
prevented from backtracking into it. Backtracking past it to previous
Packit ae235b
items, however, works as normal.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
An alternative description is that a subpattern of this type matches
Packit ae235b
the string of characters that an identical standalone pattern would
Packit ae235b
match, if anchored at the current point in the string.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
Packit ae235b
such as the above example can be thought of as a maximizing repeat that
Packit ae235b
must swallow everything it can. So, while both \d+ and \d+? are prepared
Packit ae235b
to adjust the number of digits they match in order to make the
Packit ae235b
rest of the pattern match, (?>\d+) can only match an entire sequence of
Packit ae235b
digits.
Packit ae235b
</para>
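<para>
The effect of refusing to give characters back can be sketched as follows. The
example is in Python's re module, not GRegex, and it does not use the atomic group
syntax itself, because that syntax is only available in newer Python versions.
Instead it emulates an atomic group with a lookahead that captures the digits and
a back reference that consumes them; this emulation is an assumption of the
sketch, chosen because a completed lookahead is never re-entered.
</para>

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ gives back the final digit so that "5" can match.
backtracking = re.fullmatch(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the entire run of digits and
# (?:\1) consumes it in one piece, so no digit is left over for the trailing "5".
# The (?:...) wrapper keeps the following "5" from being read as part of \15.
atomic_like = re.fullmatch(r"(?=(\d+))(?:\1)5", subject)
```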
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Atomic groups in general can of course contain arbitrarily complicated
Packit ae235b
subpatterns, and can be nested. However, when the subpattern for an
Packit ae235b
atomic group is just a single repeated item, as in the example above, a
Packit ae235b
simpler notation, called a "possessive quantifier" can be used. This
Packit ae235b
consists of an additional + character following a quantifier. Using
Packit ae235b
this notation, the previous example can be rewritten as
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
\d++foo
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Possessive quantifiers are always greedy; the setting of the
Packit ae235b
<varname>G_REGEX_UNGREEDY</varname> option is ignored. They are a convenient notation for the
Packit ae235b
simpler forms of atomic group. However, there is no difference in the
Packit ae235b
meaning of a possessive quantifier and the equivalent
Packit ae235b
atomic group, though there may be a performance difference;
Packit ae235b
possessive quantifiers should be slightly faster.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
The possessive quantifier syntax is an extension to the Perl syntax.
Packit ae235b
It was invented by Jeffrey Friedl in the first edition of his book and
Packit ae235b
then implemented by Mike McCloskey in Sun's Java package.
Packit ae235b
It ultimately found its way into Perl at release 5.10.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
GRegex has an optimization that automatically "possessifies" certain simple
Packit ae235b
pattern constructs. For example, the sequence A+B is treated as A++B because
Packit ae235b
there is no point in backtracking into a sequence of A's when B must follow.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
When a pattern contains an unlimited repeat inside a subpattern that
Packit ae235b
can itself be repeated an unlimited number of times, the use of an
Packit ae235b
atomic group is the only way to avoid some failing matches taking a
Packit ae235b
very long time indeed. The pattern
Packit ae235b
</para>
Packit ae235b
 
Packit ae235b
<programlisting>
Packit ae235b
(\D+|<\d+>)*[!?]
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches an unlimited number of substrings that either consist of non-digits,
or digits enclosed in <>, followed by either ! or ?. When it
Packit ae235b
matches, it runs quickly. However, if it is applied to
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
it takes a long time before reporting failure. This is because the
Packit ae235b
string can be divided between the internal \D+ repeat and the external
Packit ae235b
* repeat in a large number of ways, and all have to be tried. (The
Packit ae235b
example uses [!?] rather than a single character at the end, because
Packit ae235b
GRegex has an optimization that allows for fast failure
Packit ae235b
when a single character is used. It remembers the last single character
that is required for a match, and fails early if it is not present
Packit ae235b
in the string.) If the pattern is changed so that it uses an atomic
Packit ae235b
group, like this:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
((?>\D+)|<\d+>)*[!?]
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
sequences of non-digits cannot be broken, and failure happens quickly.
Packit ae235b
</para>
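<para>
The atomic rewrite can be tried out with the sketch below. It runs on Python's re
module rather than GRegex, and it emulates the atomic group with a capturing
lookahead followed by a back reference (group 2 here), which is an assumption of
the sketch rather than the notation described above. No timings are measured; the
sketch only shows that the failing subject is rejected and that a tagged subject
still matches.
</para>

```python
import re

# Atomic-like form of (\D+|<\d+>)*[!?]: the lookahead captures a whole run of
# non-digits as group 2, and \2 consumes that run without ever splitting it.
pattern = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

# 52 letters and no "!" or "?": each starting position is rejected after a
# single pass over the rest of the subject.
no_match = pattern.search("a" * 52)

# A subject that needs the second alternative still matches as a whole.
tagged = pattern.fullmatch("<12>!")
```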
Packit ae235b
</refsect1>
Packit ae235b
Packit ae235b
<refsect1>
Packit ae235b
<title>Back references</title>
Packit ae235b
<para>
Packit ae235b
Outside a character class, a backslash followed by a digit greater than
Packit ae235b
0 (and possibly further digits) is a back reference to a capturing subpattern
Packit ae235b
earlier (that is, to its left) in the pattern, provided there have been that
Packit ae235b
many previous capturing left parentheses.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
However, if the decimal number following the backslash is less than 10,
Packit ae235b
it is always taken as a back reference, and causes an error only if
Packit ae235b
there are not that many capturing left parentheses in the entire pattern.
Packit ae235b
In other words, the parentheses that are referenced need not be
Packit ae235b
to the left of the reference for numbers less than 10. A "forward back
Packit ae235b
reference" of this type can make sense when a repetition is involved and
Packit ae235b
the subpattern to the right has participated in an earlier iteration.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
It is not possible to have a numerical "forward back reference" to a subpattern
whose number is 10 or more using this syntax because a sequence such as \50 is
Packit ae235b
interpreted as a character defined in octal. See the subsection entitled
Packit ae235b
"Non-printing characters" above for further details of the handling of digits
Packit ae235b
following a backslash. There is no such problem when named parentheses are used.
Packit ae235b
A back reference to any subpattern is possible using named parentheses (see below).
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Another way of avoiding the ambiguity inherent in the use of digits following a
Packit ae235b
backslash is to use the \g escape sequence (introduced in Perl 5.10).
Packit ae235b
This escape must be followed by a positive or a negative number,
Packit ae235b
optionally enclosed in braces.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
A positive number specifies an absolute reference without the ambiguity that is
Packit ae235b
present in the older syntax. It is also useful when literal digits follow the
Packit ae235b
reference. A negative number is a relative reference. In "(abc(def)ghi)\g{-1}",
the sequence \g{-1} is a reference to the most recently started capturing
subpattern before \g, that is, it is equivalent to \2. Similarly, \g{-2}
Packit ae235b
would be equivalent to \1. The use of relative references can be helpful in
Packit ae235b
long patterns, and also in patterns that are created by joining together
Packit ae235b
fragments that contain references within themselves.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
A back reference matches whatever actually matched the capturing subpattern
Packit ae235b
in the current string, rather than anything matching
Packit ae235b
the subpattern itself (see "Subpatterns as subroutines" below for a way
Packit ae235b
of doing that). So the pattern
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(sens|respons)e and \1ibility
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches "sense and sensibility" and "response and responsibility", but
Packit ae235b
not "sense and responsibility". If caseful matching is in force at the
Packit ae235b
time of the back reference, the case of letters is relevant. For example,
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
((?i)rah)\s+\1
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
Packit ae235b
original capturing subpattern is matched caselessly.
Packit ae235b
</para>
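<para>
Both back reference examples can be exercised with the sketch below, written for
Python's re module as a stand-in for GRegex. One assumption is worth flagging:
the scoped form (?i:rah) is used in place of (?i) inside the group, on the
understanding that both restrict caseless matching to the group itself.
</para>

```python
import re

# The back reference repeats the text that was matched, not the subpattern.
words = re.compile(r"(sens|respons)e and \1ibility")
word_results = [
    bool(words.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# The group matches caselessly, but the back reference itself is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]
```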
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Back references to named subpatterns use the Perl syntax \k<name> or \k'name'
Packit ae235b
or the Python syntax (?P=name). We could rewrite the above example in either of
Packit ae235b
the following ways:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<p1>(?i)rah)\s+\k<p1>
Packit ae235b
(?P<p1>(?i)rah)\s+(?P=p1)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
A subpattern that is referenced by name may appear in the pattern before or
Packit ae235b
after the reference.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
There may be more than one back reference to the same subpattern. If a
Packit ae235b
subpattern has not actually been used in a particular match, any back
Packit ae235b
references to it always fail. For example, the pattern
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(a|(bc))\2
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
always fails if it starts to match "a" rather than "bc". Because there
Packit ae235b
may be many capturing parentheses in a pattern, all digits following
Packit ae235b
the backslash are taken as part of a potential back reference number.
Packit ae235b
If the pattern continues with a digit character, some delimiter must be
Packit ae235b
used to terminate the back reference. If the <varname>G_REGEX_EXTENDED</varname> flag is
Packit ae235b
set, this can be whitespace. Otherwise an empty comment (see "Comments" below) can be used.
Packit ae235b
</para>
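<para>
The failure of a back reference to an unused subpattern can be seen in a short
sketch. It uses Python's re module, which is assumed to agree with GRegex on this
point; the variable names are illustrative.
</para>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 has something to repeat.
via_bc = pattern.fullmatch("bcbc")

# Only the "a" alternative can match here, so group 2 is unset and \2 fails.
via_a = pattern.search("aa")
```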
Packit ae235b
Packit ae235b
<para>
Packit ae235b
A back reference that occurs inside the parentheses to which it refers
Packit ae235b
fails when the subpattern is first used, so, for example, (a\1) never
Packit ae235b
matches. However, such references can be useful inside repeated subpatterns.
Packit ae235b
For example, the pattern
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(a|b\1)+
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration
Packit ae235b
of the subpattern, the back reference matches the character
Packit ae235b
string corresponding to the previous iteration. In order for this to
Packit ae235b
work, the pattern must be such that the first iteration does not need
Packit ae235b
to match the back reference. This can be done using alternation, as in
Packit ae235b
the example above, or by a quantifier with a minimum of zero.
Packit ae235b
</para>
Packit ae235b
</refsect1>
Packit ae235b
Packit ae235b
<refsect1>
Packit ae235b
<title>Assertions</title>
Packit ae235b
<para>
Packit ae235b
An assertion is a test on the characters following or preceding the
Packit ae235b
current matching point that does not actually consume any characters.
Packit ae235b
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
Packit ae235b
described above.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
More complicated assertions are coded as subpatterns. There are two
Packit ae235b
kinds: those that look ahead of the current position in the
Packit ae235b
string, and those that look behind it. An assertion subpattern is
Packit ae235b
matched in the normal way, except that it does not cause the current
Packit ae235b
matching position to be changed.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Assertion subpatterns are not capturing subpatterns, and may not be
Packit ae235b
repeated, because it makes no sense to assert the same thing several
Packit ae235b
times. If any kind of assertion contains capturing subpatterns within
Packit ae235b
it, these are counted for the purposes of numbering the capturing
Packit ae235b
subpatterns in the whole pattern. However, substring capturing is carried
Packit ae235b
out only for positive assertions, because it does not make sense for
Packit ae235b
negative assertions.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<refsect2>
Packit ae235b
<title>Lookahead assertions</title>
Packit ae235b
<para>
Packit ae235b
Lookahead assertions start with (?= for positive assertions and (?! for
Packit ae235b
negative assertions. For example,
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
\w+(?=;)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches a word followed by a semicolon, but does not include the semicolon
Packit ae235b
in the match, and
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
foo(?!bar)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches any occurrence of "foo" that is not followed by "bar". Note
Packit ae235b
that the apparently similar pattern
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?!foo)bar
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
does not find an occurrence of "bar" that is preceded by something
Packit ae235b
other than "foo"; it finds any occurrence of "bar" whatsoever, because
Packit ae235b
the assertion (?!foo) is always true when the next three characters are
Packit ae235b
"bar". A lookbehind assertion is needed to achieve the other effect.
Packit ae235b
</para>
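<para>
The three lookahead patterns above can be compared side by side. This sketch uses
Python's re module, which is a different engine from GRegex but is assumed to
implement lookahead identically; the subject strings are illustrative and offsets
are zero-based.
</para>

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "total; rest").group(0)

# Only the "foo" that is not followed by "bar" is reported.
foo_hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it can never reject a "bar".
bar_hits = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]
```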
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If you want to force a matching failure at some point in a pattern, the
Packit ae235b
most convenient way to do it is with (?!) because an empty string
Packit ae235b
always matches, so an assertion that requires there not to be an empty
Packit ae235b
string must always fail.
Packit ae235b
</para>
Packit ae235b
</refsect2>
Packit ae235b
Packit ae235b
<refsect2>
Packit ae235b
<title>Lookbehind assertions</title>
Packit ae235b
<para>
Packit ae235b
Lookbehind assertions start with (?<= for positive assertions and (?<!
Packit ae235b
for negative assertions. For example,
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<!foo)bar
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
does find an occurrence of "bar" that is not preceded by "foo". The
Packit ae235b
contents of a lookbehind assertion are restricted such that all the
Packit ae235b
strings it matches must have a fixed length. However, if there are
Packit ae235b
several top-level alternatives, they do not all have to have the same
Packit ae235b
fixed length. Thus
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=bullock|donkey)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
is permitted, but
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<!dogs?|cats?)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
causes an error at compile time. Branches that match different length
Packit ae235b
strings are permitted only at the top level of a lookbehind assertion.
Packit ae235b
An assertion such as
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=ab(c|de))
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
is not permitted, because its single top-level branch can match two
Packit ae235b
different lengths, but it is acceptable if rewritten to use two
top-level branches:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=abc|abde)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
The implementation of lookbehind assertions is, for each alternative,
Packit ae235b
to temporarily move the current position back by the fixed length and
Packit ae235b
then try to match. If there are insufficient characters before the
Packit ae235b
current position, the assertion fails.
Packit ae235b
</para>
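<para>
A negative lookbehind, including the case where too few characters precede the
current position, can be sketched as follows. The sketch runs on Python's re
module rather than GRegex. Python is stricter than the rule described above: it
requires a single fixed width for the whole lookbehind, so only single-branch
assertions are shown.
</para>

```python
import re

# "bar" at offset 3 is preceded by "foo" and is skipped; the other two are kept.
hits = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar bar")]

# With fewer than three characters before the position, the inner match fails,
# so the negative assertion succeeds and the positive one does not.
negative_at_start = re.search(r"(?<!foo)bar", "bar")
positive_at_start = re.search(r"(?<=foo)bar", "bar")
```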
Packit ae235b
Packit ae235b
<para>
Packit ae235b
GRegex does not allow the \C escape (which matches a single byte in UTF-8
Packit ae235b
mode) to appear in lookbehind assertions, because it makes it impossible
Packit ae235b
to calculate the length of the lookbehind. The \X and \R escapes, which can
Packit ae235b
match different numbers of bytes, are also not permitted.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Possessive quantifiers can be used in conjunction with lookbehind assertions to
Packit ae235b
specify efficient matching at the end of the subject string. Consider a simple
Packit ae235b
pattern such as
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
abcd$
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
when applied to a long string that does not match. Because matching
Packit ae235b
proceeds from left to right, GRegex will look for each "a" in the string
Packit ae235b
and then see if what follows matches the rest of the pattern. If the
Packit ae235b
pattern is specified as
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
^.*abcd$
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
the initial .* matches the entire string at first, but when this fails
Packit ae235b
(because there is no following "a"), it backtracks to match all but the
Packit ae235b
last character, then all but the last two characters, and so on. Once
Packit ae235b
again the search for "a" covers the entire string, from right to left,
Packit ae235b
so we are no better off. However, if the pattern is written as
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
^.*+(?<=abcd)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
there can be no backtracking for the .*+ item; it can match only the
Packit ae235b
entire string. The subsequent lookbehind assertion does a single test
Packit ae235b
on the last four characters. If it fails, the match fails immediately.
Packit ae235b
For long strings, this approach makes a significant difference to the
Packit ae235b
processing time.
Packit ae235b
</para>
Packit ae235b
</refsect2>
Packit ae235b
Packit ae235b
<refsect2>
Packit ae235b
<title>Using multiple assertions</title>
Packit ae235b
<para>
Packit ae235b
Several assertions (of any sort) may occur in succession. For example,
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=\d{3})(?<!999)foo
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches "foo" preceded by three digits that are not "999". Notice that
Packit ae235b
each of the assertions is applied independently at the same point in
Packit ae235b
the string. First there is a check that the previous three
Packit ae235b
characters are all digits, and then there is a check that the same
Packit ae235b
three characters are not "999". This pattern does not match "foo" preceded
Packit ae235b
by six characters, the first three of which are digits and the last
Packit ae235b
three of which are not "999". For example, it doesn’t match "123abcfoo".
Packit ae235b
A pattern to do that is
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=\d{3}...)(?<!999)foo
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
This time the first assertion looks at the preceding six characters,
Packit ae235b
checking that the first three are digits, and then the second assertion
Packit ae235b
checks that the preceding three characters are not "999".
Packit ae235b
</para>
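<para>
The two patterns can be contrasted on a few subjects. The sketch below uses
Python's re module as a stand-in for GRegex, assuming both apply successive
lookbehind assertions at the same position; the subject strings are illustrative.
</para>

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six characters, the second only the last three.
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

adjacent_results = [
    bool(adjacent.search(s)) for s in ("123foo", "999foo", "123abcfoo")
]
spaced_results = [
    bool(spaced.search(s)) for s in ("123abcfoo", "123999foo", "123foo")
]
```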
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Assertions can be nested in any combination. For example,
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=(?<!foo)bar)baz
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
matches an occurrence of "baz" that is preceded by "bar" which in turn
Packit ae235b
is not preceded by "foo", while
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<=\d{3}(?!999)...)foo
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
is another pattern that matches "foo" preceded by three digits and any
Packit ae235b
three characters that are not "999".
Packit ae235b
</para>
Packit ae235b
</refsect2>
Packit ae235b
</refsect1>
Packit ae235b
Packit ae235b
<refsect1>
Packit ae235b
<title>Conditional subpatterns</title>
Packit ae235b
<para>
Packit ae235b
It is possible to cause the matching process to obey a subpattern
Packit ae235b
conditionally or to choose between two alternative subpatterns, depending
Packit ae235b
on the result of an assertion, or whether a previous capturing subpattern
Packit ae235b
matched or not. The two possible forms of conditional subpattern are
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?(condition)yes-pattern)
Packit ae235b
(?(condition)yes-pattern|no-pattern)
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
If the condition is satisfied, the yes-pattern is used; otherwise the
Packit ae235b
no-pattern (if present) is used. If there are more than two alternatives
Packit ae235b
in the subpattern, a compile-time error occurs.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
There are four kinds of condition: references to subpatterns, references to
Packit ae235b
recursion, a pseudo-condition called DEFINE, and assertions.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<refsect2>
Packit ae235b
<title>Checking for a used subpattern by number</title>
Packit ae235b
<para>
Packit ae235b
If the text between the parentheses consists of a sequence of digits, the
Packit ae235b
condition is true if the capturing subpattern of that number has previously
Packit ae235b
matched.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Consider the following pattern, which contains non-significant white space
Packit ae235b
to make it more readable (assume the <varname>G_REGEX_EXTENDED</varname> flag)
Packit ae235b
and to divide it into three parts for ease of discussion:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
( \( )?    [^()]+    (?(1) \) )
Packit ae235b
</programlisting>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
The first part matches an optional opening parenthesis, and if that
Packit ae235b
character is present, sets it as the first captured substring. The second
Packit ae235b
part matches one or more characters that are not parentheses. The
Packit ae235b
third part is a conditional subpattern that tests whether the first set
Packit ae235b
of parentheses matched or not. If they did, that is, if the string started
Packit ae235b
with an opening parenthesis, the condition is true, and so the yes-pattern
Packit ae235b
is executed and a closing parenthesis is required. Otherwise,
Packit ae235b
since no-pattern is not present, the subpattern matches nothing. In
Packit ae235b
other words, this pattern matches a sequence of non-parentheses,
Packit ae235b
optionally enclosed in parentheses.
Packit ae235b
</para>
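<para>
The behaviour of this conditional subpattern can be checked with a sketch in
Python's re module, which is not the engine behind GRegex but is assumed to handle
a numbered condition the same way. Python's re.VERBOSE flag stands in for
<varname>G_REGEX_EXTENDED</varname> so that the white space in the pattern is
ignored.
</para>

```python
import re

pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# The closing parenthesis is demanded only when the opening one was captured.
subjects = ("(abc)", "abc", "(abc", "abc)")
results = [bool(pattern.fullmatch(s)) for s in subjects]
```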
Packit ae235b
</refsect2>
Packit ae235b
Packit ae235b
<refsect2>
Packit ae235b
<title>Checking for a used subpattern by name</title>
Packit ae235b
<para>
Packit ae235b
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
Packit ae235b
subpattern by name; the Python syntax (?(name)...) is also recognized. However,
Packit ae235b
there is a possible ambiguity with this syntax, because subpattern names may
Packit ae235b
consist entirely of digits. GRegex looks first for a named subpattern; if it
Packit ae235b
cannot find one and the name consists entirely of digits, GRegex looks for a
Packit ae235b
subpattern of that number, which must be greater than zero. Using subpattern
Packit ae235b
names that consist entirely of digits is not recommended.
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<para>
Packit ae235b
Rewriting the above example to use a named subpattern gives this:
Packit ae235b
</para>
Packit ae235b
Packit ae235b
<programlisting>
Packit ae235b
(?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
Packit ae235b
</programlisting>
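<para>
The Python form of the condition, (?(name)...), can be exercised with Python's
own re module, again only as a stand-in for GRegex. In this sketch the group
is declared with Python's (?P&lt;OPEN&gt;...) spelling; NAMED and check are
hypothetical names.
</para>

```python
import re

# Python declares the named group as (?P<OPEN>...) and tests it with (?(OPEN)...).
# Assumption: Python's re stands in for GRegex; white space is ignored via VERBOSE.
NAMED = re.compile(r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)


def check(text):
    """Return (matched, text captured by OPEN) for a whole-string match."""
    match = NAMED.fullmatch(text)
    return (match is not None, match.group("OPEN") if match else None)


if __name__ == "__main__":
    for sample in ("(abc)", "abc", "(abc"):
        print(sample, check(sample))
```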
</refsect2>

<refsect2>
<title>Checking for pattern recursion</title>
<para>
If the condition is the string (R), and there is no subpattern with the name R,
the condition is true if a recursive call to the whole pattern or any
subpattern has been made. If digits or a name preceded by ampersand follow the
letter R, for example:
</para>

<programlisting>
(?(R3)...)
(?(R&name)...)
</programlisting>

<para>
the condition is true if the most recent recursion is into the subpattern whose
number or name is given. This condition does not check the entire recursion
stack.
</para>

<para>
At "top level", all these recursion test conditions are false. Recursive
patterns are described below.
</para>
</refsect2>

<refsect2>
<title>Defining subpatterns for use by reference only</title>
<para>
If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
"subroutines" that can be referenced from elsewhere. (The use of "subroutines"
is described below.) For example, a pattern to match an IPv4 address could be
written like this (ignore whitespace and line breaks):
</para>

<programlisting>
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
</programlisting>

<para>
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
pattern is skipped because DEFINE acts like a false condition.
</para>

<para>
The rest of the pattern uses references to the named group to match the four
dot-separated components of an IPv4 address, insisting on a word boundary at
each end.
</para>
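<para>
Python's standard re module has neither DEFINE groups nor (?&amp;name) calls,
so the sketch below swaps in a different technique, plain string composition,
to show the same "define once, use four times" idea. It is an approximation
under that assumption, not GRegex behaviour; BYTE, IPV4 and is_ipv4 are
hypothetical names.
</para>

```python
import re

# One "byte" component: the same alternatives as the DEFINE group in the text.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Assumption: the fragment is spliced in as text wherever the original pattern
# uses (?&byte). re.ASCII keeps \d and \b restricted to ASCII characters.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("192.168.0.1", "10.0.0.255", "256.1.1.1", "1.2.3"):
        print(sample, is_ipv4(sample))
```

<para>
Unlike a real subroutine call, a spliced-in copy is not treated as an atomic
group, so the two approaches can differ in how they backtrack.
</para>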
</refsect2>

<refsect2>
<title>Assertion conditions</title>
<para>
If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
</para>

<programlisting>
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
</programlisting>

<para>
The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the string. If a
letter is found, the string is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
</para>
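<para>
Python's standard re module does not accept an assertion as a condition, so
the following sketch rewrites the conditional as two alternatives guarded by
complementary lookaheads. For this pattern the rewrite should select the same
branch as the conditional would; it is offered as an illustration, not as
GRegex syntax, and DATE is a hypothetical name.
</para>

```python
import re

# Assumption: "(?(?=X) A | B)" is emulated as "(?=X) A | (?!X) B".
DATE = re.compile(r"""
    (?: (?=[^a-z]*[a-z])   \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z])   \d{2}-\d{2}-\d{2}      # no letter ahead
    )
""", re.VERBOSE | re.ASCII)


if __name__ == "__main__":
    for sample in ("12-aug-99", "12-08-99", "12-au-99"):
        print(sample, DATE.fullmatch(sample) is not None)
```

<para>
Note that the lookahead inspects everything after the current position, so a
letter later in the string forces the first alternative even when the text at
the current position has the all-digit form.
</para>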
</refsect2>
</refsect1>

<refsect1>
<title>Comments</title>
<para>
The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.
</para>

<para>
If the <varname>G_REGEX_EXTENDED</varname> option is set, an unescaped #
character outside a character class introduces a comment that runs up to
and including the next newline in the pattern.
</para>
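<para>
Both comment styles also exist in Python's standard re module, which is used
below purely as a stand-in: re.VERBOSE is assumed to play the role of
<varname>G_REGEX_EXTENDED</varname>, and INLINE and EXTENDED are hypothetical
names.
</para>

```python
import re

# An inline (?#...) comment is accepted with or without the extended option.
INLINE = re.compile(r"\d{3}(?#first part)-\d{4}")

# With re.VERBOSE an unescaped # outside a character class starts a comment
# that runs to the end of the line; inside a class it is an ordinary character.
EXTENDED = re.compile(r"""
    \d{3}   # first part
    [#-]    # separator: a literal '#' or '-'
    \d{4}   # second part
""", re.VERBOSE)


if __name__ == "__main__":
    print(INLINE.fullmatch("555-1234") is not None)
    print(EXTENDED.fullmatch("555#1234") is not None)
    print(EXTENDED.fullmatch("555 1234") is not None)
```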
</refsect1>

<refsect1>
<title>Recursive patterns</title>
<para>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.
</para>

<para>
For some time, Perl has provided a facility that allows regular expressions to
recurse (amongst other things). It does this by interpolating Perl code in the
expression at run time, and the code can refer to the expression itself. A Perl
pattern using code interpolation to solve the parentheses problem can be
created like this:
</para>

<programlisting>
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
</programlisting>

<para>
The (?p{...}) item interpolates Perl code at run time, and in this case refers
recursively to the pattern in which it appears.
</para>

<para>
Obviously, GRegex cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual subpattern recursion. This kind of recursion was introduced into
Perl at release 5.10.
</para>

<para>
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive call of the subpattern of the given number,
provided that it occurs inside that subpattern. (If not, it is a "subroutine"
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
</para>

<para>
In GRegex (like Python, but unlike Perl), a recursive subpattern call is always
treated as an atomic group. That is, once it has matched some of the subject
string, it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure.
</para>

<para>
This pattern solves the nested parentheses problem (assume the
<varname>G_REGEX_EXTENDED</varname> option is set so that white space is
ignored):
</para>

<programlisting>
\( ( (?>[^()]+) | (?R) )* \)
</programlisting>

<para>
First it matches an opening parenthesis. Then it matches any number of
substrings, each of which can be either a sequence of non-parentheses or a
recursive match of the pattern itself (that is, a correctly parenthesized
substring). Finally there is a closing parenthesis.
</para>
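<para>
To make the three steps concrete, here is a hand-written recursive function in
Python. It is a swapped-in technique, not the pattern itself (Python's
standard re module has no (?R) syntax), and it is only a sketch of which
strings the pattern accepts when anchored at a given position. The name
match_parens is hypothetical.
</para>

```python
def match_parens(text, pos=0):
    """Return the index just past a balanced group starting at pos, or None.

    Mirrors the pattern: an opening parenthesis, then any number of
    non-parenthesis characters or nested groups, then a closing parenthesis.
    """
    if pos >= len(text) or text[pos] != "(":
        return None
    i = pos + 1
    while i < len(text):
        if text[i] == ")":
            return i + 1                  # the final \)
        if text[i] == "(":
            end = match_parens(text, i)   # plays the role of (?R)
            if end is None:
                return None
            i = end
        else:
            i += 1                        # part of a run of non-parentheses
    return None                           # ran out of text before closing


if __name__ == "__main__":
    for sample in ("(ab(cd)ef)", "()", "(a(b)", "abc"):
        print(sample, match_parens(sample))
```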
<para>
If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
</para>

<programlisting>
( \( ( (?>[^()]+) | (?1) )* \) )
</programlisting>

<para>
We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern. In a larger pattern, keeping
track of parenthesis numbers can be tricky. It may be more convenient to
use named parentheses instead.
The Perl syntax for this is (?&name); GRegex also supports the (?P>name)
syntax. We could rewrite the above example as follows:
</para>

<programlisting>
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
</programlisting>

<para>
If there is more than one subpattern with the same name, the earliest one is
used. This particular example pattern contains nested unlimited repeats, and so
the use of atomic grouping for matching strings of non-parentheses is important
when applying the pattern to strings that do not match.
For example, when this pattern is applied to
</para>

<programlisting>
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
</programlisting>

<para>
it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the string, and all
have to be tested before failure can be reported.
</para>

<para>
At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set.
<!--
If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). -->
If the pattern above is matched against
</para>

<programlisting>
(ab(cd)ef)
</programlisting>

<para>
the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving
</para>

<programlisting>
\( ( ( (?>[^()]+) | (?R) )* ) \)
   ^                        ^
   ^                        ^
</programlisting>

<para>
the string they capture is "ab(cd)ef", the contents of the top level
parentheses.
</para>

<para>
Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brackets,
allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permitted
at the outer level.
</para>

<programlisting>
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
</programlisting>

<para>
In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
</para>
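<para>
The difference between the (R) test and the (?R) call can be sketched with a
hand-written Python function in which a boolean argument stands in for the
(R) condition and the recursive call stands in for (?R). This is a swapped-in
technique that assumes matching is anchored at a given position; it is not
GRegex code, and match_angle is a hypothetical name.
</para>

```python
ASCII_DIGITS = "0123456789"


def match_angle(text, pos=0, recursing=False):
    """Return the index just past the bracketed text at pos, or None.

    recursing plays the role of the (R) condition: it is False for the
    outermost brackets and True inside any nested pair.
    """
    if pos >= len(text) or text[pos] != "<":
        return None
    i = pos + 1
    while i < len(text):
        char = text[i]
        if char == ">":
            return i + 1                      # closing bracket of this level
        if char == "<":
            end = match_angle(text, i, True)  # plays the role of (?R)
            if end is None:
                return None
            i = end
        elif recursing and char not in ASCII_DIGITS:
            return None                       # only digits allowed when nested
        else:
            i += 1
    return None


if __name__ == "__main__":
    for sample in ("<abc<123>def>", "<abc<12a>def>", "<<<1>>>"):
        print(sample, match_angle(sample))
```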
</refsect1>

<refsect1>
<title>Subpatterns as subroutines</title>
<para>
If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. The "called" subpattern may
be defined before or after the reference. An earlier example pointed out
that the pattern
</para>

<programlisting>
(sens|respons)e and \1ibility
</programlisting>

<para>
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
</para>

<programlisting>
(sens|respons)e and (?1)ibility
</programlisting>

<para>
is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE above.
</para>
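<para>
The contrast between a back reference and a subroutine call can be
approximated in Python's standard re module, which has back references but no
(?1) calls. In this sketch the call is emulated by repeating the group's text,
an assumption that ignores the atomic behaviour of real subroutine calls;
GROUP, BACKREF and SUBROUTINE_LIKE are hypothetical names.
</para>

```python
import re

# The text of group 1, kept in one place so it can be reused.
GROUP = r"sens|respons"

# \1 must repeat exactly what group 1 matched.
BACKREF = re.compile(rf"({GROUP})e and \1ibility")

# Assumption: (?1) is emulated by matching the group's pattern again,
# so either alternative is acceptable the second time.
SUBROUTINE_LIKE = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")


if __name__ == "__main__":
    sample = "sense and responsibility"
    print(BACKREF.fullmatch(sample) is not None)
    print(SUBROUTINE_LIKE.fullmatch(sample) is not None)
```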
<para>
Like recursive subpatterns, a "subroutine" call is always treated as an atomic
group. That is, once it has matched some of the string, it is never
re-entered, even if it contains untried alternatives and there is a subsequent
matching failure.
</para>

<para>
When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot be
changed for different calls. For example, consider this pattern:
</para>

<programlisting>
(abc)(?i:(?1))
</programlisting>

<para>
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
</para>
</refsect1>

<!--
<refsect1>
<title>Callouts</title>
<para>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.
</para>

<para>
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.
</para>

<para>
Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:
</para>

<programlisting>
(?C1)abc(?C2)def
</programlisting>

<para>
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.
</para>

<para>
During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail
altogether. A complete description of the interface to the callout
function is given in the pcrecallout documentation.
</para>
</refsect1>
-->

<refsect1>
<title>Copyright</title>
<para>
This document was copied and adapted from the PCRE documentation,
specifically from the man page for pcrepattern.
The original copyright note is:
</para>

<programlisting>
Copyright (c) 1997-2006 University of Cambridge.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the name of Google
      Inc. nor the names of their contributors may be used to endorse or
      promote products derived from this software without specific prior
      written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
</programlisting>
</refsect1>

</refentry>