.\" From Henry Spencer's regex package (as found in the apache
.\" distribution). The package carries the following copyright:
.\"
.\"  Copyright 1992, 1993, 1994 Henry Spencer.  All rights reserved.
.\" %%%LICENSE_START(MISC)
.\"  This software is not subject to any license of the American Telephone
.\"  and Telegraph Company or of the Regents of the University of California.
.\"
.\"  Permission is granted to anyone to use this software for any purpose
.\"  on any computer system, and to alter it and redistribute it, subject
.\"  to the following restrictions:
.\"
.\"  1. The author is not responsible for the consequences of use of this
.\"     software, no matter how awful, even if they arise from flaws in it.
.\"
.\"  2. The origin of this software must not be misrepresented, either by
.\"     explicit claim or by omission.  Since few users ever read sources,
.\"     credits must appear in the documentation.
.\"
.\"  3. Altered versions must be plainly marked as such, and must not be
.\"     misrepresented as being the original software.  Since few users
.\"     ever read sources, credits must appear in the documentation.
.\"
.\"  4. This notice may not be removed or altered.
.\" %%%LICENSE_END
.\"
.\" In order to comply with `credits must appear in the documentation'
.\" I added an AUTHOR paragraph below - aeb.
.\"
.\" In the default nroff environment there is no dagger \(dg.
.\"
.\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which
.\" 	appear not to be in the glibc implementation of regcomp
.\"
.ie t .ds dg \(dg
.el .ds dg (!)
.TH REGEX 7 2009-01-12 "" "Linux Programmer's Manual"
.SH NAME
regex \- POSIX.2 regular expressions
.SH DESCRIPTION
Regular expressions ("RE"s),
as defined in POSIX.2, come in two forms:
modern REs (roughly those of
.IR egrep ;
POSIX.2 calls these "extended" REs)
and obsolete REs (roughly those of
.BR ed (1);
POSIX.2 "basic" REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
POSIX.2 leaves some aspects of RE syntax and semantics open;
"\*(dg" marks decisions on these aspects that
may not be fully portable to other POSIX.2 implementations.
.PP
A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR,
separated by \(aq|\(aq.
It matches anything that matches one of the branches.
.PP
A branch is one\*(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second,
and so on.
.PP
A piece is an \fIatom\fR possibly followed
by a single\*(dg \(aq*\(aq, \(aq+\(aq, \(aq?\(aq, or \fIbound\fR.
An atom followed by \(aq*\(aq
matches a sequence of 0 or more matches of the atom.
An atom followed by \(aq+\(aq
matches a sequence of 1 or more matches of the atom.
An atom followed by \(aq?\(aq
matches a sequence of 0 or 1 matches of the atom.
.PP
A \fIbound\fR is \(aq{\(aq followed by an unsigned decimal integer,
possibly followed by \(aq,\(aq
possibly followed by another unsigned decimal integer,
always followed by \(aq}\(aq.
The integers must lie between 0 and
.B RE_DUP_MAX
(255\*(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
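.PP
The three forms of bound can be tried out with a minimal sketch like the
following. It assumes the regcomp(), regexec() and regfree() interface
described in regex(3); the helper name ere_matches is hypothetical and
chosen only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): returns 1 if the
 * extended RE `pattern` matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_EXTENDED selects modern REs; REG_NOSUB skips offset reporting. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

.PP
Because regexec() looks for a match anywhere in the string, a caller
would anchor the pattern to test a whole string: with this sketch,
"^a{2,3}$" should accept "aa" and "aaa" but reject "a" and "aaaa".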
.PP
An atom is a regular expression enclosed in "\fI()\fP"
(matching a match for the regular expression),
an empty set of "\fI()\fP" (matching the null string)\*(dg,
a \fIbracket expression\fR (see below), \(aq.\(aq
(matching any single character), \(aq^\(aq (matching the null string at the
beginning of a line), \(aq$\(aq (matching the null string at the
end of a line), a \(aq\e\(aq followed by one of the characters
"\fI^.[$()|*+?{\e\fP"
(matching that character taken as an ordinary character),
a \(aq\e\(aq followed by any other character\*(dg
(matching that character taken as an ordinary character,
as if the \(aq\e\(aq had not been present\*(dg),
or a single character with no other significance (matching that character).
A \(aq{\(aq followed by a character other than a digit is an ordinary
character, not the beginning of a bound\*(dg.
It is illegal to end an RE with \(aq\e\(aq.
.PP
A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP".
It normally matches any single character from the list (but see below).
If the list begins with \(aq^\(aq,
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by \(aq\-\(aq, this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence;
for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit.
It is illegal\*(dg for two ranges to share an
endpoint, for example, "\fIa\-c\-e\fP".
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal \(aq]\(aq in the list, make it the first character
(following a possible \(aq^\(aq).
To include a literal \(aq\-\(aq, make it the first or last character,
or the second endpoint of a range.
To use a literal \(aq\-\(aq as the first endpoint of a range,
enclose it in "\fI[.\fP" and "\fI.]\fP"
to make it a collating element (see below).
With the exception of these and some combinations using \(aq[\(aq (see next
paragraphs), all other special characters, including \(aq\e\(aq, lose their
special significance within a bracket expression.
.PP
Within a bracket expression, a collating element (a character,
a multicharacter sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multicharacter collating element
can thus match more than one character;
for example, if the collating sequence includes a "ch" collating element,
then the RE "\fI[[.ch.]]*c\fP" matches the first five characters
of "chchcc".
.PP
Within a bracket expression, a collating element enclosed in "\fI[=\fP" and
"\fI=]\fP" is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters
were "\fI[.\fP" and "\fI.]\fP".)
For example, if o and \o'o^' are the members of an equivalence class,
then "\fI[[=o=]]\fP", "\fI[[=\o'o^'=]]\fP",
and "\fI[o\o'o^']\fP" are all synonymous.
An equivalence class may not\*(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in "\fI[:\fP" and "\fI:]\fP" stands for the list
of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.TS
l l l.
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.TE
.RE
.PP
These stand for the character classes defined in
.BR wctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
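.PP
As an illustration of bracket expressions and character classes, the
sketch below builds three small predicates on top of regcomp() and
regexec() from regex(3). All function names are hypothetical, and the
results described assume the default "C" locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile `pattern` as an extended RE and test it against `text`.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
static int ere_test(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Hypothetical predicates, named for this example only. */

/* One or more characters of class "digit". */
int is_all_digits(const char *s)
{
    return ere_test("^[[:digit:]]+$", s);
}

/* A letter or underscore, then letters, digits or underscores. */
int is_identifier(const char *s)
{
    return ere_test("^[[:alpha:]_][[:alnum:]_]*$", s);
}

/* Only ']' and '-': ']' is literal because it comes first in the
 * list, '-' is literal because it comes last. */
int only_brackets_and_dashes(const char *s)
{
    return ere_test("^[]-]+$", s);
}
```

.PP
The last predicate relies on the placement rules for literal \(aq]\(aq
and \(aq\-\(aq given above rather than on any escaping, since \(aq\e\(aq
is an ordinary character inside a bracket expression.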
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" There are two special cases\*(dg of bracket expressions:
.\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match
.\" the null string at the beginning and end of a word respectively.
.\" A word is defined as a sequence of
.\" word characters
.\" which is neither preceded nor followed by
.\" word characters.
.\" A word character is an
.\" .I alnum
.\" character (as defined by
.\" .BR wctype (3))
.\" or an underscore.
.\" This is an extension,
.\" compatible with but not specified by POSIX.2,
.\" and should be used with
.\" caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
"\fIbb*\fP" matches the three middle characters of "abbbc",
"\fI(wee|week)(knights|nights)\fP"
matches all ten characters of "weeknights",
when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression
matches all three characters, and
when "\fI(a*)*\fP" is matched against "bc"
both the whole RE and the parenthesized
subexpression match the null string.
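.PP
The earliest-then-longest rule can be observed through the match offsets
that regexec() reports. The following is a minimal sketch, assuming the
regmatch_t interface described in regex(3); the helper name ere_span is
hypothetical, and only the offsets of the whole match are examined.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: find the overall match of the extended RE
 * `pattern` in `text` and store its byte offsets in *start and *end
 * (the end offset is one past the last matched byte).
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
int ere_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];   /* m[0] describes the whole match */
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 1;
}
```

.PP
With this sketch, "bb*" against "abbbc" should report offsets 1 and 4,
"(wee|week)(knights|nights)" against "weeknights" should report 0 and 10,
and "(a*)*" against "bc" should report an empty match at offset 0.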
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic character that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
for example, \(aqx\(aq becomes "\fI[xX]\fP".
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that, for example, "\fI[x]\fP"
becomes "\fI[xX]\fP" and "\fI[^x]\fP" becomes "\fI[^xX]\fP".
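.PP
The effect on negated bracket expressions is easy to get wrong, so a
short sketch may help. It assumes that case-independent matching is
requested with the REG_ICASE flag of regcomp(), as described in regex(3);
the helper name and its ignore_case parameter are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: test the extended RE `pattern` against `text`,
 * ignoring case when `ignore_case` is nonzero.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
int ere_matches_icase(const char *pattern, const char *text, int ignore_case)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (ignore_case)
        cflags |= REG_ICASE;   /* request case-independent matching */
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

.PP
Under the rule above, "^[^x]$" matches "X" when case matters, but should
stop matching it once case is ignored, because the bracket expression
then behaves like "[^xX]".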
.PP
No particular limit is imposed on the length of REs\*(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete ("basic") regular expressions differ in several respects.
\(aq|\(aq, \(aq+\(aq, and \(aq?\(aq are
ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP",
with \(aq{\(aq and \(aq}\(aq by themselves ordinary characters.
The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP",
with \(aq(\(aq and \(aq)\(aq by themselves ordinary characters.
\(aq^\(aq is an ordinary character except at the beginning of the
RE or\*(dg the beginning of a parenthesized subexpression,
\(aq$\(aq is an ordinary character except at the end of the
RE or\*(dg the end of a parenthesized subexpression,
and \(aq*\(aq is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading \(aq^\(aq).
.PP
Finally, there is one new type of atom, a \fIback reference\fR:
\(aq\e\(aq followed by a nonzero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc".
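.PP
The differences in basic REs, including back references, can be checked
with a sketch like the one below. It assumes that calling regcomp()
without REG_EXTENDED selects basic REs, as described in regex(3); the
helper name bre_matches is hypothetical. Note that each \(aq\e\(aq of the
RE has to be doubled inside a C string literal.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the basic (obsolete) RE `pattern`
 * matches somewhere in `text`, 0 if it does not, and -1 if the pattern
 * fails to compile. */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)   /* no REG_EXTENDED: basic RE */
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

.PP
Called as bre_matches("^\e\e([bc]\e\e)\e\e1$", "bb"), the sketch should
report a match, and likewise for "cc", but not for "bc".
In the same way, the basic RE "^a+$" should match only the literal
string "a+", since \(aq+\(aq is an ordinary character here.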
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current POSIX.2 spec says that \(aq)\(aq is an ordinary character in
the absence of an unmatched \(aq(\(aq;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
"\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?).
Avoid using them.
.PP
POSIX.2's specification of case-independent matching is vague.
The "one case implies all cases" definition given above
is the current consensus among implementors as to the right interpretation.
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" The syntax for word boundaries is incredibly ugly.
.SH AUTHOR
.\" Sigh... The page license means we must have the author's name
.\" in the formatted output.
This page was taken from Henry Spencer's regex package.
.SH SEE ALSO
.BR grep (1),
.BR regex (3)
.PP
POSIX.2, section 2.8 (Regular Expression Notation).
.SH COLOPHON
This page is part of release 4.15 of the Linux
.I man-pages
project.
A description of the project,
information about reporting bugs,
and the latest version of this page,
can be found at
\%https://www.kernel.org/doc/man\-pages/.