.\" From Henry Spencer's regex package (as found in the apache
.\" distribution). The package carries the following copyright:
.\"
.\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved.
.\" %%%LICENSE_START(MISC)
.\" This software is not subject to any license of the American Telephone
.\" and Telegraph Company or of the Regents of the University of California.
.\"
.\" Permission is granted to anyone to use this software for any purpose
.\" on any computer system, and to alter it and redistribute it, subject
.\" to the following restrictions:
.\"
.\" 1. The author is not responsible for the consequences of use of this
.\" software, no matter how awful, even if they arise from flaws in it.
.\"
.\" 2. The origin of this software must not be misrepresented, either by
.\" explicit claim or by omission. Since few users ever read sources,
.\" credits must appear in the documentation.
.\"
.\" 3. Altered versions must be plainly marked as such, and must not be
.\" misrepresented as being the original software. Since few users
.\" ever read sources, credits must appear in the documentation.
.\"
.\" 4. This notice may not be removed or altered.
.\" %%%LICENSE_END
.\"
.\" In order to comply with `credits must appear in the documentation'
.\" I added an AUTHOR paragraph below - aeb.
.\"
.\" In the default nroff environment there is no dagger \(dg.
.\"
.\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which
.\" appear not to be in the glibc implementation of regcomp
.\"
.ie t .ds dg \(dg
.el .ds dg (!)
.TH REGEX 7 2009-01-12 "" "Linux Programmer's Manual"
.SH NAME
regex \- POSIX.2 regular expressions
.SH DESCRIPTION
Regular expressions ("RE"s),
as defined in POSIX.2, come in two forms:
modern REs (roughly those of
.IR egrep ;
POSIX.2 calls these "extended" REs)
and obsolete REs (roughly those of
.BR ed (1);
POSIX.2 "basic" REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
POSIX.2 leaves some aspects of RE syntax and semantics open;
"\*(dg" marks decisions on these aspects that
may not be fully portable to other POSIX.2 implementations.
.PP
A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR,
separated by \(aq|\(aq.
It matches anything that matches one of the branches.
.PP
A branch is one\*(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second,
and so on.
.PP
A piece is an \fIatom\fR possibly followed
by a single\*(dg \(aq*\(aq, \(aq+\(aq, \(aq?\(aq, or \fIbound\fR.
An atom followed by \(aq*\(aq
matches a sequence of 0 or more matches of the atom.
An atom followed by \(aq+\(aq
matches a sequence of 1 or more matches of the atom.
An atom followed by \(aq?\(aq
matches a sequence of 0 or 1 matches of the atom.
.PP
A \fIbound\fR is \(aq{\(aq followed by an unsigned decimal integer,
possibly followed by \(aq,\(aq
possibly followed by another unsigned decimal integer,
always followed by \(aq}\(aq.
The integers must lie between 0 and
.B RE_DUP_MAX
(255\*(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
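.PP
As an illustration of the bound rules above, the following C sketch uses
.BR regcomp (3)
and
.BR regexec (3)
to report where a pattern matches.
The helper name
.I re_span
is hypothetical and is introduced here only for the example;
the behavior noted afterward assumes an implementation that follows
the rules described on this page.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): compile `pattern`
 * with `cflags`, search `text`, and store the offsets of the overall
 * match in *start and *end (end is one past the last matched byte).
 * Returns 1 on a match, 0 if regexec() reports no match (or fails),
 * and -1 if the pattern does not compile. */
int re_span(const char *pattern, int cflags, const char *text,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;              /* e.g. a malformed bound */

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}
```

.PP
With this helper, matching "a{2,3}" (compiled with
.BR REG_EXTENDED )
against "caaaab" should yield offsets 1 and 4:
the match starts at the first \(aqa\(aq and takes three of the four.
The same call with "a{3,2}" should return \-1,
because the first integer exceeds the second.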
.PP
An atom is a regular expression enclosed in "\fI()\fP"
(matching a match for the regular expression),
an empty set of "\fI()\fP" (matching the null string)\*(dg,
a \fIbracket expression\fR (see below), \(aq.\(aq
(matching any single character), \(aq^\(aq (matching the null string at the
beginning of a line), \(aq$\(aq (matching the null string at the
end of a line), a \(aq\e\(aq followed by one of the characters
"\fI^.[$()|*+?{\e\fP"
(matching that character taken as an ordinary character),
a \(aq\e\(aq followed by any other character\*(dg
(matching that character taken as an ordinary character,
as if the \(aq\e\(aq had not been present\*(dg),
or a single character with no other significance (matching that character).
A \(aq{\(aq followed by a character other than a digit is an ordinary
character, not the beginning of a bound\*(dg.
It is illegal to end an RE with \(aq\e\(aq.
.PP
A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP".
It normally matches any single character from the list (but see below).
If the list begins with \(aq^\(aq,
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by \(aq\-\(aq, this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit.
It is illegal\*(dg for two ranges to share an
endpoint, for example, "\fIa\-c\-e\fP".
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal \(aq]\(aq in the list, make it the first character
(following a possible \(aq^\(aq).
To include a literal \(aq\-\(aq, make it the first or last character,
or the second endpoint of a range.
To use a literal \(aq\-\(aq as the first endpoint of a range,
enclose it in "\fI[.\fP" and "\fI.]\fP"
to make it a collating element (see below).
With the exception of these and some combinations using \(aq[\(aq (see next
paragraphs), all other special characters, including \(aq\e\(aq, lose their
special significance within a bracket expression.
.PP
Within a bracket expression, a collating element (a character,
a multicharacter sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multicharacter collating element
can thus match more than one character,
for example, if the collating sequence includes a "ch" collating element,
then the RE "\fI[[.ch.]]*c\fP" matches the first five characters
of "chchcc".
.PP
Within a bracket expression, a collating element enclosed in "\fI[=\fP" and
"\fI=]\fP" is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters
were "\fI[.\fP" and "\fI.]\fP".)
For example, if o and \o'o^' are the members of an equivalence class,
then "\fI[[=o=]]\fP", "\fI[[=\o'o^'=]]\fP",
and "\fI[o\o'o^']\fP" are all synonymous.
An equivalence class may not\*(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in "\fI[:\fP" and "\fI:]\fP" stands for the list
of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.TS
l l l.
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.TE
.RE
.PP
These stand for the character classes defined in
.BR wctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
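.PP
The character class notation above can be exercised with a short C sketch.
The helper
.I re_span
is a hypothetical name used only for this example,
and the offsets mentioned afterward assume the default "C" locale.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): search `text` for
 * `pattern` and store the offsets of the overall match.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
int re_span(const char *pattern, int cflags, const char *text,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;   /* offset of first matched byte */
    *end = (int) m[0].rm_eo;     /* offset one past the last byte */
    return 1;
}
```

.PP
Note that the class name and its "[:" and ":]" delimiters sit inside
an enclosing bracket expression, giving the doubled brackets.
For instance, "[[:digit:]]+" (with
.BR REG_EXTENDED )
matched against "abc123def" should report offsets 3 and 6,
and "[[:upper:]][[:lower:]]*" matched against "say Hello"
should report offsets 4 and 9.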
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" There are two special cases\*(dg of bracket expressions:
.\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match
.\" the null string at the beginning and end of a word respectively.
.\" A word is defined as a sequence of
.\" word characters
.\" which is neither preceded nor followed by
.\" word characters.
.\" A word character is an
.\" .I alnum
.\" character (as defined by
.\" .BR wctype (3))
.\" or an underscore.
.\" This is an extension,
.\" compatible with but not specified by POSIX.2,
.\" and should be used with
.\" caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
"\fIbb*\fP" matches the three middle characters of "abbbc",
"\fI(wee|week)(knights|nights)\fP"
matches all ten characters of "weeknights",
when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression
matches all three characters, and
when "\fI(a*)*\fP" is matched against "bc"
both the whole RE and the parenthesized
subexpression match the null string.
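.PP
The leftmost-longest rule for the overall match can be observed with
.BR regexec (3).
The following C sketch is illustrative only;
.I re_span
is a hypothetical helper name, and it reports just the span of the
whole match, not of the parenthesized subexpressions.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): search `text` for
 * `pattern` and store the offsets of the overall match in *start and
 * *end.  Returns 1 on a match, 0 on no match, -1 on a compile error. */
int re_span(const char *pattern, int cflags, const char *text,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];    /* m[0] describes the whole match */
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}
```

.PP
Using
.BR REG_EXTENDED ,
"bb*" against "abbbc" should give offsets 1 and 4,
"(wee|week)(knights|nights)" against "weeknights" should give 0 and 10,
and "(a*)*" against "bc" should succeed with the empty span 0 and 0.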
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic character that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
for example, \(aqx\(aq becomes "\fI[xX]\fP".
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that, for example, "\fI[x]\fP"
becomes "\fI[xX]\fP" and "\fI[^x]\fP" becomes "\fI[^xX]\fP".
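.PP
In the
.BR regcomp (3)
interface, case-independent matching is requested with the
.B REG_ICASE
flag.
The following C sketch shows the effect on a negated bracket expression;
.I re_span
is a hypothetical helper name introduced for the example.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): search `text` for
 * `pattern` and store the offsets of the overall match.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
int re_span(const char *pattern, int cflags, const char *text,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    /* cflags may include REG_ICASE to ignore case distinctions */
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}
```

.PP
Against the string "Xxa", the RE "[^x]" compiled with
.B REG_EXTENDED
alone should match the \(aqX\(aq at offsets 0 and 1,
whereas with
.B REG_EXTENDED | REG_ICASE
it behaves like "[^xX]" and should match the \(aqa\(aq at offsets 2 and 3.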
.PP
No particular limit is imposed on the length of REs\*(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete ("basic") regular expressions differ in several respects.
\(aq|\(aq, \(aq+\(aq, and \(aq?\(aq are
ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP",
with \(aq{\(aq and \(aq}\(aq by themselves ordinary characters.
The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP",
with \(aq(\(aq and \(aq)\(aq by themselves ordinary characters.
\(aq^\(aq is an ordinary character except at the beginning of the
RE or\*(dg the beginning of a parenthesized subexpression,
\(aq$\(aq is an ordinary character except at the end of the
RE or\*(dg the end of a parenthesized subexpression,
and \(aq*\(aq is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading \(aq^\(aq).
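.PP
With
.BR regcomp (3),
an RE is treated as obsolete ("basic") when
.B REG_EXTENDED
is not set.
The following C sketch illustrates a few of the differences listed above;
.I re_span
is a hypothetical helper name,
and the doubled backslashes are C string escapes for a single backslash
in the RE.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): search `text` for
 * `pattern` and store the offsets of the overall match.
 * Passing cflags without REG_EXTENDED selects basic REs.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
int re_span(const char *pattern, int cflags, const char *text,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}
```

.PP
Compiled with a flags value of 0, the RE "a+" should match only the
literal two-character sequence, giving offsets 1 and 3 in "aa+b";
"(a)" should match the literal parentheses in "x(a)y" at offsets 1 and 4;
and the bound in the C string "a\\\\{2\\\\}" should match "aa" in "caab"
at offsets 1 and 3.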
.PP
Finally, there is one new type of atom, a \fIback reference\fR:
\(aq\e\(aq followed by a nonzero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc".
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current POSIX.2 spec says that \(aq)\(aq is an ordinary character in
the absence of an unmatched \(aq(\(aq;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
"\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?).
Avoid using them.
.PP
POSIX.2's specification of case-independent matching is vague.
The "one case implies all cases" definition given above
is the current consensus among implementors as to the right interpretation.
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" The syntax for word boundaries is incredibly ugly.
.SH AUTHOR
.\" Sigh... The page license means we must have the author's name
.\" in the formatted output.
This page was taken from Henry Spencer's regex package.
.SH SEE ALSO
.BR grep (1),
.BR regex (3)
.PP
POSIX.2, section 2.8 (Regular Expression Notation).
.SH COLOPHON
This page is part of release 4.15 of the Linux
.I man-pages
project.
A description of the project,
information about reporting bugs,
and the latest version of this page,
can be found at
\%https://www.kernel.org/doc/man\-pages/.