|
Packit |
709fb3 |
Things to do for GNU grep
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Copyright (C) 1992, 1997-2002, 2004-2017 Free Software Foundation, Inc.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Copying and distribution of this file, with or without modification,
|
|
Packit |
709fb3 |
are permitted in any medium without royalty provided the copyright
|
|
Packit |
709fb3 |
notice and this notice are preserved.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
===============
|
|
Packit |
709fb3 |
Short term work
|
|
Packit |
709fb3 |
===============
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
See where we are with UTF-8 performance.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Merge Debian patches that seem relevant.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Go through patches in Savannah.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Fix --directories=read.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Write better Texinfo documentation for grep. The manual page would be a
|
|
Packit |
709fb3 |
good place to start, but Info documents are also supposed to contain a
|
|
Packit |
709fb3 |
tutorial and examples.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Some tests in tests/spencer2.tests should have failed! Need to filter out
|
|
Packit |
709fb3 |
some bugs in dfa.[ch]/regex.[ch].
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Multithreading?
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
GNU grep originally did 32-bit arithmetic. Although it has moved to
|
|
Packit |
709fb3 |
64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
|
|
Packit |
709fb3 |
this conversion has not been entirely systematic and should be checked.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Lazy dynamic linking of libpcre. See Debian’s 03-397262-dlopen-pcre.patch.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Check FreeBSD’s integration of zgrep (-Z) and bzgrep (-J) in one
|
|
Packit |
709fb3 |
binary. Is there a possibility of doing even better by automatically
|
|
Packit |
709fb3 |
checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
|
|
Packit |
709fb3 |
0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
|
|
Packit |
709fb3 |
libpcre is decided, do the same for libz and libbz2.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
===================
|
|
Packit |
709fb3 |
Matching algorithms
|
|
Packit |
709fb3 |
===================
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Take a look at these and consider opportunities for merging or cloning:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
-- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
|
|
Packit |
709fb3 |
-- ja-grep’s mlb2 patch (Japanese grep)
|
|
Packit |
709fb3 |
<http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
|
|
Packit |
709fb3 |
-- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
|
|
Packit |
709fb3 |
<http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
|
|
Packit |
709fb3 |
-- cgrep (Context grep) <https://awgn.github.io/cgrep/>
|
|
Packit |
709fb3 |
seems like nice work;
|
|
Packit |
709fb3 |
-- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
|
|
Packit |
709fb3 |
-- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
|
|
Packit |
709fb3 |
from glimpse;
|
|
Packit |
709fb3 |
-- nr-grep (Nondeterministic reverse grep)
|
|
Packit |
709fb3 |
<https://www.dcc.uchile.cl/~gnavarro/software/>;
|
|
Packit |
709fb3 |
-- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
|
|
Packit |
709fb3 |
-- freegrep <https://github.com/howardjp/freegrep>;
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Check some new algorithms for matching. See, for example, Faro &
|
|
Packit |
709fb3 |
Lecroq (cited in kwset.c).
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Fix the DFA matcher to never use exponential space. (Fortunately, these
|
|
Packit |
709fb3 |
cases are rare.)
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
============================
|
|
Packit |
709fb3 |
Standards: POSIX and Unicode
|
|
Packit |
709fb3 |
============================
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
For POSIX compliance issues, see POSIX 1003.1.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Current support for the POSIX [= =] and [. .] constructs is limited to
|
|
Packit |
709fb3 |
platforms whose regular expression matchers are sufficiently
|
|
Packit |
709fb3 |
compatible with the GNU C library so that the --without-included-regex
|
|
Packit |
709fb3 |
option of ‘configure’ is in effect. Extend this support to non-glibc
|
|
Packit |
709fb3 |
platforms, where --with-included-regex is in effect, by modifying the
|
|
Packit |
709fb3 |
included version of the regex code to defer to the native version when
|
|
Packit |
709fb3 |
handling [= =] and [. .].
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
For Unicode, interesting things to check include the Unicode Standard
|
|
Packit |
709fb3 |
<http://www.unicode.org/standard/standard.html> and the Unicode Technical
|
|
Packit |
709fb3 |
Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
|
|
Packit |
709fb3 |
Expressions”). Talk to Bruno Haible who’s maintaining GNU libunistring.
|
|
Packit |
709fb3 |
See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
|
|
Packit |
709fb3 |
“Unicode Normalization Forms”), already implemented by GNU libunistring.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
In particular, --ignore-case needs to be evaluated against the standards.
|
|
Packit |
709fb3 |
We may want to deviate from POSIX if Unicode provides better or clearer
|
|
Packit |
709fb3 |
semantics.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
POSIX and --ignore-case
|
|
Packit |
709fb3 |
-----------------------
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
For this issue, interesting things to check in POSIX include the
|
|
Packit |
709fb3 |
Open Group Base Specifications, Chapter “Regular Expressions”, in
|
|
Packit |
709fb3 |
particular Section “Regular Expression General Requirements” and its
|
|
Packit |
709fb3 |
paragraph about caseless matching (this may not have been fully
|
|
Packit |
709fb3 |
thought through and that this text may be self-contradicting
|
|
Packit |
709fb3 |
[specifically: “of either data or patterns” versus all the rest]).
|
|
Packit |
709fb3 |
See:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
In particular, consider the following with POSIX’s approach to case
|
|
Packit |
709fb3 |
folding in mind. Assume a non-Turkic locale with a character
|
|
Packit |
709fb3 |
repertoire reduced to the following various forms of “LATIN LETTER I”:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
|
|
Packit |
709fb3 |
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
|
|
Packit |
709fb3 |
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
|
|
Packit |
709fb3 |
LATIN CAPITAL LETTER I DOT;;;0069;
|
|
Packit |
709fb3 |
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
|
|
Packit |
709fb3 |
versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that
|
|
Packit |
709fb3 |
whole UTF-8 strings cannot be case-converted in place, using the same
|
|
Packit |
709fb3 |
memory buffer, and that the needed octet-size of the new buffer cannot
|
|
Packit |
709fb3 |
merely be guessed (although there’s a simple upper bound of five times
|
|
Packit |
709fb3 |
the size of the input, as the longest UTF-8 encoding of any character
|
|
Packit |
709fb3 |
is five bytes).
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
We have
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
lc(I) = i, uc(I) = I
|
|
Packit |
709fb3 |
lc(i) = i, uc(i) = I
|
|
Packit |
709fb3 |
lc(İ) = i, uc(İ) = İ
|
|
Packit |
709fb3 |
lc(ı) = ı, uc(ı) = I
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
where lc() and uc() denote lower-case and upper-case conversions.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
There are several candidate --ignore-case logics. Using the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (lc(input_wchar) == lc(pattern_wchar))
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic leads to the following matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y Y n
|
|
Packit |
709fb3 |
i | Y Y Y n
|
|
Packit |
709fb3 |
İ | Y Y Y n
|
|
Packit |
709fb3 |
ı | n n n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
|
Packit |
709fb3 |
this. Using the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (uc(input_wchar) == uc(pattern_wchar))
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic (which is what GNU grep currently does although this is not
|
|
Packit |
709fb3 |
documented or guaranteed in the future), leads to the following
|
|
Packit |
709fb3 |
matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y n Y
|
|
Packit |
709fb3 |
i | Y Y n Y
|
|
Packit |
709fb3 |
İ | n n Y n
|
|
Packit |
709fb3 |
ı | Y Y n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
|
Packit |
709fb3 |
this.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Using the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (lc(input_wchar) == lc(pattern_wchar)
|
|
Packit |
709fb3 |
|| uc(input_wchar) == uc(pattern_wchar))
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic leads to the following matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y Y Y
|
|
Packit |
709fb3 |
i | Y Y Y Y
|
|
Packit |
709fb3 |
İ | Y Y Y n
|
|
Packit |
709fb3 |
ı | Y Y n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
There is some elegance and symmetry with this. But there are
|
|
Packit |
709fb3 |
potentially two conversions to be made per input character. If the
|
|
Packit |
709fb3 |
pattern is pre-converted, two copies of it need to be kept and used in
|
|
Packit |
709fb3 |
a mutually coherent fashion.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Using the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (input_wchar == pattern_wchar
|
|
Packit |
709fb3 |
|| lc(input_wchar) == pattern_wchar
|
|
Packit |
709fb3 |
|| uc(input_wchar) == pattern_wchar)
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic (a plausible interpretation of POSIX) leads to the following
|
|
Packit |
709fb3 |
matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y n Y
|
|
Packit |
709fb3 |
i | Y Y Y n
|
|
Packit |
709fb3 |
İ | n n Y n
|
|
Packit |
709fb3 |
ı | n n n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
There is a different CAPITAL/SMALL symmetry with this. But there’s
|
|
Packit |
709fb3 |
also a loss of pattern/input symmetry that’s unique to it. Also there
|
|
Packit |
709fb3 |
are potentially two conversions to be made per input character.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Using the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic leads to the following matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y Y Y
|
|
Packit |
709fb3 |
i | Y Y Y Y
|
|
Packit |
709fb3 |
İ | Y Y Y Y
|
|
Packit |
709fb3 |
ı | Y Y Y Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
This shows total symmetry and transitivity (at least in this example
|
|
Packit |
709fb3 |
analysis). There are two conversions to be made per input character,
|
|
Packit |
709fb3 |
but support could be added for having a single straight mapping
|
|
Packit |
709fb3 |
performing a composition of the two conversions.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Any optimization in the implementation of each logic must not change
|
|
Packit |
709fb3 |
its basic semantic.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Unicode and --ignore-case
|
|
Packit |
709fb3 |
-------------------------
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
For this issue, interesting things to check in Unicode include:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
- The Unicode Standard, Chapter 3
|
|
Packit |
709fb3 |
(<http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
|
|
Packit |
709fb3 |
“Conformance”), Section 3.13 (“Default Case Algorithms”) and the
|
|
Packit |
709fb3 |
toCasefold() case conversion operation.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
- The Unicode Standard, Chapter 4
|
|
Packit |
709fb3 |
(<http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
|
|
Packit |
709fb3 |
“Character Properties”), Section 4.2 (“Case”) and
|
|
Packit |
709fb3 |
the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
|
|
Packit |
709fb3 |
SpecialCasing.txt and
|
|
Packit |
709fb3 |
<http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
|
|
Packit |
709fb3 |
CaseFolding.txt files.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
- The Unicode Standard, Chapter 5
|
|
Packit |
709fb3 |
(<http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
|
|
Packit |
709fb3 |
“Implementation Guidelines”), Section 5.18 (“Case Mappings”),
|
|
Packit |
709fb3 |
Subsection “Caseless Matching”.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
- The Unicode case charts <http://www.unicode.org/charts/case/>.
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Unicode uses the
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
logic for caseless matching. Consider the “LATIN LETTER I” example
|
|
Packit |
709fb3 |
mentioned above. In a non-Turkic locale, simple case folding yields
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
toCasefold_simple(U+0049) = U+0069
|
|
Packit |
709fb3 |
toCasefold_simple(U+0069) = U+0069
|
|
Packit |
709fb3 |
toCasefold_simple(U+0130) = U+0130
|
|
Packit |
709fb3 |
toCasefold_simple(U+0131) = U+0131
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
which leads to the following matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y n n
|
|
Packit |
709fb3 |
i | Y Y n n
|
|
Packit |
709fb3 |
İ | n n Y n
|
|
Packit |
709fb3 |
ı | n n n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
This is different from anything so far!
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
In a non-Turkic locale, full case folding yields
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
toCasefold_full(U+0049) = U+0069
|
|
Packit |
709fb3 |
toCasefold_full(U+0069) = U+0069
|
|
Packit |
709fb3 |
toCasefold_full(U+0130) = <U+0069, U+0307>
|
|
Packit |
709fb3 |
toCasefold_full(U+0131) = U+0131
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
with
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
which leads to the following matches:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
\in I i İ ı
|
|
Packit |
709fb3 |
pat\ ----------
|
|
Packit |
709fb3 |
I | Y Y * n
|
|
Packit |
709fb3 |
i | Y Y * n
|
|
Packit |
709fb3 |
İ | n n Y n
|
|
Packit |
709fb3 |
ı | n n n Y
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
This is just sad!
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Having toCasefold(U+0131), simple or full, map to itself instead of
|
|
Packit |
709fb3 |
U+0069 is in contradiction with the rules of Section 5.18 of the
|
|
Packit |
709fb3 |
Unicode Standard since toUpperCase(U+0131) is U+0049. Same thing for
|
|
Packit |
709fb3 |
toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069. The
|
|
Packit |
709fb3 |
justification for the weird toCasefold_full(U+0130) mapping is
|
|
Packit |
709fb3 |
unknown; it doesn’t even make sense to add a dot (U+0307) to a letter
|
|
Packit |
709fb3 |
that already has one (U+0069). It would have been so simple to put
|
|
Packit |
709fb3 |
them all in the same equivalence class!
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Otherwise, also consider the following problem with Unicode’s approach
|
|
Packit |
709fb3 |
on case folding in mind. Assume that we want to perform
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
echo 'AßBC' | grep -i 'Sb'
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
which corresponds to
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
input: U+0041 U+00DF U+0042 U+0043 U+000A
|
|
Packit |
709fb3 |
pattern: U+0053 U+0062
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Following CaseFolding.txt, applying the toCasefold() transformation to
|
|
Packit |
709fb3 |
these yields
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
|
|
Packit |
709fb3 |
pattern: U+0073 U+0062
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
so, according to this approach, the input should match the pattern.
|
|
Packit |
709fb3 |
As long as the original input line is to be reported to the user as a
|
|
Packit |
709fb3 |
whole, there is no problem (from the user’s point-of-view;
|
|
Packit |
709fb3 |
implementation is complicated by this).
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
However, consider both these GNU extensions:
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
echo 'AßBC' | grep -i --only-matching 'Sb'
|
|
Packit |
709fb3 |
echo 'AßBC' | grep -i --color=always 'Sb'
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
What is to be reported in these cases, since the match begins in the
|
|
Packit |
709fb3 |
*middle* of the original input character ‘ß’?
|
|
Packit |
709fb3 |
|
|
Packit |
709fb3 |
Unicode’s toCasefold() cannot be implemented in terms of POSIX’s
|
|
Packit |
709fb3 |
towctrans() since that can only return a single wint_t value per input
|
|
Packit |
709fb3 |
wint_t value.
|