Blame TODO

Packit 709fb3
Things to do for GNU grep
Packit 709fb3
Packit 709fb3
  Copyright (C) 1992, 1997-2002, 2004-2017 Free Software Foundation, Inc.
Packit 709fb3
Packit 709fb3
  Copying and distribution of this file, with or without modification,
Packit 709fb3
  are permitted in any medium without royalty provided the copyright
Packit 709fb3
  notice and this notice are preserved.
Packit 709fb3
Packit 709fb3
===============
Packit 709fb3
Short term work
Packit 709fb3
===============
Packit 709fb3
Packit 709fb3
See where we are with UTF-8 performance.
Packit 709fb3
Packit 709fb3
Merge Debian patches that seem relevant.
Packit 709fb3
Packit 709fb3
Go through patches in Savannah.
Packit 709fb3
Packit 709fb3
Fix --directories=read.
Packit 709fb3
Packit 709fb3
Write better Texinfo documentation for grep.  The manual page would be a
Packit 709fb3
good place to start, but Info documents are also supposed to contain a
Packit 709fb3
tutorial and examples.
Packit 709fb3
Packit 709fb3
Some tests in tests/spencer2.tests should have failed!  Need to filter out
Packit 709fb3
some bugs in dfa.[ch]/regex.[ch].
Packit 709fb3
Packit 709fb3
Multithreading?
Packit 709fb3
Packit 709fb3
GNU grep originally did 32-bit arithmetic.  Although it has moved to
Packit 709fb3
64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
Packit 709fb3
this conversion has not been entirely systematic and should be checked.
Packit 709fb3
Packit 709fb3
Lazy dynamic linking of libpcre.  See Debian’s 03-397262-dlopen-pcre.patch.
Packit 709fb3
Packit 709fb3
Check FreeBSD’s integration of zgrep (-Z) and bzgrep (-J) in one
Packit 709fb3
binary.  Is there a possibility of doing even better by automatically
Packit 709fb3
checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
Packit 709fb3
0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?  Once what to do with
Packit 709fb3
libpcre is decided, do the same for libz and libbz2.
Packit 709fb3
Packit 709fb3

Packit 709fb3
===================
Packit 709fb3
Matching algorithms
Packit 709fb3
===================
Packit 709fb3
Packit 709fb3
Take a look at these and consider opportunities for merging or cloning:
Packit 709fb3
Packit 709fb3
   -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
Packit 709fb3
   -- ja-grep’s mlb2 patch (Japanese grep)
Packit 709fb3
      <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
Packit 709fb3
   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
Packit 709fb3
      <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
Packit 709fb3
   -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
Packit 709fb3
      seems like nice work;
Packit 709fb3
   -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
Packit 709fb3
   -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
Packit 709fb3
      from glimpse;
Packit 709fb3
   -- nr-grep (Nondeterministic reverse grep)
Packit 709fb3
      <https://www.dcc.uchile.cl/~gnavarro/software/>;
Packit 709fb3
   -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
Packit 709fb3
   -- freegrep <https://github.com/howardjp/freegrep>;
Packit 709fb3
Packit 709fb3
Check some new algorithms for matching.  See, for example, Faro &
Packit 709fb3
Lecroq (cited in kwset.c).
Packit 709fb3
Packit 709fb3
Fix the DFA matcher to never use exponential space.  (Fortunately, these
Packit 709fb3
cases are rare.)
Packit 709fb3
Packit 709fb3

Packit 709fb3
============================
Packit 709fb3
Standards: POSIX and Unicode
Packit 709fb3
============================
Packit 709fb3
Packit 709fb3
For POSIX compliance issues, see POSIX 1003.1.
Packit 709fb3
Packit 709fb3
Current support for the POSIX [= =] and [. .] constructs is limited to
Packit 709fb3
platforms whose regular expression matchers are sufficiently
Packit 709fb3
compatible with the GNU C library so that the --without-included-regex
Packit 709fb3
option of ‘configure’ is in effect.  Extend this support to non-glibc
Packit 709fb3
platforms, where --with-included-regex is in effect, by modifying the
Packit 709fb3
included version of the regex code to defer to the native version when
Packit 709fb3
handling [= =] and [. .].
Packit 709fb3
Packit 709fb3
For Unicode, interesting things to check include the Unicode Standard
Packit 709fb3
<http://www.unicode.org/standard/standard.html> and the Unicode Technical
Packit 709fb3
Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
Packit 709fb3
Expressions”).  Talk to Bruno Haible who’s maintaining GNU libunistring.
Packit 709fb3
See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
Packit 709fb3
“Unicode Normalization Forms”), already implemented by GNU libunistring.
Packit 709fb3
Packit 709fb3
In particular, --ignore-case needs to be evaluated against the standards.
Packit 709fb3
We may want to deviate from POSIX if Unicode provides better or clearer
Packit 709fb3
semantics.
Packit 709fb3
Packit 709fb3
POSIX and --ignore-case
Packit 709fb3
-----------------------
Packit 709fb3
Packit 709fb3
For this issue, interesting things to check in POSIX include the
Packit 709fb3
Open Group Base Specifications, Chapter “Regular Expressions”, in
Packit 709fb3
particular Section “Regular Expression General Requirements” and its
Packit 709fb3
paragraph about caseless matching (this may not have been fully
Packit 709fb3
thought through and that this text may be self-contradicting
Packit 709fb3
[specifically: “of either data or patterns” versus all the rest]).
Packit 709fb3
See:
Packit 709fb3
Packit 709fb3
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
Packit 709fb3
Packit 709fb3
In particular, consider the following with POSIX’s approach to case
Packit 709fb3
folding in mind.  Assume a non-Turkic locale with a character
Packit 709fb3
repertoire reduced to the following various forms of “LATIN LETTER I”:
Packit 709fb3
Packit 709fb3
  0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
Packit 709fb3
  0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
Packit 709fb3
  0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
Packit 709fb3
    LATIN CAPITAL LETTER I DOT;;;0069;
Packit 709fb3
  0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
Packit 709fb3
Packit 709fb3
UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
Packit 709fb3
versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1).  This implies that
Packit 709fb3
whole UTF-8 strings cannot be case-converted in place, using the same
Packit 709fb3
memory buffer, and that the needed octet-size of the new buffer cannot
Packit 709fb3
merely be guessed (although there’s a simple upper bound of five times
Packit 709fb3
the size of the input, as the longest UTF-8 encoding of any character
Packit 709fb3
is five bytes).
Packit 709fb3
Packit 709fb3
We have
Packit 709fb3
Packit 709fb3
  lc(I) = i, uc(I) = I
Packit 709fb3
  lc(i) = i, uc(i) = I
Packit 709fb3
  lc(İ) = i, uc(İ) = İ
Packit 709fb3
  lc(ı) = ı, uc(ı) = I
Packit 709fb3
Packit 709fb3
where lc() and uc() denote lower-case and upper-case conversions.
Packit 709fb3
Packit 709fb3
There are several candidate --ignore-case logics.  Using the
Packit 709fb3
Packit 709fb3
  if (lc(input_wchar) == lc(pattern_wchar))
Packit 709fb3
Packit 709fb3
logic leads to the following matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  Y  n
Packit 709fb3
   i  |  Y  Y  Y  n
Packit 709fb3
   İ  |  Y  Y  Y  n
Packit 709fb3
   ı  |  n  n  n  Y
Packit 709fb3
Packit 709fb3
There is a lack of symmetry between CAPITAL and SMALL LETTERs with
Packit 709fb3
this.  Using the
Packit 709fb3
Packit 709fb3
  if (uc(input_wchar) == uc(pattern_wchar))
Packit 709fb3
Packit 709fb3
logic (which is what GNU grep currently does although this is not
Packit 709fb3
documented or guaranteed in the future), leads to the following
Packit 709fb3
matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  n  Y
Packit 709fb3
   i  |  Y  Y  n  Y
Packit 709fb3
   İ  |  n  n  Y  n
Packit 709fb3
   ı  |  Y  Y  n  Y
Packit 709fb3
Packit 709fb3
There is a lack of symmetry between CAPITAL and SMALL LETTERs with
Packit 709fb3
this.
Packit 709fb3
Packit 709fb3
Using the
Packit 709fb3
Packit 709fb3
  if (lc(input_wchar) == lc(pattern_wchar)
Packit 709fb3
      || uc(input_wchar) == uc(pattern_wchar))
Packit 709fb3
Packit 709fb3
logic leads to the following matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  Y  Y
Packit 709fb3
   i  |  Y  Y  Y  Y
Packit 709fb3
   İ  |  Y  Y  Y  n
Packit 709fb3
   ı  |  Y  Y  n  Y
Packit 709fb3
Packit 709fb3
There is some elegance and symmetry with this.  But there are
Packit 709fb3
potentially two conversions to be made per input character.  If the
Packit 709fb3
pattern is pre-converted, two copies of it need to be kept and used in
Packit 709fb3
a mutually coherent fashion.
Packit 709fb3
Packit 709fb3
Using the
Packit 709fb3
Packit 709fb3
  if (input_wchar  == pattern_wchar
Packit 709fb3
      || lc(input_wchar) == pattern_wchar
Packit 709fb3
      || uc(input_wchar) == pattern_wchar)
Packit 709fb3
Packit 709fb3
logic (a plausible interpretation of POSIX) leads to the following
Packit 709fb3
matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  n  Y
Packit 709fb3
   i  |  Y  Y  Y  n
Packit 709fb3
   İ  |  n  n  Y  n
Packit 709fb3
   ı  |  n  n  n  Y
Packit 709fb3
Packit 709fb3
There is a different CAPITAL/SMALL symmetry with this.  But there’s
Packit 709fb3
also a loss of pattern/input symmetry that’s unique to it.  Also there
Packit 709fb3
are potentially two conversions to be made per input character.
Packit 709fb3
Packit 709fb3
Using the
Packit 709fb3
Packit 709fb3
  if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
Packit 709fb3
Packit 709fb3
logic leads to the following matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  Y  Y
Packit 709fb3
   i  |  Y  Y  Y  Y
Packit 709fb3
   İ  |  Y  Y  Y  Y
Packit 709fb3
   ı  |  Y  Y  Y  Y
Packit 709fb3
Packit 709fb3
This shows total symmetry and transitivity (at least in this example
Packit 709fb3
analysis).  There are two conversions to be made per input character,
Packit 709fb3
but support could be added for having a single straight mapping
Packit 709fb3
performing a composition of the two conversions.
Packit 709fb3
Packit 709fb3
Any optimization in the implementation of each logic must not change
Packit 709fb3
its basic semantic.
Packit 709fb3
Packit 709fb3
Packit 709fb3
Unicode and --ignore-case
Packit 709fb3
-------------------------
Packit 709fb3
Packit 709fb3
For this issue, interesting things to check in Unicode include:
Packit 709fb3
Packit 709fb3
  - The Unicode Standard, Chapter 3
Packit 709fb3
    (<http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
Packit 709fb3
    “Conformance”), Section 3.13 (“Default Case Algorithms”) and the
Packit 709fb3
    toCasefold() case conversion operation.
Packit 709fb3
Packit 709fb3
  - The Unicode Standard, Chapter 4
Packit 709fb3
    (<http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
Packit 709fb3
    “Character Properties”), Section 4.2 (“Case”) and
Packit 709fb3
    the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
Packit 709fb3
    SpecialCasing.txt and
Packit 709fb3
    <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
Packit 709fb3
    CaseFolding.txt files.
Packit 709fb3
Packit 709fb3
  - The Unicode Standard, Chapter 5
Packit 709fb3
    (<http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
Packit 709fb3
    “Implementation Guidelines”), Section 5.18 (“Case Mappings”),
Packit 709fb3
    Subsection “Caseless Matching”.
Packit 709fb3
Packit 709fb3
  - The Unicode case charts <http://www.unicode.org/charts/case/>.
Packit 709fb3
Packit 709fb3
Unicode uses the
Packit 709fb3
Packit 709fb3
  if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
Packit 709fb3
Packit 709fb3
logic for caseless matching.  Consider the “LATIN LETTER I” example
Packit 709fb3
mentioned above.  In a non-Turkic locale, simple case folding yields
Packit 709fb3
Packit 709fb3
  toCasefold_simple(U+0049) = U+0069
Packit 709fb3
  toCasefold_simple(U+0069) = U+0069
Packit 709fb3
  toCasefold_simple(U+0130) = U+0130
Packit 709fb3
  toCasefold_simple(U+0131) = U+0131
Packit 709fb3
Packit 709fb3
which leads to the following matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  n  n
Packit 709fb3
   i  |  Y  Y  n  n
Packit 709fb3
   İ  |  n  n  Y  n
Packit 709fb3
   ı  |  n  n  n  Y
Packit 709fb3
Packit 709fb3
This is different from anything so far!
Packit 709fb3
Packit 709fb3
In a non-Turkic locale, full case folding yields
Packit 709fb3
Packit 709fb3
  toCasefold_full(U+0049) = U+0069
Packit 709fb3
  toCasefold_full(U+0069) = U+0069
Packit 709fb3
  toCasefold_full(U+0130) = <U+0069, U+0307>
Packit 709fb3
  toCasefold_full(U+0131) = U+0131
Packit 709fb3
Packit 709fb3
with
Packit 709fb3
Packit 709fb3
  0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
Packit 709fb3
Packit 709fb3
which leads to the following matches:
Packit 709fb3
Packit 709fb3
    \in  I  i  İ  ı
Packit 709fb3
  pat\   ----------
Packit 709fb3
   I  |  Y  Y  *  n
Packit 709fb3
   i  |  Y  Y  *  n
Packit 709fb3
   İ  |  n  n  Y  n
Packit 709fb3
   ı  |  n  n  n  Y
Packit 709fb3
Packit 709fb3
This is just sad!
Packit 709fb3
Packit 709fb3
Having toCasefold(U+0131), simple or full, map to itself instead of
Packit 709fb3
U+0069 is in contradiction with the rules of Section 5.18 of the
Packit 709fb3
Unicode Standard since toUpperCase(U+0131) is U+0049.  Same thing for
Packit 709fb3
toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069.  The
Packit 709fb3
justification for the weird toCasefold_full(U+0130) mapping is
Packit 709fb3
unknown; it doesn’t even make sense to add a dot (U+0307) to a letter
Packit 709fb3
that already has one (U+0069).  It would have been so simple to put
Packit 709fb3
them all in the same equivalence class!
Packit 709fb3
Packit 709fb3
Otherwise, also consider the following problem with Unicode’s approach
Packit 709fb3
on case folding in mind.  Assume that we want to perform
Packit 709fb3
Packit 709fb3
  echo 'AßBC' | grep -i 'Sb'
Packit 709fb3
Packit 709fb3
which corresponds to
Packit 709fb3
Packit 709fb3
  input:    U+0041 U+00DF U+0042 U+0043 U+000A
Packit 709fb3
  pattern:  U+0053 U+0062
Packit 709fb3
Packit 709fb3
Following CaseFolding.txt, applying the toCasefold() transformation to
Packit 709fb3
these yields
Packit 709fb3
Packit 709fb3
  input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
Packit 709fb3
  pattern:                U+0073 U+0062
Packit 709fb3
Packit 709fb3
so, according to this approach, the input should match the pattern.
Packit 709fb3
As long as the original input line is to be reported to the user as a
Packit 709fb3
whole, there is no problem (from the user’s point-of-view;
Packit 709fb3
implementation is complicated by this).
Packit 709fb3
Packit 709fb3
However, consider both these GNU extensions:
Packit 709fb3
Packit 709fb3
  echo 'AßBC' | grep -i --only-matching 'Sb'
Packit 709fb3
  echo 'AßBC' | grep -i --color=always  'Sb'
Packit 709fb3
Packit 709fb3
What is to be reported in these cases, since the match begins in the
Packit 709fb3
*middle* of the original input character ‘ß’?
Packit 709fb3
Packit 709fb3
Unicode’s toCasefold() cannot be implemented in terms of POSIX’s
Packit 709fb3
towctrans() since that can only return a single wint_t value per input
Packit 709fb3
wint_t value.