Tree - source-git/enca - CentOS Git server

source-git / enca

Files

Blob Blame History Raw
#============================================================================
# Enca v1.19 (2016-09-05)  guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti) <yeti@physics.muni.cz>
# Copyright (C) 2009-2016 Michal Cihar <michal@cihar.com>
#============================================================================

TO THE NEXT RELEASE:
(this list must be empty at the time of release)

IN FUTURE:
(should be done, but maybe not right now)

* LCUC check for cyrillic charsets.
* Backups -- like cp, mv, etc.  This will be hard to get right with all the
  silly converters.
* More tests
* Structured documentation (the manual page is ugly)
  - keep a reasonably brief manual page
  - put all the boring doc stuff somewhere else, there are possibilities:
    info: searchable, has links, partly portable, has console viewers
    HTML: poorly searchable, has links, most portable, has console viewers
    TeX (ps): not searchable, no links, portable, most pleasant to read,
          no console viewers
    => use SGML (or info itself?) and generate the others


MAYBE SOMEDAY:
(when I will have mood for it, items are freely moved here and removed again)

* Detect all-caps texts OK.
  After several experiments it seems we have to
  - use pair occurences, at least, with specificaly computed
    difference-maximising weights
  - guess in two steps
  - first with uncapitalization and pair weights, and check whether the
    sample looks like natural text (garbageness test, but better)
  - if the first approach fails, do it as we do it now
* design better levels of verbosity/warnings (or: remove the --verbose option,
  keep important messages and remove all others?)
  0: only messages followed by exit(EXIT_FAILURE) (or abort()) are printed
     plus `cannot convert...'
  1: all nonfatal errors/warnings
  2: what converters are tried, what language gets detected (do not duplicate
     --details)
  >2: debug
* _real_ paranoiac behaviour assuring that nothing gets lost and that
  conversion output is either correctly converted text or untouched original
  (requires major redesign of all the conversion stuff)


NEVER:
(you can do anything GNU GPL v2 allows, but I'll restrain)

* features that nobody needs (mm, well, ... ok, let it be)
* duplicate other tools functionality more than necessary, use them instead
* dependency on anything that is not ISO C and/or POSIX (moreover do not use
  braindead features of both); important functionallity must be present
  everywhere nevertheless, enca can be smaller, faster or cleverer on some
  (GNU) systems
* localization; please correct my english instead ;->
* converter calling generalization (would require inlcuding the whole wordexp
  thing in enca, and: launching external converter is Bad Thing(TM) anyway)
* data in run-time files (needs parser (could live with) and disallows hooks
  (can't live without))
* loadable module support (it's not very portable)
-------------


KNOWN ISO C CONFLICTS:
(perhaps to be solved someday)

All constants and typedefs.  They start with ENCA_ and Enca, but:

  Names beginning with a capital `E' followed a digit or uppercase
  letter may be used for additional error code names.               [errno.h]

And additionally inside libenca (i.e. not so serious):
* libenca.h: #define EPSILON                                        [errno.h]
* filters.c: isvbox[]                                               [ctype.h]
* guess.c: #define isbinary                                         [ctype.h]
* guess.c: #define istext                                           [ctype.h]
* multibyte.c: is_valid_utf7()                                      [ctype.h]
* multibyte.c: is_valid_utf8()                                      [ctype.h]

Some probably can't conflict.
source-git / enca

Source Code

Files