#============================================================================
# Enca v1.19 (2016-09-05)  guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti) <yeti@physics.muni.cz>
# Copyright (C) 2009-2016 Michal Cihar <michal@cihar.com>
#============================================================================

Frequently Asked Questions about Enca

* Q: What obscure encoding do the FAQ, THANKS, and other files use?
* A: Run Enca on them and you'll see.

* Q: How do I specify input encoding?
* A: You can't.  If you know the encoding, you don't need Enca.  Use a
     fully-fledged converter instead.
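
     As a rough illustration of what converting with a known input encoding
     looks like, here is a minimal Python sketch.  The helper name convert()
     and the sample bytes are made up for this example; any real converter
     (iconv, recode) does the same job on files.

```python
def convert(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode DATA from a known SOURCE encoding to TARGET."""
    # No guessing is involved: the caller states the source encoding.
    return data.decode(source).encode(target)


# Three bytes that spell "ěčř" in ISO-8859-2 (sample input for illustration).
latin2 = b"\xec\xe8\xf8"
utf8 = convert(latin2, "iso-8859-2")
print(utf8.decode("utf-8"))
```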

* Q: Why can't Enca detect both language and encoding?
* A: Because this is impossible.  Well, it's possible for natural, long
     enough texts.  But Enca can also detect the encoding of nonsense, and
     people are used to that.  No program can tell you that “ěčřýáíéú” is
     both Czech and e.g. ISO-8859-2.  Incidentally, interpreting the same
     bytes as Win-1251 one gets “мишэбнйъ”, which is equally good Russian
     nonsense.
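
     The ambiguity is easy to reproduce.  The following Python sketch
     encodes the Czech sample as ISO-8859-2 and reads the very same bytes
     back as Windows-1251 ("cp1251" in Python's codec naming).  The function
     name reinterpret() is hypothetical; it is not part of Enca.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode TEXT in one charset and decode the same bytes in another."""
    raw = text.encode(written_as)
    return raw.decode(read_as)


czech = "ěčřýáíéú"
# The byte string is identical in both readings; only the label differs.
print(czech.encode("iso-8859-2").hex())
print(reinterpret(czech, "iso-8859-2", "cp1251"))  # prints мишэбнйъ
```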

* Q: Why can't Enca recognise the encoding of all-uppercase texts?
* A: Mostly again because there's a trade-off between nonsense detection and
     low-probability natural text detection.  But it's possible to detect
     both; I'm working on that.

* Q: Why does Enca need LC_CTYPE to be set?
* A: It doesn't.  But without it you need to use “-L language-code”.
     You can put that option into the ENCAOPT environment variable, if you
     want to make your life easier.  (Newer versions of Enca try to guess
     your language from other locale settings too.)
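
     For scripts that call Enca, the same setting can be passed through the
     environment.  This is only a sketch: the helpers enca_env() and
     detect() are hypothetical, “cs” is assumed to be the language code you
     want, and enca is run only if it is found on PATH.

```python
import os
import shutil
import subprocess


def enca_env(language, base_env=None):
    """Return a copy of the environment with ENCAOPT set to -L LANGUAGE."""
    env = dict(os.environ if base_env is None else base_env)
    env["ENCAOPT"] = f"-L {language}"
    return env


def detect(path, language="cs"):
    """Run enca on PATH if it is installed; return its output, else None."""
    if shutil.which("enca") is None:
        return None  # enca is not installed, so there is nothing to run
    result = subprocess.run(["enca", path], env=enca_env(language),
                            capture_output=True, text=True)
    return result.stdout.strip()
```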

* Q: Why doesn't “enca -x ascii” work?
* A: Unfortunately there are several different things people call “conversion
     to ASCII”.  Consider the following characters: ě (Latin small letter e
     with caron) and ≫ (Much greater-than).  By conversion to ASCII you may
     mean:
     1a. Omitting these characters in output, because they are not
         representable in ASCII.
     1b. Keeping these characters intact in output, because they are not
         representable in ASCII.
     1c. Failing (possibly damaging the file), because they are not
         representable in ASCII.
     2. Approximating them with single, close ASCII characters; in this case
        probably plain “e” and “>”.
     3. Expanding them to sequences of ASCII characters which may be less
        readable than the approximations above, but generally allow
        reconstruction of the original characters.  Such sequences could
        be e.g. RFC-1345 mnemonics “e<” and “>>”.
     What happens when you run “enca -x ascii” on something, depends on
     the converter used.

     The usual scenario is: Enca uses librecode or libiconv, which do (1),
     and you get upset.  The only tool known to me that does (2) is cstocs,
     and it does so only for Latin2 characters (install cstocs and specify
     cstocs as the converter to be used, if you want (2)).  AFAIK, there's no
     tool that does (3) reasonably, though recode can expand to mnemonics
     (use rfc1345 as target charset instead of ascii).
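
     The difference between the meanings listed above can be seen in a few
     lines of Python.  This is a sketch, not what Enca or its converters do
     internally: the APPROX table is a hand-made assumption, and for (3)
     Python's backslash escapes stand in for RFC-1345 mnemonics, because the
     standard library has no rfc1345 codec.  Case (1b) is left out since its
     output is not ASCII at all.

```python
import unicodedata

TEXT = "ě≫"  # the two sample characters from the answer above

# Hypothetical hand-made table for symbols without a Unicode decomposition.
APPROX = {"≫": ">"}


def omit(text):
    """(1a) Drop every character that is not representable in ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def fail(text):
    """(1c) Refuse to convert: raises UnicodeEncodeError on non-ASCII."""
    return text.encode("ascii", errors="strict").decode("ascii")


def approximate(text):
    """(2) Replace each character with a close plain ASCII character."""
    out = []
    for ch in text:
        if ch in APPROX:
            out.append(APPROX[ch])
            continue
        # NFKD splits "ě" into "e" plus a combining caron; keep the ASCII part.
        parts = unicodedata.normalize("NFKD", ch)
        out.append(parts.encode("ascii", errors="ignore").decode("ascii"))
    return "".join(out)


def expand(text):
    """(3) Expand to longer ASCII sequences that keep the original code points."""
    return text.encode("ascii", errors="backslashreplace").decode("ascii")


print(repr(omit(TEXT)))
print(approximate(TEXT))
print(expand(TEXT))
```

     Each function yields a different "ASCII" result for the same input,
     which is why the outcome of “-x ascii” depends on the converter.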

* Q: Why doesn't “enca -E cstocs” work on my RedHat/Fedora?
* A: This is a cstocs problem.  Cstocs is broken for Perl ≥ 5.8 and UTF-8
     locales.  Perl Unicode handling changes with every version, so an
     advanced charset converter working in multiple Perl versions is
     something between impossible and a big mess.  Either set your locales
     to non-UTF-8 (e.g. ISO-8859-2) or don't use cstocs until this is
     resolved somehow.