Tree - source-git/enca - CentOS Git server

source-git / enca

Files

Commit: 11bac313f4b3bb64b62bc7d8b2007d1ea1098218
Blob Blame History Raw
.de XA
.RS
.PP
\\$1
.RE
.PP
..
.TH "enca" "1" "Sep 2009" "enca 1.11" " "
.SH "NAME"
.PP
enca \-\- detect and convert encoding of text files
.
.
.SH "SYNOPSIS"
.PP
\fBenca\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]...
.br
\fBenconv\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]...
.
.SH "INTRODUCTION AND EXAMPLES"
.PP
If you are lucky enough, the only two things you will ever need to know are:
command
.XA "enca \fIFILE\fR"
will tell you which encoding file \fIFILE\fR uses (without changing it), and
.XA "enconv \fIFILE\fR"
will convert file \fIFILE\fR to your locale native encoding.
To convert the file to some other encoding use the \fB-x\fR option
(see \fB\-x\fR entry in section \fBOPTIONS\fR and sections \fBCONVERSION\fR
and \fBENCODINGS\fR for details).
.PP
Both work with multiple files and standard input (output) too.
E.g.
.XA "enca \-x latin2 <sometext | lpr"
assures file `sometext' is in ISO Latin\~2 when it's sent to printer.
.PP
The main reason why these command will fail and turn your files into
garbage is that Enca needs to know their language to detect the encoding.
It tries to determine your language and preferred charset from locale
settings, which might not be what you want.
.PP
You can (or have to) use \fB\-L\fR option to tell it the right language.
Suppose, you downloaded some Russian HTML file,
`file.htm', it claims it's windows-1251 but it isn't.
So you run
.XA "enca \-L ru file.htm"
and find out it's KOI8\-R (for example).
Be warned, currently there are not many supported languages (see section
\fBLANGUAGES\fR).
.PP
Another warning concerns the fact several Enca's features, namely its
charset conversion capabilities, strongly depend on what other tools
are installed on your system (see section \fBCONVERSION)\fR\-\-run
.XA "enca \-\-version"
to get list of features (see section \fBFEATURES\fR).
Also try
.XA "enca \-\-help"
to get description of all other Enca options (and to find the rest of this
manual page redundant).
.
.
.SH "DESCRIPTION"
.PP
Enca reads given text files, or standard input when none are given,
and uses knowledge about their language (must be supported by you)
and a mixture of parsing, statistical analysis, guessing and
black magic to determine their encodings, which it then prints to standard
output (or it confesses it doesn't have any idea what the encoding could be).
By default, Enca presents results as a multiline human-readable descriptions,
several other formats are available\-\-see Output type selectors below.
.PP
Enca can also convert files to some other encoding \fIENC\fR
when you ask for it\-\-either using a built\-in converter,
some conversion library, or by calling an external converter.
.PP
Enca's primary goal is to be usable unattended, as an automatic conversion
tool, though it perhaps have not reached this point yet (please see section
\fBSECURITY\fR).
.PP
Please note except rare cases Enca really has to know the language of input
files to give you a reliable answer.
On the other hand, it can then cope quite well with files that are not purely
textual or even detect charset of text strings inside some binary file;
of course, it depends on the character of the non-text component.
.PP
Enca doesn't care about structure of input files, it views them as a uniform
piece of text/data.  In case of multipart files (e.g. mailboxes), you have to
use some tool knowing the structure to extract the individual parts first.
It's the cost of ability to detect encodings of any damaged, incomplete or
otherwise incorrect files.
.
.
.SH "OPTIONS"
.PP
There are several categories of options: operation mode options, output type
selectors, guessing parameters, conversion parameters, general options and
listings.
.PP
All long options can be abbreviated as long as they are unambiguous,
mandatory parameters of long options are mandatory for short options too.
.PP
.
.SS Operation modes
.PP
are following:
.TP
\fB\-c\fR, \fB\-\-auto\-convert\fR
Equivalent to calling Enca as \fBenconv\fR.
.sp
If no output type selector is specified, detect file encodings, guess your
preferred charset from locales, and convert files to it (only available with
+target\-charset\-auto feature).
.TP
\fB\-g\fR, \fB\-\-guess\fR
Equivalent to calling Enca as \fBenca\fR.
.sp
If no output type selector is specified, detect file encodings and report
them.
.PP
.
.SS Output type selectors
.PP
select what action Enca will take when it determines the encoding;
most of them just choose between different names, formats and conventions
how encodings can be printed, but one of them (\fB\-x\fR)
is special: it tells Enca to recode files to some other
encoding \fIENC\fR.
These options are mutually exclusive; if you specify more than one output
type selector the last one takes precedence.
.sp
Several output types represent charset name used by some other program,
but not all these programs know all the charsets which Enca recognises.
Be warned, Enca makes no difference between unrecognised charset and
charset having no name in given namespace in such situations.
.TP
\fB\-d\fR, \fB\-\-details\fR
It used to print a few pages of details about the guessing process, but since
Enca is just a program linked against Enca library, this is not possible and
this option is roughly equivalent to \fB\-\-human\-readable\fR,
except it reports failure reason when Enca doesn't recognize the encoding.
.TP
\fB\-e\fR, \fB\-\-enca\-name\fR
Prints Enca's nice name of the charset, i.e., perhaps the most generally
accepted and more or less human-readable charset identifier,
with surfaces appended.
.sp
This name is used when calling an external converter, too.
.TP
\fB\-f\fR, \fB\-\-human\-readable\fR
Prints verbal description of the detected charset and surfaces\-\-something
a human understands best.
This is the default behaviour.
.sp
The precise format is following: the first line contains charset name alone,
and it's followed by zero or more indented lines containing names of detected
surfaces.
This format is not, however, suitable or intended for further
machine-processing, and the verbal charset descriptions are like to change
in the future.
.TP
\fB\-i\fR, \fB\-\-iconv\-name\fR
Prints how \fIiconv\fR(3) (and/or \fIiconv\fR(1)) calls the detected charset.
More precisely, it prints one, more or less arbitrarily chosen, alias
accepted by iconv.
A charset unknown to iconv counts as unknown.
.sp
This output type makes sense only when Enca is compiled with iconv support
(feature +iconv\-interface).
.TP
\fB\-r\fR, \fB\-\-rfc1345\-name\fR
Prints RFC\~1345 charset name.
When such a name doesn't exist because RFC\~1345 doesn't define a given
encoding, some other name defined in some other RFC or just the name which
author considers `the most canonical', is printed.
.sp
Since RFC\~1345 doesn't define surfaces, no surface info is appended.
.TP
\fB\-m\fR, \fB\-\-mime\-name\fR
Prints preferred MIME name of detected charset.  This is the name you should
normally use when fixing e-mails or web pages.
.sp
A charset not present in http://www.iana.org/assignments/character-sets
counts as unknown.
.TP
\fB\-s\fR, \fB\-\-cstocs\-name\fR
Prints how \fIcstocs\fR(1) calls the detected charset.
A charset unknown to cstocs counts as unknown.
.TP
\fB\-n\fR, \fB\-\-name=\fR\fIWORD\fR
Prints charset (encoding) name selected by \fIWORD\fR (can be abbreviated as
long as is unambiguous).
For names listed above, \fB\-\-name=\fR\fIWORD\fR is equivalent to
\fB\-\-\fR\fIWORD\fR.
.sp
Using \fBaliases\fR as the output type causes Enca to print list of all
accepted aliases of detected charset.
.TP
\fB\-x\fR, \fB\-\-convert\-to=\fR[\fB..\fR]\fIENC\fR
Converts file to encoding \fIENC\fR.
.sp
The optional `..' before encoding name has no special meaning, except you can
use it to remind yourself that, unlike in \fIrecode\fR(1), you should specify
\fIdesired\fR encoding, instead of current.
.sp
You can use \fIrecode\fR(1) recoding chains or any other kind of braindead
recoding specification for \fIENC\fR, provided that you tell Enca to use some
tool understanding it for conversion (see section \fBCONVERSION\fR).
.sp
When Enca fails to determine the encoding, it prints a warning and leaves the
the file as is; when it is run as a filter it tries to do its best to copy
standard input to standard output unchanged.
Nevertheless, you should not rely on it and do backup.
.PP
.
.SS Guessing parameters
.PP
There's only one: \fB\-L\fR setting language of input files. This option is
mandatory (but see below).
.TP
\fB\-L\fR, \fB\-\-language=\fR\fILANG\fR
Sets language of input files to \fILANG\fR.
.sp
More precisely, \fILANG\fR can be any valid locale name (or alias with
+locale\-alias feature) of some supported language.
You can also specify `none' as language name, only multibyte encodings are
recognised then.
Run
.sp
enca \-\-list languages
.sp
to get list of supported languages.
When you don't specify any language Enca tries to guess your language from
locale settings and assumes input files use this language.
See section \fBLANGUAGES\fR for details.
.PP
.
.SS Conversion parameters
.PP
give you finer control of how charset conversion will be performed.
They don't affect anything when \fB\-x\fR is not specified as output type.
Please see section \fBCONVERSION\fR for the gory conversion details.
.TP
\fB\-C\fR, \fB\-\-try\-converters=\fR\fILIST\fR
Appends comma separated \fILIST\fR to the list of converters that will
be tried when you ask for conversion.
Their names can be abbreviated as long as they are unambiguous.
Run
.sp
enca \-\-list converters
.sp
to get list of all valid converter names (and see section \fBCONVERSION\fR
for their description).
.sp
The default list depends on how Enca has been compiled, run
.sp
enca \-\-help
.sp
to find out default converter list.
.sp
Note the default list is used only when you don't specify \fB\-C\fR at all.
Otherwise, the list is built as if it were initially empty and every
\fB\-C\fR adds new converter(s) to it.  Moreover, specifying
\fBnone\fR as converter name causes clearing the converter list.
.TP
\fB\-E\fR, \fB\-\-external\-converter\-program=\fR\fIPATH\fR
Sets external converter program name to \fIPATH\fR.
Default external converter depends on how enca has been complied, and the
possibility to use external converters may not be available at all.
Run
.sp
enca \-\-help
.sp
to find out default converter program in your enca build.
.PP
.
.SS General options
.PP
don't fit to other option categories...
.TP
\fB\-p\fR, \fB\-\-with\-filename\fR
Forces Enca to prefix each result with corresponding file name.
By default, Enca prefixes results with filenames when run on multiple files.
.sp
Standard input is printed as \fBSTDIN\fR
and standard output as \fBSTDOUT\fR
(the latter can be probably seen in error messages only).
.TP
\fB\-P\fR, \fB\-\-no\-filename\fR
Forces Enca to not prefix results with file names.
By default, Enca doesn't prefix result with file name when
run on a single file (including standard input).
.TP
\fB\-V\fR, \fB\-\-verbose\fR
Increases verbosity level (each use increases it by one).
.sp
Currently this option in not very useful because different parts of Enca
respond differently to the same verbosity level, mostly not at all.
.PP
.
.SS Listings
.PP
are all terminal, i.e. when Enca encounters some of them it prints
the required listing and terminates without processing any following options.
.TP
\fB\-h\fR, \fB\-\-help\fR
Prints brief usage help.
.TP
\fB\-G\fR, \fB\-\-license\fR
Prints full Enca license (through a pager, if possible).
.TP
\fB\-l\fR, \fB\-\-list=\fR\fIWORD\fR
Prints list specified by \fIWORD\fR (can be abbreviated as long as it is
unambiguous).
Available lists include:
.sp
\fBbuilt\-in\-charsets\fR.
All encodings convertible by built\-in converter, by group
(both input and output encoding must be from this list and belong to the same
group for internal conversion).
.sp
\fBbuilt\-in\-encodings\fR.
Equivalent to \fBbuilt\-in\-charsets\fR, but considered obsolete; will
be accepted with a warning, for a while.
.sp
\fBconverters\fR.
All valid converter names (to be used with \fB\-C\fR).
.sp
\fBcharsets\fR.
All encodings (charsets).
You can select what names will be printed with \fB\-\-name\fR or any
name output type selector (of course, only encodings having a name in given
namespace will be printed then), the selector must be specified \fIbefore\fR
\fB\-\-list\fR.
.sp
\fBencodings\fR.
Equivalent to \fBcharsets\fR, but considered obsolete; will
be accepted with a warning, for a while.
.sp
\fBlanguages\fR.
All supported languages together with charsets belonging to them.
Note output type selects language name style, not charset name style here.
.sp
\fBnames\fR.
All possible values of \fB\-\-name\fR option.
.sp
\fBlists\fR.
All possible values of this option.
(Crazy?)
.sp
\fBsurfaces\fR.
All surfaces Enca recognises.
.TP
\fB\-v\fR, \fB\-\-version\fR
Prints program version and list of features (see section \fBFEATURES\fR).
.
.
.SH "CONVERSION"
.PP
Though Enca has been originally designed as a tool for guessing encoding
only, it now features several methods of charset conversion.
You can control which of them will be used with \fB\-C\fR.
.PP
Enca sequentially tries converters from the list specified by \fB\-C\fR
until it finds some that
is able to perform required conversion or until it exhausts the list.
You should specify preferred converters first, less preferred later.
External converter (\fBextern\fR)
should be always specified last, only as last resort, since it's usually not
possible to recover when it fails.
The default list of converters always starts with \fBbuilt\-in\fR and then
continues with the first one available from: \fBlibrecode\fR, \fBiconv\fR,
nothing.
.PP
It should be noted when Enca says it is not able to perform the
conversion it only means none of the converters is able to perform it.
It can be still possible to perform the required conversion in several steps,
using several converters, but to figure out how, human intelligence is
probably needed.
.PP
.
.SS Built\-in converter
.PP
is the simplest and far the fastest of all, can perform only
a few byte-to-byte conversions and modifies files directly in place (may
be considered dangerous, but is pretty efficient).  You can get list of
all encodings it can convert with
.XA "enca \-\-list built\-in"
Beside speed, its main advantage (and also disadvantage) is that it doesn't
care: it simply converts characters having a representation in target
encoding, doesn't touch anything else and never prints any error message.
.sp
This converter can be specified as \fBbuilt\-in\fR with \fB\-C\fR.
.PP
.
.SS Librecode converter
.PP
is an interface to GNU recode library, that does the actual recoding job.
It may or may not be compiled in; run
.XA "enca \-\-version"
to find out its availability in your enca build
(feature +librecode\-interface).
.sp
You should be familiar with \fIrecode\fR(1) before using it,
since recode is a quite sophisticated and powerful charset conversion tool.
You may run into problems using it together with Enca
particularly because Enca's support for surfaces not 100% compatible,
because recode tries too hard to make the transformation reversible,
because it sometimes silently ignores I/O errors,
and because it's incredibly buggy.
Please see GNU recode info pages for details about recode library.
.sp
This converter can be specified as \fBlibrecode\fR with \fB\-C\fR.
.PP
.
.SS Iconv converter
.PP
is an interface to the UNIX98 \fIiconv\fR(3)
conversion functions, that do the actual recoding job.
It may or may not be compiled in; run
.XA "enca \-\-version"
to find out its availability in your enca build
(feature +iconv\-interface).
.sp
While iconv is present on most today systems it only rarely
offer some useful set of available conversions, the only notable exception
being iconv from GNU libc.
It is usually quite picky about surfaces, too (while, at the same time,
not implementing surface conversion).
It however probably represents the only standard(ized) tool
able to perform conversion from/to Unicode.
Please see iconv documentation about for details about its capabilities on
your particular system.
.sp
This converter can be specified as \fBiconv\fR with \fB\-C\fR.
.PP
.
.SS External converter
.PP
is an arbitrary external conversion tool that can be specified with
\fB\-E\fR option (at most one can be defined simultaneously).
There are some standard, provided together with enca: \fBcstocs\fR,
\fBrecode\fR, \fBmap\fR, \fBumap\fR, and \fBpiconv\fR.
All are wrapper scripts: for \fIcstocs\fR(1), \fIrecode\fR(1),
\fImap\fR(1), \fIumap\fR(1), and \fIpiconv\fR(1).
.sp
Please note enca has little control what the external converter really does.
If you set it to \fB/bin/rm\fR
you are fully responsible for the consequences.
.sp
If you want to make your own converter to use with enca,
you should know it is always called
.XA "\fICONVERTER\fR \fIENC_CURRENT\fR \fIENC\fR \fIFILE\fR [\fB\-\fR]"
where \fICONVERTER\fR is what has been set by \fB\-E\fR,
\fIENC_CURRENT\fR is detected encoding,
\fIENC\fR is what has been specified with \fB\-x\fR,
and \fIFILE\fR is the file to convert, i.e. it is called for each file
separately.
The optional fourth parameter, \fB\-\fR, should cause (when present)
sending result of conversion to standard output instead of overwriting
the file \fIFILE\fR.
The converter should also take care of not changing file permissions,
returning error code\~1 when it fails and cleaning its temporary files.
Please see the standard external converters for examples.
.sp
This converter can be specified as \fBextern\fR with \fI\-C\fR.
.PP
.
.SS Default target charset
.PP
The straightforward way of specifying target charset is the \fB\-x\fR
option, which overrides any defaults.
When Enca is called as \fBenconv\fR, default target charset is selected
exactly the same way as \fIrecode\fR(1) does it.
.PP
If the \fBDEFAULT_CHARSET\fR environment variable is set, it's used as the
target charset.
.PP
Otherwise, if you system provides the \fInl_langinfo\fR(3) function, current
locale's native charset is used as the target charset.
.PP
When both methods fail, Enca complains and terminates.
.PP
.
.SS Reversibility notes
.PP
If reversibility is crucial for you, you shouldn't use enca as converter
at all (or maybe you can, with very specifically designed \fIrecode\fR(1)
wrapper).
Otherwise you should at least know that there four
basic means of handling inconvertible character entities:
.sp
fail\-\-this is a possibility, too, and incidentally it's exactly what current
GNU libc iconv implementation does (recode can be also told to do it)
.sp
don't touch them\-\-this is what enca internal converter always does and
recode can do; though it is not reversible, a human being is usually able to
reconstruct the original (at least in principle)
.sp
approximate them\-\-this is what cstocs can do, and recode too, though
differently; and the best choice if you
just want to make the accursed text readable
.sp
drop them out\-\-this is what both recode and cstocs can do (cstocs can also
replace these characters by some fixed character instead of mere ignoring);
useful when the to\-be\-omitted characters contain only noise.
.sp
Please consult your favourite converter manual for details of this issue.
Generally, if you are not lucky enough to have all convertible characters
in you file, manual intervention is needed anyway.
.PP
.
.SS Performance notes
.PP
Poor performance of available converters has been one of main reasons for
including built\-in converter in enca.
Try to use it whenever possible, i.e. when files in consideration are
charset-clean enough or charset-messy enough so that its zero built\-in
intelligence doesn't matter.
It requires no extra disk space nor extra memory and can outperform
\fIrecode\fR(1) more than 10 times on large files and Perl
version (i.e. the faster one) of \fIcstocs\fR(1) more than 400 times on small
files (in fact it's almost as fast as mere \fIcp\fR(1)).
.PP
Try to avoid external converters when it's not absolutely necessary since
all the forking and moving stuff around is incredibly slow.
.
.
.SH "ENCODINGS"
.PP
You can get list of recognised character sets with
.XA "enca \-\-list charsets"
and using \fB\-\-name\fR parameter you can select any name you want to be
used in the listing.
You can also list all surfaces with
.XA "enca \-\-list surfaces"
Encoding and surface names are case insensitive and non-alphanumeric
characters are not taken into account.
However, non-alphanumeric characters are mostly
not allowed at all.  The only allowed are: `\-', `_', `.', `:', and\~`/'
(as charset/surface separator).
So `ibm852' and `IBM-852' are the same, while `IBM 852' is not accepted.
.PP
.
.SS Charsets
.PP
Following list of recognised charsets uses Enca's names (\fB\-e\fR) and
verbal descriptions as reported by Enca (\fB\-f\fR):
.PP
.TS
tab (@);
l l.
ASCII@7bit ASCII characters
ISO-8859-2@ISO 8859-2 standard; ISO Latin 2
ISO-8859-4@ISO 8859-4 standard; Latin 4
ISO-8859-5@ISO 8859-5 standard; ISO Cyrillic
ISO-8859-13@ISO 8859-13 standard; ISO Baltic; Latin 7
ISO-8859-16@ISO 8859-16 standard
CP1125@MS-Windows code page 1125
CP1250@MS-Windows code page 1250
CP1251@MS-Windows code page 1251
CP1257@MS-Windows code page 1257; WinBaltRim
IBM852@IBM/MS code page 852; PC (DOS) Latin 2
IBM855@IBM/MS code page 855
IBM775@IBM/MS code page 775
IBM866@IBM/MS code page 866
baltic@ISO-IR-179; Baltic
KEYBCS2@Kamenicky encoding; KEYBCS2
macce@Macintosh Central European
maccyr@Macintosh Cyrillic
ECMA-113@Ecma Cyrillic; ECMA-113
KOI-8_CS_2@KOI8-CS2 code (`T602')
KOI8-R@KOI8-R Cyrillic
KOI8-U@KOI8-U Cyrillic
KOI8-UNI@KOI8-Unified Cyrillic
TeX@(La)TeX control sequences
UCS-2@Universal character set 2 bytes; UCS-2; BMP
UCS-4@Universal character set 4 bytes; UCS-4; ISO-10646
UTF-7@Universal transformation format 7 bits; UTF-7
UTF-8@Universal transformation format 8 bits; UTF-8
CORK@Cork encoding; T1
GBK@Simplified Chinese National Standard; GB2312
BIG5@Traditional Chinese Industrial Standard; Big5
HZ@HZ encoded GB2312
unknown@Unrecognized encoding
.TE
.PP
where \fBunknown\fR is not any real encoding,
it's reported when Enca is not able to give a reliable answer.
.PP
.
.SS Surfaces
.PP
Enca has some experimental support for so-called surfaces (see below).
It detects following surfaces (not all can be applied to all charsets):
.PP
.TS
tab (@);
l l.
/CR@CR line terminators
/LF@LF line terminators
/CRLF@CRLF line terminators
N.A.@Mixed line terminators
N.A.@Surrounded by/intermixed with non-text data
/21@Byte order reversed in pairs (1,2 -> 2,1)
/4321@Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1)
N.A.@Both little and big endian chunks, concatenated
/qp@Quoted-printable encoded
.TE
.PP
Note some surfaces have N.A. in place of identifier\-\-they
cannot be specified on command line, they can only be reported by Enca.
This is intentional because they only inform you why the file cannot be
considered surface-consistent instead of representing a real surface.
.PP
Each charset has its natural surface (called `implied' in recode) which is not
reported, e.g., for IBM 852 charset it's `CRLF line terminators'.
For UCS encodings, big endian is considered as natural surface;
unusual byte orders are constructed from 21 and 4321 permutations:
2143 is reported simply as 21,
while 3412 is reported as combination of 4321 and 21.
.PP
Doubly-encoded UTF-8 is neither charset nor surface, it's just reported.
.PP
.
.SS About charsets, encodings and surfaces
.PP
Charset is a set of character entities while encoding is its representation
in the terms of bytes and bits.
In Enca, the word \fIencoding\fR means the same as `representation of text',
i.e. the relation between sequence of character entities constituting the
text and sequence of bytes (bits) constituting the file.
.PP
So, encoding is both character set and so-called surface
(line terminators, byte order, combining, Base64 transformation, etc.).
Nevertheless, it proves convenient to work with some {charset,surface} pairs
as with genuine charsets.
So, as in \fIrecode\fR(1), all UCS- and UTF- encodings of Universal character
set are called charsets.
Please see recode documentation for more details of this issue.
.PP
The only good thing about surfaces is: when you don't start playing with
them, neither Enca won't start and it will try to behave as much as
possible as a surface-unaware program, even when talking to recode.
.PP
.
.
.SH "LANGUAGES"
.PP
Enca needs to know the language of input files to work reliably, at least
in case of regular 8bit encoding.
Multibyte encodings should be recognised for any Latin, Cyrillic or Greek
language.
.PP
You can (or have to) use \fB\-L\fR option to tell Enca the language.
Since people most often work with files in the same language for which they
have configured locales, Enca tries tries to guess the language by examining
value of \fBLC_CTYPE\fR and other locale categories
(please see \fIlocale\fR(7)) and using it for the
language when you don't specify any.
Of course, it may be completely wrong and will give you nonsense answers and
damage your files, so please don't forget to use the \fB\-L\fR option.
You can also use \fBENCAOPT\fR environment variable to set a default language
(see section \fBENVIRONMENT\fR).
.PP
Following languages are supported by Enca (each language is listed together
with supported 8bit encodings).
.PP
.TS
tab (@);
l l.
Belarusian @CP1251 IBM866 ISO\-8859\-5 KOI8\-UNI maccyr IBM855
Bulgarian  @CP1251 ISO\-8859\-5 IBM855 maccyr ECMA\-113
Czech      @ISO\-8859\-2 CP1250 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK
Estonian   @ISO\-8859\-4 CP1257 IBM775 ISO\-8859\-13 macce baltic
Croatian   @CP1250 ISO\-8859\-2 IBM852 macce CORK
Hungarian  @ISO\-8859\-2 CP1250 IBM852 macce CORK
Lithuanian @CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic
Latvian    @CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic
Polish     @ISO\-8859\-2 CP1250 IBM852 macce ISO\-8859\-13 ISO\-8859\-16 baltic CORK
Russian    @KOI8\-R CP1251 ISO\-8859\-5 IBM866 maccyr
Slovak     @CP1250 ISO\-8859\-2 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK
Slovene    @ISO\-8859\-2 CP1250 IBM852 macce CORK
Ukrainian  @CP1251 IBM855 ISO\-8859\-5 CP1125 KOI8\-U maccyr
Chinese    @GBK BIG5 HZ
none       @
.TE
.PP
The special language \fBnone\fR can be shortened to \fB__\fR, it
contains no 8bit encodings, so only multibyte encodings are detected.
.PP
You can also use locale names instead of languages:
.TS
tab (@);
l l.
Belarusian   @be
Bulgarian    @bg
Czech        @cs
Estonian     @et
Croatian     @hr
Hungarian    @hu
Lithuanian   @lt
Latvian      @lv
Polish       @pl
Russian      @ru
Slovak       @sk
Slovene      @sl
Ukrainian    @uk
Chinese      @zh
.TE
.PP
.
.
.SH "FEATURES"
.PP
Several Enca's features depend on what is available on your system and how
it was compiled.
You can get their list with
.XA "enca \-\-version"
Plus sign before a feature name means it's available, minus sign means
this build lacks the particular feature.
.PP
\fBlibrecode\-interface\fR.
Enca has interface to GNU recode library charset conversion functions.
.sp
\fBiconv\-interface\fR.
Enca has interface to UNIX98 iconv charset conversion functions.
.sp
\fBexternal\-converter\fR.
Enca can use external conversion programs (if you have some suitable
installed).
.sp
\fBlanguage\-detection\fR.
Enca tries to guess language (\fB\-L\fR) from locales.  You don't need the
\fB\-\-language\fR option, at least in principle.
.sp
\fBlocale\-alias\fR.
Enca is able to decrypt locale aliases used for language names.
.sp
\fBtarget\-charset\-auto\fR.
Enca tries to detect your preferred charset from locales.
Option \fB\-\-auto\-convert\fR and calling Enca as \fBenconv\fR works, at
least in principle.
.sp
\fBENCAOPT\fR.
Enca is able to correctly parse this environment variable before command line
parameters.  Simple stuff like \fBENCAOPT="\-L uk"\fR will work even without
this feature.
.PP
.
.
.SH "ENVIRONMENT"
.PP
The variable \fBENCAOPT\fR can hold set of default Enca options.
Its content is interpreted before command line arguments.
Unfortunately, this doesn't work everywhere (must have +ENCAOPT
feature).
.PP
\fBLC_CTYPE\fR, \fBLC_COLLATE\fR, \fBLC_MESSAGES\fR
(possibly inherited from \fBLC_ALL\fR or \fBLANG\fR) is used
for guessing your language (must have +language-detection feature).
.PP
The variable \fBDEFAULT_CHARSET\fR can be used by \fBenconv\fR as the default
target charset.
.PP
.
.
.SH "DIAGNOSTICS"
.PP
Enca returns exit code\~0 when all input files were successfully proceeded
(i.e. all encodings were detected and all files were converted to required
encoding, if conversion was asked for).
Exit code\~1 is returned when Enca wasn't able to either guess encoding or
perform conversion on any input file because it's not clever enough.
Exit code\~2 is returned in case of serious (e.g. I/O) troubles.
.PP
.
.
.SH "SECURITY"
.PP
It should be possible to let Enca work unattended, it's its goal. However:
.PP
There's no warranty the detection works 100%. Don't bet on it, you can easily
lose valuable data.
.PP
Don't use enca (the program), link to libenca instead if you want anything
resembling security. You have to perform the eventual conversion yourself
then.
.PP
Don't use external converters. Ideally, disable them compile-time.
.PP
Be aware of \fBENCAOPT\fR and all the built-in automagic guessing various
things from environment, namely locales.
.PP
.
.
.SH "SEE ALSO"
.PP
\fIautoconvert\fR(1),
\fIcstocs\fR(1),
\fIfile\fR(1),
\fIiconv\fR(1),
\fIiconv\fR(3),
\fInl_langinfo\fR(3),
\fImap\fR(1),
\fIpiconv\fR(1),
\fIrecode\fR(1),
\fIlocale\fR(5),
\fIlocale\fR(7),
\fIltt\fR(1),
\fIumap\fR(1),
\fIunicode\fR(7),
\fIutf-8\fR(7),
\fIxcode\fR(1)
.PP
.
.
.SH "KNOWN BUGS"
.PP
It has too many \fIunknown\fR bugs.
.PP
The idea of using \fBLC_*\fR value for language is certainly braindead.
However I like it.
.PP
It can't backup files before mangling them.
.PP
In certain situations, it may behave incorrectly on >31bit file systems
and/or over NFS (both untested but shouldn't cause problems in practice).
.PP
Built\-in converter does not convert character `ch' from \fIKOI8-CS2\fR,
and possibly some other characters you've probably never heard about anyway.
.PP
EOL type recognition works poorly on Quoted-printable encoded files.
This should be fixed someday.
.PP
There are no command line options to tune libenca parameters.
This is intentional (Enca should DWIM) but sometimes this is a nuisance.
.PP
The manual page is too long, especially this section.
This doesn't matter since nobody does read it.
.PP
Send bug reports to <https://github.com/nijel/enca/issues>.
.
.
.SH "TRIVIA"
.PP
Enca is Extremely Naive Charset Analyser.
Nevertheless, the `enc' originally comes from `encoding'
so the leading\~`e' should be read as in
`encoding' not as in `extreme'.
.
.
.SH "AUTHORS"
.PP
David Necas (Yeti) <yeti@physics.muni.cz>
.PP
Michal Cihar <michal@cihar.com>
.sp
Unicode data has been generated from various (free) on\-line resources or
using GNU recode.
Statistical data has been generated from various texts on the Net, I hope
character counting doesn't break anyone's copyright.
.
.
.SH "ACKNOWLEDGEMENTS"
.PP
Please see the file THANKS in distribution.
.
.
.SH "COPYRIGHT"
.PP
Copyright (C) 2000-2003 David Necas (Yeti).
.PP
Copyright (C) 2009 Michal Cihar <michal@cihar.com>.
.sp
Enca is free software; you can redistribute it and/or modify it
under the terms of version 2 of the GNU General Public License
as published by the Free Software Foundation.
.sp
Enca is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty
of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.
.sp
You should have received a copy of the GNU General Public License
along with Enca; if not, write to the Free Software Foundation,
Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
.
source-git / enca

Source Code

Files