=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 2

=item *

The name used by the Perl community.  That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the encoding method directly, so
frequently used names like 'utf8' do not need an alias lookup.

=item *

The MIME name as defined in IETF RFCs.  This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where a I<de jure> canonical name differs from the one used by the
Encode module, it is always provided as an alias, as long as the
encoding is implemented at all.  So you can safely tell whether a given
encoding is implemented just by passing its canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.
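
The alias and canonical-name rules above can be checked from code.
The following is a minimal sketch, assuming a stock Encode installation.
It uses only C<find_encoding> and C<resolve_alias>, both documented in
L<Encode>; the name C<no-such-encoding> is a made-up placeholder.

```perl
use strict;
use warnings;
use Encode ();

# An alias resolves to the canonical name of its encoding.
my $canonical = Encode::resolve_alias("latin1");
print "latin1 => $canonical\n";

# Lookups are case insensitive.
my $enc = Encode::find_encoding("ISO-8859-1");
print "object name: ", $enc->name, "\n";

# An unimplemented (here: made-up) name resolves to a false value,
# so the same call doubles as an "is it implemented?" test.
my $missing = Encode::resolve_alias("no-such-encoding");
print "no-such-encoding is ",
    ( $missing ? "implemented" : "not implemented" ), "\n";
```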

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm will automatically load those modules on demand.
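
To see which encodings a given installation actually offers, you can
ask Encode itself.  This is a minimal sketch; C<< Encode->encodings(":all") >>
is documented in L<Encode>, and the exact list depends on the installed
Encode version and on any extra modules.  The sample character is
arbitrary.

```perl
use strict;
use warnings;
use Encode;

# Names of every encoding Encode knows about, including those whose
# implementing module (Encode::JP, Encode::KR, ...) is not loaded yet.
my @all = Encode->encodings(":all");
print scalar(@all), " encodings available\n";

# No "use Encode::JP" is needed: the module is loaded on demand
# the first time one of its encodings is requested.
my $octets = encode( "euc-jp", "\x{3042}" );    # HIRAGANA LETTER A
print "euc-jp octets: ", unpack( "H*", $octets ), "\n";
```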

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii ISO-646-US                         [ECMA]
  ascii-ctrl                                      Special Encoding
  iso-8859-1    latin1                                       [ISO]
  null                                            Special Encoding
  utf8          UTF-8                                    [RFC2279]
  ----------------------------------------------------------------

I<null> and I<ascii-ctrl> are special.  "null" fails for all characters,
so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references.  Ditto for
"ascii-ctrl" except for control characters.  For fallback modes, see
L<Encode>.
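
The fallback modes mentioned above are easiest to see with C<ascii>,
which rejects every non-ASCII character; per the description above,
C<null> does the same for every character.  The following is a minimal
sketch using the C<FB_PERLQQ>, C<FB_HTMLCREF>, C<FB_XMLCREF> and
C<LEAVE_SRC> constants documented in L<Encode>.  The sample string is
arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # "cafe" with LATIN SMALL LETTER E WITH ACUTE

# LEAVE_SRC keeps encode() from modifying $text in place.
my $perlqq = encode( "ascii", $text, Encode::FB_PERLQQ()   | Encode::LEAVE_SRC() );
my $html   = encode( "ascii", $text, Encode::FB_HTMLCREF() | Encode::LEAVE_SRC() );
my $xml    = encode( "ascii", $text, Encode::FB_XMLCREF()  | Encode::LEAVE_SRC() );

print "$perlqq\n";    # caf\x{00e9}
print "$html\n";      # caf&#233;
print "$xml\n";       # caf&#xe9;
```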

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE      UCS-4                                         [UC]
  UTF-32LE                                                    [UC]
  UTF-7                                                  [RFC2152]
  ----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding.  It is implemented separately by Encode::Unicode::UTF7.
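
As a quick illustration of how the byte-order variants listed above
differ, the following minimal sketch encodes one character with three
of them.  The sample character is arbitrary; the behavior shown (the
unsuffixed C<UTF-16> prepends a big-endian BOM when encoding) is the
one described in L<Encode::Unicode>.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "A";    # U+0041

my $be  = encode( "UTF-16BE", $char );    # fixed big-endian, no BOM
my $le  = encode( "UTF-16LE", $char );    # fixed little-endian, no BOM
my $bom = encode( "UTF-16",   $char );    # BOM first, then big-endian data

print "UTF-16BE: ", unpack( "H*", $be ),  "\n";    # 0041
print "UTF-16LE: ", unpack( "H*", $le ),  "\n";    # 4100
print "UTF-16:   ", unpack( "H*", $bom ), "\n";    # feff0041
```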

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII.  Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors.  Note that
the table is sorted in order of ISO-8859 and the corresponding vendor
mappings are slightly different from those of ISO.  See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3
  Latin4[2]     iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics.  Now on 8859-10, except for Latvian.
  [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-* .  See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net.  L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=back

=head2 gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, mainly to store Greek characters.  There are also escape
sequences (starting with 0x1B) to cover e.g. the Euro sign.

This was once handled by L<Encode::Byte> but because of all those
unusual specifications, Encode 2.20 has relocated the support to
L<Encode::GSM0338>. See L<Encode::GSM0338> for details.

=over 2

=item gsm0338 support before 2.19

Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

   $gsm =~ s/\x00\z/\x00\x00/;
   $uni = decode("gsm0338", $gsm);
   $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
the 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).

The GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
below.  Also note that these are implemented in distinct modules per
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan).  Please refer to their respective documentation pages.

=over 2

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]            MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                      { GB12345 without CES }
  gb2312-raw                       { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
  [2] gbk is aliased to this.  See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                            [RFC1468]
  iso-2022-jp-1                                          [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                        [RFC1557]
                cp949 [1]
  iso-2022-kr                                            [RFC1557]
  johab                                  [KS X 1001:1998, Annex 3]
  ksc5601-raw                              { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
  See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5ext                                   CMEX's Big5e Extension
  big5plus                                  CMEX's Big5+ Extension
  cccii         Chinese Character Code for Information Interchange
  euc-tw                             EUC (Extended Unix Character)
  gb18030                          GBK with Traditional Characters
  ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jisx0213
  shiftjisx0213
  iso-2022-jp-3
  jis0213-1-raw
  jis0213-2-raw
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 2

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, the MIME header encoding documented in RFC 2047 is
more of an encapsulation than an encoding.  However, supporting it is
imperative in the modern world, so it is supported.

  ----------------------------------------------------------------
  MIME-Header                                            [RFC2047]
  MIME-B                                                 [RFC2047]
  MIME-Q                                                 [RFC2047]
  ----------------------------------------------------------------

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you
pick the most appropriate encoding for a piece of data out of given
I<suspects>.  See L<Encode::Guess> for details.

=back

=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties.  They may
be supported by external modules via CPAN in the future, however.

=over 2

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet.  Needs Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap.  So you
need to look up the database to determine to which character set a
given Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular.  Needs CNS 11643-1 and -2 which are not available in
this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Audrey Tang may add support for this encoding in her module in future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because mappings were available at
L<http://www.unicode.org/>).  Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
the report came too late for us to add it before 5.8.0.  In the future,
it may be available via a separate module.  See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest which are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the term (character) I<encoding> and I<character
set> interchangeably.  But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 2

=item *

First we start with which characters to include.  We call this
collection of characters I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'.  This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it.  This is called a I<coded
character set> (CCS) or I<raw character encoding>.  ASCII is used this
way for most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more.  Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it.  So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets.  7bit ISO-2022 is
an example of a CES.  You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS.  EUC is such
an example.  The CES of EUC is as follows:

=over 2

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set.  To each following byte
is added the value 0x80.

=back

By carefully looking at the encoded byte sequence, you can find that the
byte sequence corresponds to a unique number.  In that sense, EUC is a CCS
generated by a CES above from up to four CCS (complicated?).  UTF-8
falls into this category.  See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<"  ">.
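
The EUC rules above can be tried out directly.  The following is a
minimal sketch, assuming the C<jis0208-raw> and C<euc-jp> encodings from
Encode::JP listed earlier.  The helper C<euc_from_raw> is hypothetical
and only illustrates the "add 0x80 to each byte" step for JIS X 0208
characters; it is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: apply the EUC rule "add 0x80 to each byte"
# to octets that are raw JIS X 0208 codes.
sub euc_from_raw {
    my ($raw) = @_;
    return join "", map { chr( ord($_) + 0x80 ) } split //, $raw;
}

my $space = "\x{3000}";                        # IDEOGRAPHIC SPACE
my $raw   = encode( "jis0208-raw", $space );   # CCS only, no CES
my $euc   = encode( "euc-jp",      $space );   # CCS plus the EUC CES

print "raw: ", unpack( "H*", $raw ), "\n";     # 2121, same bytes as "!!"
print "euc: ", unpack( "H*", $euc ), "\n";     # a1a1
print "rule matches\n" if euc_from_raw($raw) eq $euc;

# ASCII passes through EUC unchanged, so "!!" stays distinguishable.
my $bangs = encode( "euc-jp", "!!" );
print "ascii: ", unpack( "H*", $bangs ), "\n";    # 2121
```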

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://www.alanflavell.org.uk/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. The I<raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered at IANA with the meaning of C<EUC-CN>.
This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with the
IANA registration. C<cp936> is supported separately.
The I<raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only the JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See the C<IANA> registration for
C<Windows-31J>.

As the historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back
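The aliasing described above can be checked from code. The following is a minimal sketch, assuming a stock Encode installation with the C<Encode::KR>, C<Encode::CN> and C<Encode::JP> modules available; the C<%canonical> hash is only an illustrative variable name. It asks C<find_encoding> for an encoding object and reports the canonical name that each name resolves to.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Names discussed in the list above.
my @names = ('KS_C_5601-1987', 'GB2312', 'Windows-31J');

# Map each name to the canonical name of the encoding it resolves to.
# find_encoding() returns undef when a name is not recognized.
my %canonical;
for my $name (@names) {
    my $enc = find_encoding($name);
    $canonical{$name} = $enc ? $enc->name : undef;
    printf "%-16s => %s\n", $name,
        defined $canonical{$name} ? $canonical{$name} : '(not available)';
}
```

If the aliases are set up as documented here, C<KS_C_5601-1987> should report C<cp949>, C<GB2312> should report C<euc-cn>, and C<Windows-31J> should report C<cp932>.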

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters.  A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence.  You don't
have to be able to tell which character set a given byte sequence
belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
example of something that is both a CCS and a CES.

=item charset (in MIME context)

Has long been used to mean C<encoding>, that is, a CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

 This document uses the term "charset" to mean a set of rules for
 mapping from a sequence of octets to a sequence of characters, such
 as the combination of a coded character set and a character encoding
 scheme; this is also what is used as an identifier in MIME "charset="
 parameters, and registered in the IANA charset registry ...  (Note
 that this is NOT a term used by other standards bodies, such as ISO).
 [RFC 2277]

=item EUC

Extended Unix Code.  See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII.  There is a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences, so it
cannot form a CCS.  Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mail.

The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
thereof.  Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>.  When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world.  Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>.  Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding.  Can be either big endian or little
endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back
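Some of the glossary terms are easier to see in bytes. The sketch below assumes C<Encode::JP> is installed; C<hex_octets> is a hypothetical helper written for this example, not part of Encode. It encodes one character with the two byte orders of UTF-16, a character outside the Basic Multilingual Plane to show a surrogate pair, and a hiragana letter with both the 8-bit (C<euc-jp>) and 7-bit (C<iso-2022-jp>) ISO-2022 forms.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $smiley = "\x{263A}";     # U+263A, inside the Basic Multilingual Plane
my $emoji  = "\x{1F600}";    # U+1F600, outside it, so UTF-16 uses a surrogate pair
my $hira_a = "\x{3042}";     # U+3042 HIRAGANA LETTER A

my $be   = hex_octets(encode('UTF-16BE',    $smiley));  # 26 3A
my $le   = hex_octets(encode('UTF-16LE',    $smiley));  # 3A 26
my $pair = hex_octets(encode('UTF-16BE',    $emoji));   # D8 3D DE 00
my $euc  = hex_octets(encode('euc-jp',      $hira_a));  # A4 A2
my $iso  = hex_octets(encode('iso-2022-jp', $hira_a));  # 1B 24 42 24 22 1B 28 42

print "UTF-16BE:       $be\n";
print "UTF-16LE:       $le\n";
print "surrogate pair: $pair\n";
print "euc-jp:         $euc\n";
print "iso-2022-jp:    $iso\n";
```

The C<iso-2022-jp> output is wrapped in escape sequences (C<1B 24 42> to switch to JIS X 0208 and C<1B 28 42> to switch back to ASCII), which is why the 7-bit version cannot form a CCS, while C<euc-jp> simply sets the high bit on both bytes.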

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mail messages and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.ietf.org/rfc.html>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back
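As noted under the IANA entry above, a charset string taken from a MIME header can usually be handed to Encode as is. The following is a rough sketch under stated assumptions: the header value and body bytes are made up for illustration, and the regular expression is a simplification that does not handle every legal C<Content-Type> syntax.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical header value and body octets, for illustration only.
my $content_type = 'text/plain; charset="ISO-8859-1"';
my $body_bytes   = "caf\xE9";

# Simplified extraction of the charset parameter (quoted or unquoted).
my ($charset) = $content_type =~ /charset\s*=\s*"?([^";\s]+)"?/i;

# Pass the extracted string straight to Encode; undef means "not supported".
my $enc = defined $charset ? find_encoding($charset) : undef;
die "unsupported or missing charset\n" unless $enc;

# Decode the octets into a Perl character string.
my $text = $enc->decode($body_bytes);
printf "charset %s resolved to %s, %d characters decoded\n",
    $charset, $enc->name, length $text;
```

Checking the return value of C<find_encoding> before decoding keeps an unknown or misspelled charset from turning into a fatal error deep inside the decoding call.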

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>

Somewhat obsolete (last update in 1996), but still useful.  Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description for most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN: 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://oreilly.com/catalog/9780596514471/>
or your favourite bookstore.

=back

=cut