=head1 NAME

Encode::PerlIO -- a detailed document on Encode and PerlIO

=head1 Overview

It is very common to want to do encoding transformations when
reading or writing files, network connections, pipes etc.
If Perl is configured to use the new 'perlio' IO system then
C<Encode> provides a "layer" (see L<PerlIO>) which can transform
data as it is read or written.

Here is how the blind poet would modernise the encoding:

    use Encode;
    open(my $iliad,'<:encoding(iso-8859-7)','iliad.greek') or die $!;
    open(my $utf8,'>:utf8','iliad.utf8') or die $!;
    my @epic = <$iliad>;
    print $utf8 @epic;
    close($utf8);
    close($iliad);

In addition, the new IO system can also be configured to read/write
UTF-8 encoded characters (this is efficient):

    open(my $fh,'>:utf8','anything') or die $!;
    print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n";

Either of the above forms of "layer" specifications can be made the default
for a lexical scope with the C<use open ...> pragma. See L<open>.

Once a handle is open, its layers can be altered using C<binmode>.
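
As a minimal sketch of the two mechanisms just mentioned (it is not one
of the original examples): the block below sets a default output layer
with C<use open> inside a lexical scope, then adds a layer to an
already-open handle with C<binmode>.  The scratch file from
C<File::Temp> and all variable names are assumptions made only for
illustration.

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Scratch file; File::Temp is used only to keep the sketch runnable.
    my (undef, $path) = tempfile(UNLINK => 1);

    {
        # Every output open() in this block goes through :encoding(UTF-8).
        use open OUT => ':encoding(UTF-8)';
        open(my $out, '>', $path) or die "$path: $!";
        print {$out} "\x{3000}\n";
        close($out) or die "$path: $!";
    }

    # Outside that scope a plain open yields bytes, so add the layer later.
    open(my $in, '<', $path) or die "$path: $!";
    binmode($in, ':encoding(UTF-8)');
    my $line = <$in>;
    close($in);

    printf "read %d character(s)\n", length($line) - 1;

The three octets written for \x{3000} come back as a single character
because the layer added by C<binmode> decodes them on the way in.
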


Without any such configuration, or if Perl itself is built using the
system's own IO, then write operations assume that the file handle
accepts only I<bytes>, and Perl warns (C<Wide character in print>) if a
character larger than 255 is written to the handle. When reading, each
octet from the handle becomes a byte-in-a-character. Note that this
default is the same behaviour as bytes-only languages (including Perl
before v5.6) would have, and is sufficient to handle native 8-bit
encodings e.g. iso-8859-1, EBCDIC etc. and any legacy mechanisms for
handling other encodings and binary data.

In other cases, it is the program's responsibility to transform
characters into bytes using the C<Encode> API (see L<Encode>) before
doing writes, and to transform the bytes read from a handle into
characters before doing "character operations" (e.g. C<lc>, C</\W+/>, ...).
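
When no layer is in place, that responsibility can be sketched as
follows, assuming UTF-8 as the external encoding and an in-memory
scalar standing in for a real file (both are choices made for this
illustration).  C<encode> turns characters into bytes before the write,
and C<decode> turns the bytes back into characters before C<lc> is
applied.

    use strict;
    use warnings;
    use Encode qw(encode decode);

    my $chars  = "\x{3a9}MEGA";             # capital omega + "MEGA"
    my $octets = encode('UTF-8', $chars);   # characters -> bytes

    # Write and read back through byte-only handles.
    my $store = '';
    open(my $out, '>', \$store) or die $!;
    binmode($out);
    print {$out} $octets;
    close($out);

    open(my $in, '<', \$store) or die $!;
    binmode($in);
    my $raw = do { local $/; <$in> };
    close($in);

    my $text  = decode('UTF-8', $raw);      # bytes -> characters
    my $lower = lc($text);                  # a "character operation"
    printf "%d bytes, %d characters\n", length($raw), length($text);

Calling C<lc> on C<$raw> instead would lowercase only the ASCII letters,
because the two octets of the omega are not seen as one character.
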


You can also use PerlIO to convert larger amounts of data you don't
want to bring into memory.  For example, to convert between ISO-8859-1
(Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines):

    open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;
    open(G, ">:utf8",                 "data.utf") or die $!;
    while (<F>) { print G }

    # Could also do "print G <F>" but that would pull
    # the whole file into memory just to write it out again.

More examples:

    open(my $f, "<:encoding(cp1252)")
    open(my $g, ">:encoding(iso-8859-2)")
    open(my $h, ">:encoding(latin9)")       # iso-8859-15

See also L<encoding> for how to change the default encoding of the
data in your script.

=head1 How does it work?

Here is a crude diagram of how filehandle, PerlIO, and Encode
interact.

  filehandle <-> PerlIO        PerlIO <-> scalar (read/printed)
                       \      /
                        Encode

When PerlIO receives data from either direction, it fills a buffer
(currently with 1024 bytes) and passes the buffer to Encode.
Encode tries to convert the valid part and passes it back to PerlIO,
leaving invalid parts (usually a partial character) in the buffer.
PerlIO then appends more data to the buffer, calls Encode again,
and so on until the data stream ends.

To do so, PerlIO always calls (de|en)code methods with CHECK set to 1.
This ensures that the method stops at the right place when it
encounters a partial character.  The following is what happens when
PerlIO and Encode try to encode (from utf8) more than 1024 bytes
and the buffer boundary happens to be in the middle of a character.

   A   B   C   ....   ~     \x{3000}    ....
  41  42  43   ....  7E   e3   80   80  ....
  <- buffer --------------->
  << encoded >>>>>>>>>>
                       <- next buffer ------

Encode converts from the beginning to \x7E, leaving \xe3 in the buffer
because it is invalid (partial character).
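
The same stop-and-resume behaviour can be observed from plain Perl.
The sketch below is not what PerlIO runs internally; it assumes
C<Encode::FB_QUIET> as the CHECK value, which makes C<decode> return
what it could convert and leave the unconverted octets in the buffer.
The two hard-coded chunks are hypothetical stand-ins for consecutive
buffer fills.

    use strict;
    use warnings;
    use Encode qw(decode);

    # "AB~" followed by U+3000 (e3 80 80), cut after the first octet
    # of U+3000 to imitate a buffer boundary inside a character.
    my @chunks = ("AB~\xe3", "\x80\x80");

    my ($buffer, $text) = ('', '');
    my @leftover;
    for my $chunk (@chunks) {
        $buffer .= $chunk;
        # decode() consumes the valid prefix and leaves the rest in $buffer.
        $text .= decode('UTF-8', $buffer, Encode::FB_QUIET);
        push @leftover, unpack('H*', $buffer);
    }

    print "leftover after chunk: [$_]\n" for @leftover;

After the first chunk the buffer still holds C<e3>; after the second
chunk it is empty and C<$text> contains all four characters.
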


Unfortunately, this scheme does not work well with escape-based
encodings such as ISO-2022-JP.

=head1 Line Buffering

Now let's see what happens when you try to decode from ISO-2022-JP and
the buffer ends in the middle of a character.

              JIS208-ESC   \x{5f3e}
   A   B   C   ....   ~   \e   $   B  |DAN | ....
  41  42  43   ....  7E   1b  24  42  43  46 ....
  <- buffer --------------------------->
  << encoded >>>>>>>>>>>>>>>>>>>>>>>

As you see, the next buffer begins with \x43.  But \x43 is 'C' in
ASCII, which is wrong in this case because we are now in the JIS X 0208
area, so it has to convert \x43\x46, not \x43.  Unlike utf8 and EUC,
in escape-based encodings you can't tell if a given octet is a whole
character or just part of it.
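
To see the ambiguity in actual octets, the following sketch encodes the
letter C followed by \x{5f3e} as ISO-2022-JP and dumps the result in
hex.  It assumes that the C<iso-2022-jp> encoding shipped with Encode
is available; the variable names are illustrative.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $octets = encode('iso-2022-jp', "C\x{5f3e}");

    # One two-digit hex value per octet, separated by spaces.
    my $hex = join ' ', map { sprintf '%02x', ord } split //, $octets;
    print "$hex\n";

In the output, the octet 43 appears once before the escape sequence
1b 24 42, where it stands for 'C', and once after it, where it is the
first half of the two-octet character 43 46.
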


Fortunately PerlIO also supports line buffering if you tell PerlIO to
use it instead of a fixed-size buffer.  Since ISO-2022-JP is guaranteed
to revert to ASCII at the end of the line, a partial character will
never occur when line buffering is used.

To tell PerlIO to use line buffering, implement a C<< ->needs_lines >>
method for your encoding object.  See L<Encode::Encoding> for details.
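
A minimal sketch of such an encoding object follows.  The package name
C<My::LineEncoding> and the encoding name C<x-my-line-enc> are
hypothetical, and the conversion methods merely delegate to iso-8859-1
as a placeholder; only the C<needs_lines> override is the point here.

    package My::LineEncoding;               # hypothetical class name
    use strict;
    use warnings;
    use Encode ();
    use parent qw(Encode::Encoding);

    __PACKAGE__->Define('x-my-line-enc');   # hypothetical encoding name

    # Ask PerlIO to hand over whole lines instead of fixed-size buffers.
    sub needs_lines { 1 }

    # Placeholder conversions: this sketch just delegates to iso-8859-1.
    sub decode {
        my ($self, $octets, $check) = @_;
        my $string = Encode::decode('iso-8859-1', $octets);
        $_[1] = '' if $check;               # mark the input as consumed
        return $string;
    }

    sub encode {
        my ($self, $string, $check) = @_;
        my $octets = Encode::encode('iso-8859-1', $string);
        $_[1] = '' if $check;               # mark the input as consumed
        return $octets;
    }

    package main;

    my $enc = Encode::find_encoding('x-my-line-enc');
    print "needs_lines: ", ($enc->needs_lines ? "yes" : "no"), "\n";

A real escape-based encoding would replace the two placeholder methods
with its own conversion and keep the C<needs_lines> override.
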


Thanks to these efforts, most encodings that come with Encode support
PerlIO, but that still leaves the following encodings:

  iso-2022-kr
  MIME-B
  MIME-Header
  MIME-Q

Fortunately iso-2022-kr is hardly used (according to Jungshik) and
MIME-* are very unlikely to be fed to PerlIO because they are for mail
headers.  See L<Encode::MIME::Header> for details.

=head2 How can I tell whether my encoding fully supports PerlIO?

As of this writing, any encoding whose class belongs to Encode::XS or
Encode::Unicode works.  The Encode module has a C<perlio_ok> method
which you can use before applying PerlIO encoding to the filehandle.
Here is an example:

  use Encode qw(decode perlio_ok);

  my $use_perlio = perlio_ok($enc);
  # Use the encoding layer when it is supported; otherwise read raw
  # bytes and decode each line by hand.
  my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw";
  open my $fh, $layer, $file or die "$file : $!";
  while(<$fh>){
    $_ = decode($enc, $_) unless $use_perlio;
    # ....
  }

=head1 SEE ALSO

L<Encode::Encoding>,
L<Encode::Supported>,
L<encoding>,
L<perlebcdic>,
L<perlfunc/open>,
L<perlunicode>,
L<utf8>,
the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt>

=cut