Tree - source-git/perl-Encode - CentOS Git server

Blame lib/Encode/PerlIO.pod

Packit	d0f5c2	`=head1 NAME`
Packit	d0f5c2
Packit	d0f5c2	`Encode::PerlIO -- a detailed document on Encode and PerlIO`
Packit	d0f5c2
Packit	d0f5c2	`=head1 Overview`
Packit	d0f5c2
Packit	d0f5c2	`It is very common to want to do encoding transformations when`
Packit	d0f5c2	`reading or writing files, network connections, pipes etc.`
Packit	d0f5c2	`If Perl is configured to use the new 'perlio' IO system then`
Packit	d0f5c2	`C<Encode> provides a "layer" (see L<PerlIO>) which can transform`
Packit	d0f5c2	`data as it is read or written.`
Packit	d0f5c2
Packit	d0f5c2	`Here is how the blind poet would modernise the encoding:`
Packit	d0f5c2
Packit	d0f5c2	`    use Encode;`
Packit	d0f5c2	`    open(my $iliad,'<:encoding(iso-8859-7)','iliad.greek') or die $!;`
Packit	d0f5c2	`    open(my $utf8,'>:utf8','iliad.utf8') or die $!;`
Packit	d0f5c2	`    my @epic = <$iliad>;`
Packit	d0f5c2	`    print $utf8 @epic;`
Packit	d0f5c2	`    close($utf8);`
Packit	d0f5c2	`    close($iliad);`
Packit	d0f5c2
Packit	d0f5c2	`In addition, the new IO system can also be configured to read/write`
Packit	d0f5c2	`UTF-8 encoded characters (this is efficient, since UTF-8 is already the`
Packit	d0f5c2	`basis of Perl's internal character representation):`
Packit	d0f5c2
Packit	d0f5c2	`    use charnames ':full';    # needed for \N{...} before Perl 5.16`
Packit	d0f5c2	`    open(my $fh,'>:utf8','anything') or die $!;`
Packit	d0f5c2	`    print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n";`
Packit	d0f5c2
Packit	d0f5c2	`Either of the above forms of "layer" specifications can be made the default`
Packit	d0f5c2	`for a lexical scope with the C<use open ...> pragma. See L<open>.`
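As a rough illustration of the C<use open> pragma, the sketch below sets a default layer for one block and then checks what reached the disk. The scratch file from C<File::Temp> and the variable names are assumptions made for this example only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this sketch.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my @layers;
{
    # Default layers for every open() in this block that names no layer.
    use open IO => ':encoding(UTF-8)';

    open(my $out, '>', $path) or die "$path: $!";
    @layers = PerlIO::get_layers($out);   # layer stack now includes an encoding layer
    print $out "\x{3b1}\n";               # GREEK SMALL LETTER ALPHA
    close($out) or die "$path: $!";
}

# Outside the block the pragma no longer applies.
my $size = -s $path;    # on a Unix-like system: 3 octets (CE B1 0A)
print "layers: @layers\nsize: $size\n";
```

Because the pragma is lexically scoped, an C<open> outside the braces would get the platform's default layers again.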
Packit	d0f5c2
Packit	d0f5c2	`Once a handle is open, its layers can be altered using C<binmode>.`
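A minimal sketch of changing layers on an already open handle with C<binmode>; the scratch file and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this sketch.
my ($fh, $path) = tempfile(UNLINK => 1);

binmode($fh, ':raw');
print {$fh} "caf\xe9\n";             # 5 octets, E9 for the e-acute

binmode($fh, ':encoding(UTF-8)');    # push an encoding layer onto the open handle
print {$fh} "caf\xe9\n";             # 6 octets, C3 A9 for the e-acute
close($fh) or die "$path: $!";

my $size = -s $path;                 # 11 octets in total
print "size: $size\n";
```

The same string produces different octets before and after the second C<binmode> call, which is the effect the sentence above describes.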
Packit	d0f5c2
Packit	d0f5c2	`Without any such configuration, or if Perl itself is built using the`
Packit	d0f5c2	`system's own IO, then write operations assume that the file handle`
Packit	d0f5c2	`accepts only I<bytes>. Writing a character larger than 255 to such a`
Packit	d0f5c2	`handle triggers a C<Wide character> warning, and the character goes out`
Packit	d0f5c2	`as its UTF-8 bytes. When reading, each octet from the handle becomes`
Packit	d0f5c2	`a byte-in-a-character. Note that this default is the same behaviour`
Packit	d0f5c2	`as bytes-only languages (including Perl before v5.6) would have,`
Packit	d0f5c2	`and is sufficient to handle native 8-bit encodings e.g. iso-8859-1,`
Packit	d0f5c2	`EBCDIC etc. and any legacy mechanisms for handling other encodings`
Packit	d0f5c2	`and binary data.`
Packit	d0f5c2
Packit	d0f5c2	`In other cases, it is the program's responsibility to transform`
Packit	d0f5c2	`characters into bytes using the C<Encode> API (see L<Encode>) before`
Packit	d0f5c2	`doing writes, and to transform the bytes read from a handle into`
Packit	d0f5c2	`characters before doing "character operations" (e.g. C<lc>, C</\W+/>, ...).`
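The manual route can be sketched as follows, using C<decode> and C<encode> from C<Encode>. The sample octet and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xc1";                          # capital alpha in iso-8859-7
my $chars  = decode('iso-8859-7', $octets);   # one character, U+0391
my $lower  = lc $chars;                       # character operation, gives U+03B1
my $out    = encode('iso-8859-7', $lower);    # back to one octet, E1

printf "U+%04X -> U+%04X -> 0x%02X\n", ord $chars, ord $lower, ord $out;
```

Running C<lc> on the undecoded octet would not apply the Greek case mapping, which is why the decode step comes first.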
Packit	d0f5c2
Packit	d0f5c2	`You can also use PerlIO to convert larger amounts of data you don't`
Packit	d0f5c2	`want to bring into memory. For example, to convert between ISO-8859-1`
Packit	d0f5c2	`(Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines):`
Packit	d0f5c2
Packit	d0f5c2	`    open(my $in,  "<:encoding(iso-8859-1)", "data.txt") or die $!;`
Packit	d0f5c2	`    open(my $out, ">:utf8",                 "data.utf") or die $!;`
Packit	d0f5c2	`    while (<$in>) { print {$out} $_ }`
Packit	d0f5c2
Packit	d0f5c2	`    # Could also do "print {$out} <$in>" but that would pull`
Packit	d0f5c2	`    # the whole file into memory just to write it out again.`
Packit	d0f5c2
Packit	d0f5c2	`More examples:`
Packit	d0f5c2
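Packit	d0f5c2	`    # $file is a placeholder for the path you want to open`
Packit	d0f5c2	`    open(my $f, "<:encoding(cp1252)",     $file) or die $!;`
Packit	d0f5c2	`    open(my $g, ">:encoding(iso-8859-2)", $file) or die $!;`
Packit	d0f5c2	`    open(my $h, ">:encoding(latin9)",     $file) or die $!;   # iso-8859-15`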
Packit	d0f5c2
Packit	d0f5c2	`See also L<encoding> for how to change the default encoding of the`
Packit	d0f5c2	`data in your script.`
Packit	d0f5c2
Packit	d0f5c2	`=head1 How does it work?`
Packit	d0f5c2
Packit	d0f5c2	`Here is a crude diagram of how filehandle, PerlIO, and Encode`
Packit	d0f5c2	`interact.`
Packit	d0f5c2
Packit	d0f5c2	`    filehandle <-> PerlIO        PerlIO <-> scalar (read/printed)`
Packit	d0f5c2	`                          \    /`
Packit	d0f5c2	`                          Encode`
Packit	d0f5c2
Packit	d0f5c2	`When PerlIO receives data from either direction, it fills a buffer`
Packit	d0f5c2	`(currently with 1024 bytes) and passes the buffer to Encode.`
Packit	d0f5c2	`Encode tries to convert the valid part and passes it back to PerlIO,`
Packit	d0f5c2	`leaving invalid parts (usually a partial character) in the buffer.`
Packit	d0f5c2	`PerlIO then appends more data to the buffer, calls Encode again,`
Packit	d0f5c2	`and so on until the data stream ends.`
Packit	d0f5c2
Packit	d0f5c2	`To do so, PerlIO always calls (de|en)code methods with CHECK set to 1.`
Packit	d0f5c2	`This ensures that the method stops at the right place when it`
Packit	d0f5c2	`encounters a partial character. The following is what happens when`
Packit	d0f5c2	`PerlIO and Encode try to encode (from utf8) more than 1024 bytes`
Packit	d0f5c2	`and the buffer boundary happens to be in the middle of a character.`
Packit	d0f5c2
Packit	d0f5c2	`    A   B   C   ....   ~     \x{3000}    ....`
Packit	d0f5c2	`   41  42  43   ....  7E   e3   80   80  ....`
Packit	d0f5c2	`   <- buffer --------------->`
Packit	d0f5c2	`   << encoded >>>>>>>>>>`
Packit	d0f5c2	`                           <- next buffer ------`
Packit	d0f5c2
Packit	d0f5c2	`Encode converts from the beginning to \x7E, leaving \xe3 in the buffer`
Packit	d0f5c2	`because it is invalid (partial character).`
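The stop-at-a-partial-character behaviour can be reproduced by hand. The sketch below follows the fixed-buffer loop given in the C<Encode> documentation and uses C<Encode::FB_QUIET> as the CHECK value; that choice is an assumption for illustration and is not necessarily the exact flag combination PerlIO passes internally. The two-chunk split stands in for a buffer boundary that falls after C<\xe3>.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = encode('UTF-8', "ABC~\x{3000}");   # 41 42 43 7E e3 80 80
my @chunks = (substr($octets, 0, 5), substr($octets, 5));   # split after e3

my ($buffer, $string) = ('', '');
my @leftover;
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # FB_QUIET returns what could be decoded and leaves the rest in $buffer.
    $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
    push @leftover, length $buffer;
}

print "octets left after each pass: @leftover\n";
```

After the first pass one octet (C<\xe3>) stays in the buffer; the second pass completes the character and empties it.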
Packit	d0f5c2
Packit	d0f5c2	`Unfortunately, this scheme does not work well with escape-based`
Packit	d0f5c2	`encodings such as ISO-2022-JP.`
Packit	d0f5c2
Packit	d0f5c2	`=head1 Line Buffering`
Packit	d0f5c2
Packit	d0f5c2	`Now let's see what happens when you try to decode from ISO-2022-JP and`
Packit	d0f5c2	`the buffer ends in the middle of a character.`
Packit	d0f5c2
Packit	d0f5c2	`                           JIS208-ESC  \x{5f3e}`
Packit	d0f5c2	`    A   B   C   ....   ~   \e   $   B  |DAN | ....`
Packit	d0f5c2	`   41  42  43   ....  7E   1b  24  42  43  46 ....`
Packit	d0f5c2	`   <- buffer --------------------------->`
Packit	d0f5c2	`   << encoded >>>>>>>>>>>>>>>>>>>>>>>`
Packit	d0f5c2
Packit	d0f5c2	`As you see, the next buffer begins with \x43. But \x43 is 'C' in`
Packit	d0f5c2	`ASCII, which is wrong in this case because we are now in the JIS X 0208`
Packit	d0f5c2	`area, so it has to convert \x43\x46, not \x43. Unlike utf8 and EUC,`
Packit	d0f5c2	`in escape-based encodings you can't tell if a given octet is a whole`
Packit	d0f5c2	`character or just part of it.`
Packit	d0f5c2
Packit	d0f5c2	`Fortunately PerlIO also supports line buffering if you tell it to use`
Packit	d0f5c2	`a line buffer instead of a fixed-size buffer. Since ISO-2022-JP is`
Packit	d0f5c2	`guaranteed to revert to ASCII at the end of the line, a partial`
Packit	d0f5c2	`character will never occur when a line buffer is used.`
Packit	d0f5c2
Packit	d0f5c2	`To tell PerlIO to use a line buffer, implement the -E<gt>needs_lines`
Packit	d0f5c2	`method for your encoding object. See L<Encode::Encoding> for details.`
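To see which encodings ask for line buffering, the method can be queried on the encoding objects returned by C<find_encoding>. This is only a small sketch; the two encoding names are picked for contrast, and a class of your own would typically just define C<sub needs_lines { 1 }>.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my %needs;
for my $name (qw(iso-2022-jp iso-8859-1)) {
    my $enc = find_encoding($name) or die "unknown encoding: $name";
    # True means PerlIO should hand this encoding whole lines.
    $needs{$name} = $enc->needs_lines ? 1 : 0;
    printf "%-12s needs_lines=%s\n", $name, $needs{$name} ? 'yes' : 'no';
}
```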
Packit	d0f5c2
Packit	d0f5c2	`Thanks to these efforts, most encodings that come with Encode support`
Packit	d0f5c2	`PerlIO, but that still leaves the following encodings unsupported:`
Packit	d0f5c2
Packit	d0f5c2	`iso-2022-kr`
Packit	d0f5c2	`MIME-B`
Packit	d0f5c2	`MIME-Header`
Packit	d0f5c2	`MIME-Q`
Packit	d0f5c2
Packit	d0f5c2	`Fortunately iso-2022-kr is hardly used (according to Jungshik) and`
Packit	d0f5c2	`MIME-* are very unlikely to be fed to PerlIO because they are for mail`
Packit	d0f5c2	`headers. See L<Encode::MIME::Header> for details.`
Packit	d0f5c2
Packit	d0f5c2	`=head2 How can I tell whether my encoding fully supports PerlIO?`
Packit	d0f5c2
Packit	d0f5c2	`As of this writing, any encoding whose class belongs to Encode::XS or`
Packit	d0f5c2	`Encode::Unicode works. The Encode module has a C<perlio_ok> function`
Packit	d0f5c2	`(exported on request) which you can call before applying a PerlIO`
Packit	d0f5c2	`encoding layer to the filehandle. Here is an example:`
Packit	d0f5c2
Packit	d0f5c2	`    use Encode qw(decode perlio_ok);`
Packit	d0f5c2	`    my $use_perlio = perlio_ok($enc);`
Packit	d0f5c2	`    # use the encoding layer when it is safe, otherwise read raw octets`
Packit	d0f5c2	`    my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw";`
Packit	d0f5c2	`    open my $fh, $layer, $file or die "$file : $!";`
Packit	d0f5c2	`    while(<$fh>){`
Packit	d0f5c2	`        $_ = decode($enc, $_) unless $use_perlio;`
Packit	d0f5c2	`        # ....`
Packit	d0f5c2	`    }`
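As a quick check of the list given earlier, C<perlio_ok> can be called on a few encoding names. The selection of names is an assumption for illustration, and the results depend on the installed Encode version.

```perl
use strict;
use warnings;
use Encode qw(perlio_ok);

# Map each encoding name to 1 (usable as a PerlIO layer) or 0.
my %ok = map { $_ => (perlio_ok($_) ? 1 : 0) }
         qw(iso-8859-1 iso-2022-kr MIME-Header);

printf "%-12s %s\n", $_, $ok{$_} ? 'ok' : 'not ok' for sort keys %ok;
```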
Packit	d0f5c2
Packit	d0f5c2	`=head1 SEE ALSO`
Packit	d0f5c2
Packit	d0f5c2	`L<Encode::Encoding>,`
Packit	d0f5c2	`L<Encode::Supported>,`
Packit	d0f5c2	`L<Encode::PerlIO>,`
Packit	d0f5c2	`L<encoding>,`
Packit	d0f5c2	`L<perlebcdic>,`
Packit	d0f5c2	`L<perlfunc/open>,`
Packit	d0f5c2	`L<perlunicode>,`
Packit	d0f5c2	`L<utf8>,`
Packit	d0f5c2	`the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt>`
Packit	d0f5c2
Packit	d0f5c2	`=cut`
Packit	d0f5c2
