=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and being
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in memory processing can be faster and lead to
simpler code than line by line if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that reasonable size
these days is larger than the bad old days of limited RAM. Slurping in a
megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line is speed. Perl's IO
system (like many others) is slow. Calling C<< <> >> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. Plenty of work for each line
read in. Whereas slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping is traditionally a read operation, I use the term ``slurp'' for
the concept of doing I/O with an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of C<//g>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the C</m> modifier), the '=' char and the text to the end of the
line (again, C</m> makes that work). In fact the ending C<$> is not even needed
since C<.> will not normally match a newline. Since the key and value are
grabbed and the C<m//> is in list context with the C</g> modifier, it will
grab all key/value pairs and return them. The C<%config> hash will be
assigned this list and now you have the file fully parsed into a hash.
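
To see that single parse at work, here is a minimal sketch that runs it
on an in-memory string instead of a slurped file. The sample keys and
values are made up for illustration; lines that don't look like
C<key=value> are simply skipped by the match.

```perl
use strict ;
use warnings ;

# Stand-in for a slurped config file; in the article this text
# would come from read_file( $file ).
my $text = <<'CONFIG' ;
name=slurp
mode=fast
path=/tmp/data dir
CONFIG

# One match in list context with /g returns every ( key, value )
# pair, and assigning that list to a hash builds the lookup table.
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```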

Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

Just supply a C<template> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the C</s> modifier. This is
something that is much trickier with line by line processing.
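
As a rough illustration of what such a C<template> sub could look like,
the sketch below repeats the chunk once per record and fills in simple
C<< <%key%> >> markers. The C<%chunk_data> hash, its contents and the
marker names are assumptions made up for this example; they are not part
of any module.

```perl
use strict ;
use warnings ;

# Hypothetical data for the sketch: each chunk name maps to a list
# of records used to fill in that chunk.
my %chunk_data = (
    ROW => [ { name => 'foo', size => 10 }, { name => 'bar', size => 20 } ],
) ;

# Expand one chunk: repeat its body once per record, replacing
# simple <%key%> markers from that record.
sub template {
    my( $chunk_name, $body ) = @_ ;

    my $out = '' ;
    foreach my $record ( @{ $chunk_data{$chunk_name} || [] } ) {
        my $copy = $body ;
        $copy =~ s/<%(\w+)%>/ defined $record->{$1} ? $record->{$1} : '' /ge ;
        $out .= $copy ;
    }
    return $out ;
}

my $text = "Files:\n<%ROW_START%><%name%>: <%size%>\n<%ROW_END%>Done\n" ;

$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

print $text ;
```

A chunk name with no data expands to nothing, which is one possible
policy; dying on an unknown name would be another.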

Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.

Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as C<:>:

    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.
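
Here is a small sketch of that line index idea. It assumes passwd-style
records (the sample lines are invented) and builds a hash from login
name to array position; the C<chomp> is an addition to the parse shown
above so the last field doesn't keep its newline.

```perl
use strict ;
use warnings ;

# Stand-in for the lines returned by a list-context slurp of a
# passwd-style file.
my @lines = (
    "root:x:0:0:root:/root:/bin/sh\n",
    "uri:x:1000:1000:Uri:/home/uri:/bin/bash\n",
    "guest:x:1001:1001:Guest:/home/guest:/bin/false\n",
) ;

# One array of fields per line, with the trailing newline removed.
my @pw_fields = map { chomp( my $line = $_ ) ; [ split /:/, $line ] } @lines ;

# Index from login name to position in @pw_fields, so a lookup does
# not have to scan every record.
my %index_by_name ;
$index_by_name{ $pw_fields[$_][0] } = $_ for 0 .. $#pw_fields ;

my $record = $pw_fields[ $index_by_name{'uri'} ] ;
print "home of uri is $record->[5]\n" ;
```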

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial, just call the C<< <> >> operator
in a list context:

    my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable C<$/> (the input record separator) to the undefined value and read
in the file with C<< <> >>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when
the scope exits it will revert C<$/> back to its previous value (most
likely "\n").

Here is a Perl idiom that allows the C<$text> variable to be declared,
and there is no need for a tightly nested block. The C<do> block will
execute C<< <FH> >> in a scalar context and slurp in the file named by
C<$file>:

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in C<$file> is assigned to a localized C<@ARGV> and the
null filehandle is used which reads the data from the files in C<@ARGV>.

Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file but split into lines (using C<$/> as the
end of line marker).

There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (C<join> provides a
list context to C<< <FH> >>) and then joins up those lines again. The
original coder of this idiom obviously never read I<perlvar> and learned
how to use C<$/> to allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'' which has a
stronger visual image :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:

    sub burp {
        my( $file_name ) = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes @_ directly to
print. We will look at faster variations of that later on.
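
For a quick round trip, the sketch below writes a couple of lines with a
burp sub and slurps them back. It is a variation on the sub above, not a
replacement: the three-argument C<open>, the explicit C<close> and the
scratch directory from C<File::Temp> are additions for the example.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Variation on the article's burp sub using three-argument open.
sub burp {
    my $file_name = shift ;
    open( my $fh, '>', $file_name ) ||
        die "can't create $file_name $!" ;
    print $fh @_ ;
    close( $fh ) || die "can't close $file_name $!" ;
}

# Scratch directory that is removed when the program exits.
my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/burp_test.txt" ;

# Either a single scalar or a list of lines can be passed.
burp( $file, "line one\n", "line two\n" ) ;

# Read it back with the scalar slurp idiom to check the result.
my $text = do { local( @ARGV, $/ ) = $file ; <> } ;
print $text ;
```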

=head2 Slurp on the CPAN

As you would expect there are modules in the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The subroutine C<slurp()> uses the magic undefined value of C<$/> and
the magic file handle C<ARGV> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. And the C<to_array()> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking C<wantarray>. It has 'slurp' in
C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.

namespace.>

The original File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subroutines including C<read_file()> (more
on the others later). C<read_file()> behaves similarly to
C<Slurp::slurp()> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of C<$/> for scalar slurping but it uses an explicit
open call rather than using a localized C<@ARGV> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor C<[]>.
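
A minimal sketch of that rectification follows. The C<read_file()> here
is a cut-down stand-in with the same context behavior as the code above
(it is not the module's actual source), and the scratch file and its
contents are invented for the example.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Minimal stand-in with the same context behaviour as the
# read_file() shown above (not the module's actual code).
sub read_file {
    my( $file ) = @_ ;
    local( $/ ) = wantarray ? $/ : undef ;
    open( my $fh, '<', $file ) || die "open $file: $!" ;
    return <$fh> ;
}

# Scratch file with three lines.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "one\n", "two\n", "three\n" ;
close( $tmp_fh ) || die "close $file: $!" ;

# The [] constructor supplies list context, so read_file() returns
# lines and the brackets wrap them in an anonymous array.
my $lines_ref = [ read_file( $file ) ] ;

# Plain scalar assignment still slurps the whole file as one string.
my $text = read_file( $file ) ;

print scalar( @{$lines_ref} ), " lines, ", length( $text ), " bytes\n" ;
```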

Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic C<$/> to slurp in scalar mode and the
natural behavior of C<< <> >> in list context to slurp as lines. But
neither is optimized for speed nor can they handle C<binmode()> to
support binary or Unicode files. See below for more on slurp features
and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
C<binmode()>. This section will cover various API design issues such as
efficient return by reference, C<binmode()> and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.

I have used my own C<read_file()> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via C<return>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
sub will do that by default in the appropriate context. If you want
C<read_file()> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the C<read_file()> subroutine.

The next area to cover is what to name the slurp sub. I will go with
C<read_file()>. It is descriptive, keeps compatibility with the
current simple API and doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.

Another critical area when designing APIs is how to pass in
arguments. The C<read_file()> subroutine takes one required argument
which is the file name. To support C<binmode()> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with C<undef> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf'} || \$buf ;
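
One way the rest of such a sub might be filled in is sketched below.
Only the C<buf> key appears in the fragment above; the C<binmode> key
name, the traditional C<$/> slurp in the middle and the return
conventions are assumptions for illustration, not the module's actual
code.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Sketch only: the 'buf' key comes from the article's fragment; the
# 'binmode' key name and the return conventions are assumptions.
sub read_file {

    my( $file_name, %args ) = @_ ;

    my $buf ;
    my $buf_ref = $args{'buf'} || \$buf ;

    open( my $fh, '<', $file_name ) || die "can't open $file_name: $!" ;
    binmode( $fh, $args{'binmode'} ) if $args{'binmode'} ;

    # Scalar slurp straight into whichever buffer was chosen.
    ${$buf_ref} = do { local( $/ ) ; <$fh> } ;

    # The caller's buffer is already filled, so there is nothing to copy.
    return if $args{'buf'} ;

    # Otherwise honour the caller's context.
    return wantarray ? split( /^/, ${$buf_ref} ) : ${$buf_ref} ;
}

# Scratch file for the three calling styles.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "alpha\n", "beta\n" ;
close( $tmp_fh ) || die "close $file: $!" ;

my $text  = read_file( $file ) ;
my @lines = read_file( $file, binmode => ':raw' ) ;

my $buffer ;
read_file( $file, buf => \$buffer ) ;
```

The named arguments can be given in any order after the file name, which
is the point of not making them positional.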

The other sub (C<read_file_lines()>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of C<read_file()>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus with keeping the API the same for both subs will be seen
in how the two subs are optimized to work together.

Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that it needs to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the C<read_file> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments is the data list. So the C<write_file()>
subroutine will start off like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in C<@_> to minimize any more copying. You call C<write_file()> like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;
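
To show how the pieces could fit together, here is a hedged sketch of a
complete C<write_file()>. The C<append> and C<binmode> option names come
from the calls above, but the way they are applied (and the use of plain
C<print>) is an assumption for the example rather than the module's real
implementation.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Sketch only: 'binmode' and 'append' are the option names used in
# the article's calls; how they are applied here is an assumption.
sub write_file {

    my $file_name = shift ;

    my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

    my $mode = $args->{'append'} ? '>>' : '>' ;
    open( my $fh, $mode, $file_name ) ||
        die "can't open $file_name: $!" ;
    binmode( $fh, $args->{'binmode'} ) if $args->{'binmode'} ;

    # The data list is still in @_, so it is printed without a copy.
    print $fh @_ ;
    close( $fh ) || die "can't close $file_name: $!" ;
}

my $dir = tempdir( CLEANUP => 1 ) ;

write_file( "$dir/junk", "first\n" ) ;
write_file( "$dir/junk", { append => 1 }, "second\n", "third\n" ) ;
write_file( "$dir/foo", { binmode => ':raw' }, "raw data\n" ) ;

my $junk = do { local( @ARGV, $/ ) = "$dir/junk" ; <> } ;
print $junk ;
```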

=head2 Fast Slurping

Somewhere along the line, I learned about a way to slurp files faster
than by setting C<$/> to undef. The method is very simple: you do a single
read call with the size of the file (which the C<-s> operator provides).
This bypasses the I/O loop inside perl that checks for EOF and does all
sorts of processing. I then decided to experiment and found that
C<sysread> is even faster, as you would expect. C<sysread> bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
sysopen/sysread/syswrite. All the rest of the code is just to support
the various options and data passing techniques.
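
The idea can be sketched with C<sysopen>, C<-s> and C<sysread> as below.
This is an illustration of the technique rather than the code in
File::Slurp; the sub name is just a label for this example, and the loop
is there because one C<sysread> call is allowed to return fewer bytes
than requested.

```perl
use strict ;
use warnings ;
use Fcntl qw( O_RDONLY ) ;
use File::Temp qw( tempfile ) ;

# Sketch of the technique, not the module's code.
sub sysread_file {
    my( $file_name ) = @_ ;

    sysopen( my $fh, $file_name, O_RDONLY ) ||
        die "can't open $file_name: $!" ;

    # Ask for the whole file in one request.
    my $size_left = -s $fh ;
    my $buf = '' ;

    while ( $size_left > 0 ) {
        # Append at the current end of the buffer.
        my $read_cnt = sysread( $fh, $buf, $size_left, length $buf ) ;
        die "read error in $file_name: $!" unless defined $read_cnt ;
        last if $read_cnt == 0 ;
        $size_left -= $read_cnt ;
    }
    return $buf ;
}

# Scratch file to slurp back.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh 'abc' x 1000 ;
close( $tmp_fh ) || die "close $file: $!" ;

my $text = sysread_file( $file ) ;
print length( $text ), " bytes slurped\n" ;
```

For an ordinary disk file the loop normally runs once, so the common
case is still a single read call.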

=head2 Benchmarks

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

    perl -Ilib extras/slurp_bench.pl

If you pass in an argument on the command line, it will be passed to
C<timethese()> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.
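
For readers who haven't used the C<Benchmark> module, here is a small
sketch of how a count is handed to C<timethese()>. It is not the script
from the tarball: the two entries, the scratch file and the fixed
iteration count are assumptions chosen so the example finishes quickly.
Setting C<$count> to C<-2> instead would ask for at least 2 CPU seconds
per entry.

```perl
use strict ;
use warnings ;
use Benchmark qw( timethese cmpthese ) ;
use File::Temp qw( tempfile ) ;

# Scratch file shaped like the article's short test file.
my @lines = ( 'abc' x 30 . "\n" ) x 100 ;
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh @lines ;
close( $tmp_fh ) || die "close $file: $!" ;

# Two hypothetical entries: the list-slurp-then-join style and the
# scalar slurp with $/ undefined.
my %methods = (
    join_lines => sub {
        open( my $fh, '<', $file ) || die "open $file: $!" ;
        my $text = join( '', <$fh> ) ;
    },
    undef_rs => sub {
        open( my $fh, '<', $file ) || die "open $file: $!" ;
        my $text = do { local( $/ ) ; <$fh> } ;
    },
) ;

# A positive count is a fixed number of iterations; a negative count
# is a minimum number of CPU seconds per entry.
my $count = 5000 ;

my $results = timethese( $count, \%methods ) ;
cmpthese( $results ) ;
```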

The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.

The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.

The short and long file data is made like this:

    my @lines = ( 'abc' x 30 . "\n") x 100 ;
    my $text = join( '', @lines ) ;

    @lines = ( 'abc' x 40 . "\n") x 1000 ;
    $text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.
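
Those sizes can be checked with a couple of C<length> calls. The sketch
keeps the short and long data in separate variables (the names are
chosen for this example) so both totals can be printed together:

```perl
use strict ;
use warnings ;

my @lines = ( 'abc' x 30 . "\n" ) x 100 ;
my $short_text = join( '', @lines ) ;

@lines = ( 'abc' x 40 . "\n" ) x 1000 ;
my $long_text = join( '', @lines ) ;

# ( 3 * 30 + 1 ) * 100  and  ( 3 * 40 + 1 ) * 1000
printf "short: %d bytes, long: %d bytes\n",
    length( $short_text ), length( $long_text ) ;
```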

=head3 Scalar Slurp of Short File

    file_contents             651/s
    file_contents_no_OO       828/s
    cpan_read_file            1866/s
    cpan_slurp                1934/s
    read_file                 2079/s
    new                       2270/s
    new_buf_ref               2403/s
    new_scalar_ref            2415/s
    sysread_file              2572/s

=head3 Scalar Slurp of Long File

    file_contents_no_OO       82.9/s
    file_contents             85.4/s
    cpan_read_file            250/s
    cpan_slurp                257/s
    read_file                 323/s
    new                       468/s
    sysread_file              489/s
    new_scalar_ref            766/s
    new_buf_ref               767/s

The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry which was added to
highlight the overhead of the more flexible new module which isn't that
much. The file_contents entries are always the worst since they do a
list slurp and then a join, which is a classic newbie and cargo culted
style which is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.

=head3 List Slurp of Short File

    cpan_read_file            589/s
    cpan_slurp_to_array       620/s
    read_file                 824/s
    new_array_ref             824/s
    sysread_file              828/s
    new                       829/s
    new_in_anon_array         833/s
    cpan_slurp_to_array_ref   836/s

=head3 List Slurp of Long File

    cpan_read_file            62.4/s
    cpan_slurp_to_array       62.7/s
    read_file                 92.9/s
    sysread_file              94.8/s
    new_array_ref             95.5/s
    new                       96.2/s
    cpan_slurp_to_array_ref   96.3/s
    new_in_anon_array         97.2/s

This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner (in the long file
entries - it is no worse than a close second in the short file
entries).

Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.
|
Packit |
be8974 |
|
|
Packit |
be8974 |
=head3 Scalar Spew of Short File

	cpan_write_file     1035/s    38
	print_file          1055/s    41
	syswrite_file       1135/s    44
	new                 1519/s     2
	print_join_file     1766/s     2
	new_ref             1900/s     2
	syswrite_file2      2138/s     2

=head3 Scalar Spew of Long File

	cpan_write_file     164/s    20
	print_file          211/s    26
	syswrite_file       236/s    25
	print_join_file     277/s     2
	new                 295/s     2
	syswrite_file2      428/s     2
	new_ref             608/s     2

In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The C<syswrite_file2> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of
C<print>.

=head3 List Spew of Short File

	cpan_write_file      794/s    29
	syswrite_file       1000/s    38
	print_file          1013/s    42
	new                 1399/s     2
	print_join_file     1557/s     2

=head3 List Spew of Long File

	cpan_write_file     112/s    12
	print_file          179/s    21
	syswrite_file       181/s    19
	print_join_file     205/s     2
	new                 228/s     2

Again, the simple C<print_join_file> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then calls
C<print> on the output list, which is much slower than passing C<print>
a single scalar generated by C<join>. The C<print_file> entry shows the
advantage of directly printing C<@_>, and the C<print_join_file> entry
adds the join optimization.

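
To see the join effect in isolation, here is a minimal sketch using the core Benchmark module. The subroutine names mirror the benchmark entries above, but their bodies are assumed for illustration and are not taken from the actual benchmark script. The line count and the temp file are arbitrary choices.

```perl
use strict ;
use warnings ;
use Benchmark qw( cmpthese ) ;
use File::Temp qw( tempfile ) ;

my @lines = map { "line $_\n" } 1 .. 1000 ;
my( undef, $file ) = tempfile( UNLINK => 1 ) ;

# print the list directly: print has to handle every element
sub print_file {

	my $file_name = shift ;

	open( my $fh, '>', $file_name ) or die "Can't open $file_name: $!" ;
	print {$fh} @_ ;
	close( $fh ) or die "Can't close $file_name: $!" ;
}

# join first, so print is handed one scalar
sub print_join_file {

	my $file_name = shift ;

	open( my $fh, '>', $file_name ) or die "Can't open $file_name: $!" ;
	print {$fh} join( '', @_ ) ;
	close( $fh ) or die "Can't close $file_name: $!" ;
}

# run each entry for at least 1 CPU second and print a comparison table
cmpthese( -1, {
	print_file      => sub { print_file( $file, @lines ) },
	print_join_file => sub { print_join_file( $file, @lines ) },
} ) ;
```

The rates you get will depend on your perl, disk and file system, so treat the table it prints as a relative comparison only.
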
Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an C<overwrite> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.

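
As a rough sketch of what proper locking could look like, the writer below takes an exclusive C<flock> before it changes the file and the reader takes a shared one before slurping. The names C<locked_spew> and C<locked_slurp> are hypothetical and this is not code from File::Slurp. Note that C<flock> is advisory, so it only helps when every process touching the file cooperates.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT :flock ) ;
use File::Temp qw( tempfile ) ;

# Writer: lock first, then change the data, so a cooperating reader
# never sees a half written or half truncated file.
sub locked_spew {

	my( $file_name, $data ) = @_ ;

	sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
		or die "Can't open $file_name: $!" ;
	flock( $fh, LOCK_EX ) or die "Can't lock $file_name: $!" ;

	truncate( $fh, 0 ) or die "Can't truncate $file_name: $!" ;
	print {$fh} $data or die "Can't write $file_name: $!" ;

	# close flushes the data and releases the lock
	close( $fh ) or die "Can't close $file_name: $!" ;
	return 1 ;
}

# Reader: the shared lock blocks while a writer holds the file.
sub locked_slurp {

	my( $file_name ) = @_ ;

	open( my $fh, '<', $file_name ) or die "Can't open $file_name: $!" ;
	flock( $fh, LOCK_SH ) or die "Can't lock $file_name: $!" ;

	local $/ ;
	my $data = <$fh> ;
	close( $fh ) ;
	return $data ;
}

my( undef, $file ) = tempfile( UNLINK => 1 ) ;
locked_spew( $file, "new data\n" ) ;
my $data = locked_slurp( $file ) ;
print $data ;
```
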
But what about those long times? Well, it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save
on disk seeks as well as on CPU time. In fact when running this
benchmark, I could hear my disk going crazy allocating inodes during the
spew operations. This speedup in both CPU and wallclock time is why the
new module always does overwriting when spewing files. It also does the
proper truncate (and this is checked in the tests by spewing shorter
files after longer ones had previously been written). The C<overwrite>
subroutine is just a typeglob alias to C<write_file> and is there for
backwards compatibility with the old File::Slurp module.

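
The overwrite and truncate sequence can be sketched as follows. This is a simplified illustration under my own assumptions, not the module's actual code, and C<overwrite_file> is a hypothetical name. The file is opened without C<O_TRUNC> so an existing file is reused, and it is cut to the number of bytes written only after the data is in place.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;
use File::Temp qw( tempfile ) ;

sub overwrite_file {

	my( $file_name, $buf ) = @_ ;

	# O_CREAT without O_TRUNC: an existing file is reused as is
	sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
		or die "Can't open $file_name: $!" ;

	my $offset = 0 ;

	while( $offset < length $buf ) {

		my $cnt = syswrite( $fh, $buf, length( $buf ) - $offset, $offset ) ;
		die "write error in file $file_name: $!" unless $cnt ;
		$offset += $cnt ;
	}

	# drop any leftover bytes from a longer previous version
	truncate( $fh, $offset ) or die "Can't truncate $file_name: $!" ;
	close( $fh ) or die "Can't close $file_name: $!" ;

	return 1 ;
}

my( undef, $file ) = tempfile( UNLINK => 1 ) ;

overwrite_file( $file, "a much longer first version\n" ) ;
overwrite_file( $file, "short\n" ) ;

my $size = -s $file ;
print "$size\n" ;	# prints 6, the length of "short\n"
```
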
=head3 Benchmark Conclusion

Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. It also uses all the other optimizations, including
C<sysread>/C<syswrite> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.

=head2 Error Handling

Slurp subroutines are subject to error conditions such as not being able
to open the file, or I/O errors. How these errors are handled, and what
the caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call C<die()> or even
better, C<croak()>. But sometimes you want the slurp to either
C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an C<eval> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.

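
The three modes can be sketched with a small dispatcher. This is an illustrative reduction, not the module's real internals: the helper C<_error> and the C<slurp_scalar> subroutine are hypothetical names, and the path used below is assumed not to exist.

```perl
use strict ;
use warnings ;
use Carp ;

# Hypothetical helper: report an error according to the err_mode option.
sub _error {

	my( $args, $msg ) = @_ ;

	my $mode = $args->{'err_mode'} || 'croak' ;

	croak $msg if $mode eq 'croak' ;
	carp $msg if $mode eq 'carp' ;

	# any other mode is quiet: just hand back a false status
	return ;
}

sub slurp_scalar {

	my( $file_name, %args ) = @_ ;

	open( my $fh, '<', $file_name )
		or return _error( \%args, "Can't open $file_name: $!" ) ;

	local $/ ;
	my $data = <$fh> ;

	# an empty file is still a successful read
	return defined $data ? $data : '' ;
}

# '/no/such/file' is assumed to be a missing path
my $text = slurp_scalar( '/no/such/file', err_mode => 'quiet' ) ;
print "slurp failed\n" unless defined $text ;
```

With the default mode the same call would croak, and the caller would need an C<eval> block to survive it.
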
C<write_file()> doesn't use the return value for data so it can return a
false status value in-band to mark an error. C<read_file()> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
C<return> would work here. But if you slurp in lines by calling it in a
list context, a bare C<return> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
C<read_file()> will do something I normally strongly advocate against,
i.e., returning an explicit C<undef> value. In scalar context this
still signals an error, and in list context, the returned first value
will be C<undef>, and that is not legal data for the first element. So
the list context also gets an error status it can detect:

	my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
	your_handle_error( "$file_name can't be read\n" )
		if @lines && ! defined $lines[0] ;

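
To make the difference between an empty file and a failed slurp concrete, here is a small self-contained sketch. C<slurp_lines> and C<slurp_failed> are hypothetical stand-ins that only mimic the quiet mode return convention described above, and the missing file name is assumed not to exist.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical quiet mode line slurper: an explicit undef marks failure,
# so the caller gets a one element list instead of an empty one.
sub slurp_lines {

	my( $file_name ) = @_ ;

	open( my $fh, '<', $file_name ) or return undef ;

	my @lines = <$fh> ;
	return @lines ;
}

# A failed slurp is exactly one element, and that element is undef.
sub slurp_failed {
	return @_ == 1 && ! defined $_[0] ;
}

my( undef, $empty_file ) = tempfile( UNLINK => 1 ) ;

for my $file_name ( $empty_file, "$empty_file.missing" ) {

	my @lines = slurp_lines( $file_name ) ;
	my $status = slurp_failed( @lines )
		? 'error'
		: scalar( @lines ) . ' lines' ;

	print "$status\n" ;
}
```

The empty file reports "0 lines" and the missing one reports "error", which is the distinction a bare C<return> could not make.
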
=head2 File::FastSlurp

	use Carp ;
	use Fcntl qw( :DEFAULT ) ;

	sub read_file {

		my( $file_name, %args ) = @_ ;

		my $buf = '' ;
		my $buf_ref = $args{'buf_ref'} || \$buf ;

		my $mode = O_RDONLY ;
		$mode |= O_BINARY if $args{'binmode'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {

			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = -s FH ;

		# sysread may return less than requested, so loop and
		# append at the current end of the buffer

		while( $size_left > 0 ) {

			my $read_cnt = sysread( FH, ${$buf_ref},
					$size_left, length ${$buf_ref} ) ;

			unless( $read_cnt ) {

				carp "read error in file $file_name: $!" ;
				last ;
			}

			$size_left -= $read_cnt ;
		}

		# handle void context (return scalar by buffer reference)

		return unless defined wantarray ;

		# handle list context: split after each line separator

		my $sep = $/ ;
		return split m|(?<=\Q$sep\E)|, ${$buf_ref} if wantarray ;

		# handle scalar context

		return ${$buf_ref} ;
	}

	sub write_file {

		my $file_name = shift ;

		my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
		my $buf = join '', @_ ;

		# O_CREAT so new files can be written; no O_TRUNC so an
		# existing file is overwritten in place

		my $mode = O_WRONLY | O_CREAT ;
		$mode |= O_BINARY if $args->{'binmode'} ;
		$mode |= O_APPEND if $args->{'append'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {

			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = length( $buf ) ;
		my $offset = 0 ;

		while( $size_left > 0 ) {

			my $write_cnt = syswrite( FH, $buf,
					$size_left, $offset ) ;

			unless( $write_cnt ) {

				carp "write error in file $file_name: $!" ;
				last ;
			}

			$size_left -= $write_cnt ;
			$offset += $write_cnt ;
		}

		# cut off leftover data from a longer previous version

		truncate( FH, $offset ) unless $args->{'append'} ;

		return ;
	}

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still take care not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.

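
One way to act on that caution is to check what you are about to read before slurping it. The sketch below is only an illustration: C<slurp_if_small> is a hypothetical helper and the size limits are arbitrary. It refuses anything that is not a regular file, which covers pipes and terminals on STDIN, and anything over the limit, so the caller can fall back to line by line processing.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical guard: slurp only regular files of at most $max_bytes.
# Returns the file contents, or nothing when slurping is not advisable.
sub slurp_if_small {

	my( $file_name, $max_bytes ) = @_ ;

	# pipes, ttys and directories have no trustworthy size
	return unless -f $file_name ;

	# reuse the stat from -f; an empty file counts as size 0
	my $size = ( -s _ ) || 0 ;
	return if $size > $max_bytes ;

	open( my $fh, '<', $file_name ) or return ;

	local $/ ;
	my $text = <$fh> ;

	return defined $text ? $text : '' ;
}

my( $fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$fh} "small config\n" ;
close( $fh ) or die "Can't close $file: $!" ;

my $text = slurp_if_small( $file, 1024 ) ;
print defined $text
	? "slurped " . length( $text ) . " bytes\n"
	: "too big, process line by line\n" ;
```
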
=head2 Acknowledgements