=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

	while( <FH> ) {
		# do something with $_
	}

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and being
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

	while( <FH> ) {
		push @lines, $_ ;
	}

	foreach ( @lines ) {
		# do something with $_
	}

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that a reasonable size
these days is larger than in the bad old days of limited RAM. Slurping in
a megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line is speed. Perl's I/O
system (like many others) is slow. Calling C<< <> >> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. That is plenty of work for
each line read in. Slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping traditionally refers to a read operation, I use the term
``slurp'' for the concept of doing I/O with an entire file in one
operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or not easily done with line
by line processing. These include global search/replace (without regard
for newlines), grabbing all matches with one call of C<//g>, complex
parsing (which in many cases must ignore newlines), processing *ML (where
line endings are just white space) and performing complex transformations
such as template expansion.

=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

	my $text = read_file( $file ) ;
	my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the C</m> modifier), the '=' char and the text to the end of the
line (again, C</m> makes that work). In fact the ending C<$> is not even needed
since C<.> will not normally match a newline. Since the key and value are
grabbed and the C<m//> is in list context with the C</g> modifier, it will
grab all key/value pairs and return them. The C<%config> hash will be
assigned this list and now you have the file fully parsed into a hash.

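Here is a small self-contained sketch of that parse. The config text is
made up for the example and sits in a here-doc instead of a slurped file,
so nothing about it comes from a real module.

```perl
use strict ;
use warnings ;

# Hypothetical config text standing in for the contents of a slurped file.
my $text = <<'END' ;
name=slurp
mode=fast
# lines without a leading key=value pair simply fail to match

retries=3
END

# In list context m//g returns every captured ( key, value ) pair in
# order, which is exactly the list a hash assignment wants.
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```

The comment line and the blank line are skipped without any extra code,
because the pattern only matches lines that start with a word followed
by '='.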
Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

	$text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

	$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

Just supply a C<template> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the C</s> modifier. This is
something that is much trickier with line by line processing.

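To make that concrete, here is one way the two substitutions might be
put together. The C<%template> data, the ROW chunk name and the body of
the C<template> sub are all invented for this sketch, and the plain tag
pattern is tightened to C<\w+> so it can't swallow a chunk marker.

```perl
use strict ;
use warnings ;

# Hypothetical template data; none of these names come from a module.
my %template = ( name => 'World', count => 3 ) ;

my $text = <<'END' ;
Hello, <%name%>!
<%ROW_START%>
  row line
<%ROW_END%>
END

# Hypothetical chunk expander: repeat a ROW chunk $template{count} times
# and pass any other chunk through unchanged.
sub template {
	my( $chunk_name, $chunk_text ) = @_ ;
	return $chunk_name eq 'ROW'
		? $chunk_text x $template{count}
		: $chunk_text ;
}

# Expand the chunks first so their markers are gone before the simple
# tag substitution runs.
$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template( $1, $2 ) /sge ;
$text =~ s/<%(\w+)%>/$template{$1}/g ;

print $text ;
```

Doing the chunk pass before the tag pass is a design choice of the
sketch; a real system would need to decide how tags inside chunks are
treated.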
Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.

Slurping a file into an array also offers some useful advantages.
One simple example is reading in a flat database where each record has
fields separated by a character such as C<:>:

	my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.

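As a rough illustration of the line index idea, the sketch below builds
a hash from the first field of each record to its position in the
array. The records are invented and held in a literal list so the
example runs without reading F</etc/passwd>; C<%index_of> is just a name
chosen for the sketch.

```perl
use strict ;
use warnings ;

# Hypothetical passwd-style records standing in for a slurped list of lines.
my @lines = (
	"root:x:0:0:root:/root:/bin/sh\n",
	"daemon:x:1:1:daemon:/usr/sbin:/bin/false\n",
	"uri:x:1000:1000:Uri:/home/uri:/bin/bash\n",
) ;

# Split each record into fields; the copy keeps chomp from touching @lines.
my @pw_fields = map { chomp( my $line = $_ ) ; [ split /:/, $line ] } @lines ;

# Index from login name (field 0) to the record's position in the array.
my %index_of ;
$index_of{ $pw_fields[$_][0] } = $_ for 0 .. $#pw_fields ;

# Random access by line number, and lookup by key through the index.
print "line 1 is: $lines[1]" ;
print "uri's shell is $pw_fields[ $index_of{uri} ][6]\n" ;
```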
=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping
a file to a list of lines is trivial; just call the C<< <> >> operator
in list context:

	my @lines = <FH> ;

Slurping to a scalar isn't much more work. Just set the built-in
variable C<$/> (the input record separator) to the undefined value and read
in the file with C<< <> >>:

	{
		local( $/, *FH ) ;
		open( FH, $file ) or die "sudden flaming death\n" ;
		$text = <FH> ;
	}

Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when
the scope exits it will revert C<$/> back to its previous value (most
likely "\n").

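The following sketch shows that scoping in action. It writes its own
scratch file with File::Temp so it can run anywhere, and it uses a
lexical handle with three-argument open, which is my choice for the
example and not what the 5.005-compatible code above does.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Make a small scratch file so the sketch is self-contained.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "line one\nline two\n" ;
close $tmp_fh or die "close $file: $!" ;

my $text ;
{
	local $/ ;	# undefined inside this block only
	open( my $fh, '<', $file ) or die "can't open $file: $!" ;
	$text = <$fh> ;
}

# Outside the block $/ has its previous value again.
print "slurped ", length( $text ), " bytes\n" ;
print "separator restored\n" if defined $/ && $/ eq "\n" ;
```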
Here is a Perl idiom that allows the C<$text> variable to be declared
where it is assigned, with no need for an enclosing block. The C<do>
block will execute C<< <FH> >> in scalar context and slurp in the file
named by C<$file>:

	local( *FH ) ;
	open( FH, $file ) or die "sudden flaming death\n" ;
	my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
5.005. Here they are with 5.6.0 lexical autovivified handles:

	{
		local( $/ ) ;
		open( my $fh, '<', $file ) or die "sudden flaming death\n" ;
		$text = <$fh> ;
	}

	open( my $fh, '<', $file ) or die "sudden flaming death\n" ;
	my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

	my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in C<$file> is assigned to a localized C<@ARGV> and the
null filehandle is used which reads the data from the files in C<@ARGV>.

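A short sketch may help show why that one-liner works and that it leaves
the program's real arguments alone. The scratch file and the pretend
command line arguments are made up for the example.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "alpha\nbeta\n" ;
close $tmp_fh or die "close $file: $!" ;

# Pretend the program was started with its own arguments.
@ARGV = ( 'keep', 'me' ) ;

# The localized @ARGV soaks up the whole right-hand list, so the
# localized $/ is left undefined and <> slurps the named file.
my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

print "got ", length( $text ), " bytes; \@ARGV is still (@ARGV)\n" ;
```

Because both variables are localized inside the C<do> block, the
caller's C<@ARGV> and C<$/> come back unchanged once the block exits.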
Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file split into lines (using C<$/> as the
end of line marker, so leave C<$/> at its default rather than
undefining it).

There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

	my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (C<join> provides a
list context to C<< <FH> >>) and then joins up those lines again. The
original coder of this idiom obviously never read I<perlvar> and learned
how to use C<$/> to allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'' which has a
stronger visual image :-) Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:

	sub burp {
		my( $file_name ) = shift ;
		open( my $fh, ">$file_name" ) ||
			die "can't create $file_name $!" ;
		print $fh @_ ;
	}

Note that it doesn't copy the input text but passes C<@_> directly to
C<print>. We will look at faster variations of that later on.

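Here is how such a sub might be called. This version is a variant I
wrote for the example: it uses three-argument open and an explicit
close, and it writes into a temporary directory so it can be run safely.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Variant of the burp sub above; three-argument open and the explicit
# close are choices made for this sketch.
sub burp {
	my $file_name = shift ;
	open( my $fh, '>', $file_name )
		or die "can't create $file_name: $!" ;
	print {$fh} @_ ;
	close $fh or die "can't close $file_name: $!" ;
}

my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/burp.txt" ;

# A list of lines or a single scalar is written in one call, and each
# call replaces whatever was in the file before.
burp( $file, "first\n", "second\n" ) ;
burp( $file, "replaced\n" ) ;

print "file is now ", ( -s $file ), " bytes\n" ;
```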
=head2 Slurp on the CPAN

As you would expect there are modules on CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

	sub slurp {
		local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
		return <ARGV>;
	}

	sub to_array {
		my @array = slurp( @_ );
		return wantarray ? @array : \@array;
	}

	sub to_scalar {
		my $scalar = slurp( @_ );
		return $scalar;
	}

The subroutine C<slurp()> uses the magic undefined value of C<$/> and
the magic filehandle C<ARGV> to support slurping into a scalar or
array. The module also provides two wrapper subs that allow the caller
to control the context of the slurp. The C<to_array()> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking C<wantarray>. The module has 'slurp'
in C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.

The original File::Slurp.pm has this code:

	sub read_file
	{
		my ($file) = @_;

		local($/) = wantarray ? $/ : undef;
		local(*F);
		my $r;
		my (@r);

		open(F, "<$file") || croak "open $file: $!";
		@r = <F>;
		close(F) || croak "close $file: $!";

		return $r[0] unless wantarray;
		return @r;
	}

This module provides several subroutines including C<read_file()> (more
on the others later). C<read_file()> behaves similarly to
C<Slurp::slurp()> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of C<$/> for scalar slurping but it uses an explicit
open call rather than a localized C<@ARGV> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines but that can easily be rectified by calling it inside an anonymous
array constructor C<[]>.

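The sketch below shows the three calling styles side by side. The
C<read_file> here is a stand-in I wrote with the same context-sensitive
behavior as the code quoted above, not the module's own source, and the
scratch file is created just for the example.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Stand-in with the same context-sensitive behavior as the quoted sub.
sub read_file {
	my( $file ) = @_ ;
	local $/ = wantarray ? $/ : undef ;
	open( my $fh, '<', $file ) or die "open $file: $!" ;
	my @r = <$fh> ;
	close $fh or die "close $file: $!" ;
	return wantarray ? @r : $r[0] ;
}

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "a\nb\n" ;
close $tmp_fh or die "close $file: $!" ;

my $text      = read_file( $file ) ;		# scalar context: whole file
my @lines     = read_file( $file ) ;		# list context: one line each
my $lines_ref = [ read_file( $file ) ] ;	# [] supplies list context

print "scalar: ", length( $text ), " bytes\n" ;
print "list: ", scalar( @lines ), " lines\n" ;
print "array ref: ", scalar( @$lines_ref ), " lines\n" ;
```

The anonymous array constructor still copies the returned list, so it
gives the convenient return type but not the saved copy.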
Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic C<$/> to slurp in scalar mode and the
natural behavior of C<< <> >> in list context to slurp as lines. But
neither is optimized for speed nor can they handle C<binmode()> to
support binary or Unicode files. See below for more on slurp features
and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
C<binmode()>. This section will cover various API design issues such as
efficient return by reference, C<binmode()> and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.

I have used my own C<read_file()> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via C<return>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
sub will do that by default in the appropriate context. If you want
C<read_file()> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the C<read_file()> subroutine.

The next area to cover is what to name the slurp sub. I will go with
C<read_file()>. It is descriptive, keeps compatibility with the
current simple API and doesn't use the 'slurp' nickname (though that
nickname is in the module name). Also I decided to keep the File::Slurp
namespace which was graciously handed over to me by its current owner,
David Muir.

Another critical area when designing APIs is how to pass in
arguments. The C<read_file()> subroutine takes one required argument
which is the file name. To support C<binmode()> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
3 positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with C<undef> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:

	sub read_file {

		my( $file_name, %args ) = @_ ;

		my $buf ;
		my $buf_ref = $args{'buf'} || \$buf ;

The other sub (C<read_file_lines()>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if called
in scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of C<read_file()>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen
in how the two subs are optimized to work together.

Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that it needs to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5 we have to do it using some
cleverness. The first argument is the file name and it will be
positional as with the C<read_file> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments is the data list. So the C<write_file()>
subroutine will start off like this:

	sub write_file {

		my $file_name = shift ;

		my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in C<@_> to minimize any more copying. You call C<write_file()> like this:

	write_file( 'foo', { binmode => ':raw' }, @data ) ;
	write_file( 'junk', { append => 1 }, @more_junk ) ;
	write_file( 'bar', @spew ) ;

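To show how that argument handling could be carried through, here is a
minimal sketch of a C<write_file> that honors the two options used in
the calls above. Only the first two statements come from the article;
the rest of the body is my assumption and it uses plain C<print> rather
than whatever the real module does.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Sketch only: option names follow the calls shown above, the body is
# assumed for illustration.
sub write_file {
	my $file_name = shift ;
	my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

	my $mode = $args->{append} ? '>>' : '>' ;
	open( my $fh, $mode, $file_name )
		or die "can't open $file_name: $!" ;
	binmode( $fh, $args->{binmode} ) if $args->{binmode} ;

	# The data list is still in @_ and is printed without being copied.
	print {$fh} @_ ;
	close $fh or die "can't close $file_name: $!" ;
}

my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/spew.txt" ;

write_file( $file, "a\n", "b\n" ) ;			# no options
write_file( $file, { append => 1 }, "c\n" ) ;		# options hash first
write_file( "$dir/raw.bin", { binmode => ':raw' }, "\x00\x01" ) ;

print "spew.txt is ", ( -s $file ), " bytes\n" ;
```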
=head2 Fast Slurping

Somewhere along the line, I learned about a way to slurp files faster
than by setting C<$/> to C<undef>. The method is very simple: you do a
single read call with the size of the file (which the C<-s> operator
provides). This bypasses the I/O loop inside perl that checks for EOF
and does all sorts of processing. I then decided to experiment and found
that C<sysread> is even faster, as you would expect. C<sysread> bypasses
all of Perl's stdio and reads the file from the kernel buffers directly
into a Perl scalar. This is why the slurp code in File::Slurp uses
C<sysopen>/C<sysread>/C<syswrite>. All the rest of the code is just to
support the various options and data passing techniques.

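A bare-bones version of that technique might look like the sketch
below. It is not the module's code: the sub name is made up here, and
the loop is only there because C<sysread> is allowed to return fewer
bytes than were asked for.

```perl
use strict ;
use warnings ;
use Fcntl qw( O_RDONLY ) ;
use File::Temp qw( tempfile ) ;

# Sketch of the technique: size the read with -s and keep calling
# sysread until the whole file has arrived.
sub sysread_file {
	my( $file_name ) = @_ ;

	sysopen( my $fh, $file_name, O_RDONLY )
		or die "can't open $file_name: $!" ;

	my $size_left = -s $fh ;
	my $buf = '' ;

	while ( $size_left > 0 ) {
		# Append at the current end of the buffer.
		my $read_cnt = sysread( $fh, $buf, $size_left, length $buf ) ;
		die "read error in $file_name: $!" unless defined $read_cnt ;
		last if $read_cnt == 0 ;	# file shrank under us
		$size_left -= $read_cnt ;
	}

	return $buf ;
}

my $data = join( '', ( 'abc' x 30 . "\n" ) x 100 ) ;

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} $data ;
close $tmp_fh or die "close $file: $!" ;

my $text = sysread_file( $file ) ;
print "read ", length( $text ), " bytes\n" ;
```

Returning C<$buf> by value costs one copy; that is the copy the buffer
reference options described earlier are meant to avoid.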
=head2 Benchmarks

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

	perl -Ilib extras/slurp_bench.pl

If you pass in an argument on the command line, it will be passed to
C<timethese()> and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.

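For readers who have not used the Benchmark module, here is a tiny
stand-alone sketch of the same pattern. It is not taken from
slurp_bench.pl: the two entries and their names are mine, and the
default is a fixed 500 iterations so the sketch finishes quickly. Pass
-2 on the command line to get the at-least-2-CPU-seconds behavior
described above.

```perl
use strict ;
use warnings ;
use Benchmark qw( timethese ) ;
use File::Temp qw( tempfile ) ;

# A negative count means "run for at least that many CPU seconds";
# a positive count is a plain number of iterations.
my $duration = shift || 500 ;

my @lines = ( 'abc' x 30 . "\n" ) x 100 ;

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} @lines ;
close $tmp_fh or die "close $file: $!" ;

my $results = timethese( $duration, {

	# Scalar slurp by undefining $/.
	slurp_scalar => sub {
		open( my $fh, '<', $file ) or die "open $file: $!" ;
		my $text = do { local $/ ; <$fh> } ;
	},

	# The list-slurp-then-join style criticized earlier.
	join_lines => sub {
		open( my $fh, '<', $file ) or die "open $file: $!" ;
		my $text = join( '', <$fh> ) ;
	},
} ) ;
```

With so few iterations the reported rates are only a smoke test;
meaningful comparisons need the longer CPU-time based runs.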
The following numbers are from a run I did on my 300MHz SPARC. You will
most likely get much faster counts on your boxes but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.

The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.

The short and long file data is made like this:

	my @lines = ( 'abc' x 30 . "\n")  x 100 ;
	my $text = join( '', @lines ) ;

	@lines = ( 'abc' x 40 . "\n")  x 1000 ;
	$text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.

=head3 Scalar Slurp of Short File

	file_contents        651/s
	file_contents_no_OO  828/s
	cpan_read_file      1866/s
	cpan_slurp          1934/s
	read_file           2079/s
	new                 2270/s
	new_buf_ref         2403/s
	new_scalar_ref      2415/s
	sysread_file        2572/s

=head3 Scalar Slurp of Long File

	file_contents_no_OO 82.9/s
	file_contents       85.4/s
	cpan_read_file       250/s
	cpan_slurp           257/s
	read_file            323/s
	new                  468/s
	sysread_file         489/s
	new_scalar_ref       766/s
	new_buf_ref          767/s

The primary inference you get from looking at the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall
except for the very simple sysread_file entry, which was added to
highlight the overhead of the more flexible new module (which isn't that
much). The file_contents entries are always the worst since they do a
list slurp and then a join, which is a classic newbie and cargo culted
style that is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files but they are laggards
compared to the new module when the file gets much larger.

=head3 List Slurp of Short File

	cpan_read_file          589/s
	cpan_slurp_to_array     620/s
	read_file               824/s
	new_array_ref           824/s
	sysread_file            828/s
	new                     829/s
	new_in_anon_array       833/s
	cpan_slurp_to_array_ref 836/s

=head3 List Slurp of Long File

	cpan_read_file          62.4/s
	cpan_slurp_to_array     62.7/s
	read_file               92.9/s
	sysread_file            94.8/s
	new_array_ref           95.5/s
	new                     96.2/s
	cpan_slurp_to_array_ref 96.3/s
	new_in_anon_array       97.2/s

This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner: it leads only
narrowly in the long file entries and is no worse than a close second in
the short file entries.

Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.

=head3 Scalar Spew of Short File

	cpan_write_file 1035/s	38
	print_file      1055/s	41
	syswrite_file   1135/s	44
	new             1519/s	2
	print_join_file 1766/s	2
	new_ref         1900/s	2
	syswrite_file2  2138/s	2

=head3 Scalar Spew of Long File

	cpan_write_file 164/s	20
	print_file      211/s	26
	syswrite_file   236/s	25
	print_join_file 277/s	2
	new             295/s	2
	syswrite_file2  428/s	2
	new_ref         608/s	2

In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The C<syswrite_file2> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of print.

=head3 List Spew of Short File

	cpan_write_file  794/s	29
	syswrite_file   1000/s	38
	print_file      1013/s	42
	new             1399/s	2
	print_join_file 1557/s	2

=head3 List Spew of Long File

	cpan_write_file 112/s	12
	print_file      179/s	21
	syswrite_file   181/s	19
	print_join_file 205/s	2
	new             228/s	2

Again, the simple C<print_join_file> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
C<print> on the output list and that is much slower than passing to
C<print> a single scalar generated by join. The C<print_file> entry
shows the advantage of directly printing C<@_> and the
C<print_join_file> adds the join optimization.

Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an C<overwrite> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.

But what about those long times? Well it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save on
disk seeks as well as on cpu time. In fact when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both cpu and wallclock is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The C<overwrite>
subroutine is just a typeglob alias to C<write_file> and is there for
backwards compatibility with the old File::Slurp module.
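The overwrite followed by a truncate can be sketched as below. This is an illustration under my own assumptions and not the module's code: C<overwrite_file> is a made-up name, errors simply die, and the data is assumed to be a plain byte string.

```perl
use strict ;
use warnings ;

use Fcntl qw( :DEFAULT :seek ) ;

sub overwrite_file {

	my( $file_name, $buf ) = @_ ;

	# no O_TRUNC here: the old data stays in place until it is overwritten
	sysopen( my $fh, $file_name, O_WRONLY | O_CREAT ) or
				die "Can't open $file_name: $!" ;

	my $size_left = length( $buf ) ;
	my $offset = 0 ;

	while( $size_left > 0 ) {

		my $write_cnt = syswrite( $fh, $buf, $size_left, $offset ) ;
		die "write error in file $file_name: $!" unless $write_cnt ;

		$size_left -= $write_cnt ;
		$offset += $write_cnt ;
	}

	# cut the file off at the current position so nothing from a
	# longer previous version is left behind
	truncate( $fh, sysseek( $fh, 0, SEEK_CUR ) ) or
				die "Can't truncate $file_name: $!" ;

	close( $fh ) or die "Can't close $file_name: $!" ;
	return ;
}
```

Spewing a short string over a longer file is the case worth testing, since without the truncate the tail of the old data would still be there.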
Packit be8974
=head3 Benchmark Conclusion

Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations including using
C<sysread/syswrite> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.

=head2 Error Handling

Slurp subroutines are subject to error conditions such as not being
able to open the file, or I/O errors. How these errors are handled, and
what the caller will see, are important aspects of the design of an API.
The classic error handling for slurping has been to call C<die()> or even
better, C<croak()>. But sometimes you want the slurp to either
C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an C<eval> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.
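To make the three modes concrete, here is a minimal sketch of how such a dispatch could be written. It is not the code from File::Slurp: C<handle_error> and C<slurp_text> are hypothetical names, and the slurp itself is a plain scalar-only read.

```perl
use strict ;
use warnings ;

use Carp ;

# hypothetical helper: act on an error according to err_mode
sub handle_error {

	my( $err_mode, $err_msg ) = @_ ;

	# croak is the default mode
	$err_mode = 'croak' unless defined $err_mode ;

	croak $err_msg if $err_mode eq 'croak' ;
	carp $err_msg if $err_mode eq 'carp' ;

	# any other mode (e.g. 'quiet') reports nothing
	return ;
}

# hypothetical scalar-only slurp that honors err_mode
sub slurp_text {

	my( $file_name, %args ) = @_ ;

	my $fh ;
	unless( open( $fh, '<', $file_name ) ) {

		handle_error( $args{'err_mode'},
				"Can't open $file_name: $!" ) ;
		return undef ;
	}

	local $/ ;
	return scalar <$fh> ;
}

# in quiet mode the caller checks the returned status itself
my $text = slurp_text( '/no/such/dir/no_such_file', err_mode => 'quiet' ) ;
print "slurp failed\n" unless defined $text ;
```

With no C<err_mode> given, the same call would croak, and the caller would need an C<eval> block to survive it.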
Packit be8974
C<write_file()> doesn't use the return value for data so it can return a
false status value in-band to mark an error. C<read_file()> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare C<return> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
C<read_file()> will do something I normally strongly advocate against,
i.e., returning an explicit C<undef> value. In the scalar context this
still returns an error, and in list context, the returned first value
will be C<undef>, and that is not legal data for the first element. So
the list context also gets an error status it can detect:

	my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
	your_handle_error( "$file_name can't be read\n" ) unless
					@lines && defined $lines[0] ;
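The difference between a bare C<return> and an explicit C<return undef> is easy to see in isolation. The two subroutines below are stand-ins made up for this demonstration and have nothing to do with the module itself.

```perl
use strict ;
use warnings ;

sub bare_return  { return }
sub undef_return { return undef }

my @bare_list	= bare_return() ;	# empty list, same as an empty file
my @undef_list	= undef_return() ;	# one element and it is undef
my $bare_scalar	= bare_return() ;	# undef in scalar context

printf "bare return gives %d elements\n", scalar @bare_list ;
printf "undef return gives %d elements\n", scalar @undef_list ;
```

Only the explicit C<undef> gives a list context caller something that cannot be confused with the lines of an empty file.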
Packit be8974
Packit be8974
=head2 File::FastSlurp

	use Carp ;
	use Fcntl qw( :DEFAULT :seek ) ;

	sub read_file {

		my( $file_name, %args ) = @_ ;

		my $buf ;
		my $buf_ref = $args{'buf_ref'} || \$buf ;

	# start with an empty buffer so sysread appends from offset 0

		${$buf_ref} = '' ;

		my $mode = O_RDONLY ;
		$mode |= O_BINARY if $args{'binmode'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {

			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = -s FH ;

		while( $size_left > 0 ) {

			my $read_cnt = sysread( FH, ${$buf_ref},
					$size_left, length ${$buf_ref} ) ;

			unless( $read_cnt ) {

				carp "read error in file $file_name: $!" ;
				last ;
			}

			$size_left -= $read_cnt ;
		}

	# handle void context (return scalar by buffer reference)

		return unless defined wantarray ;

	# handle list context (split after each line ending and keep it)

		return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

	# handle scalar context

		return ${$buf_ref} ;
	}

	sub write_file {

		my $file_name = shift ;

		my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
		my $buf = join '', @_ ;

	# create the file if needed, otherwise overwrite it in place

		my $mode = O_WRONLY | O_CREAT ;
		$mode |= O_BINARY if $args->{'binmode'} ;
		$mode |= O_APPEND if $args->{'append'} ;

		local( *FH ) ;
		unless( sysopen( FH, $file_name, $mode ) ) {

			carp "Can't open $file_name: $!" ;
			return ;
		}

		my $size_left = length( $buf ) ;
		my $offset = 0 ;

		while( $size_left > 0 ) {

			my $write_cnt = syswrite( FH, $buf,
					$size_left, $offset ) ;

			unless( $write_cnt ) {

				carp "write error in file $file_name: $!" ;
				last ;
			}

			$size_left -= $write_cnt ;
			$offset += $write_cnt ;
		}

	# drop any leftover data from a longer previous version

		truncate( FH, sysseek( FH, 0, SEEK_CUR ) )
						unless $args->{'append'} ;

		return ;
	}
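The list context branch of C<read_file> has to split the buffer into lines without losing the line endings. One way to get that behavior, shown here as a small standalone sketch with made-up sample text, is to split on a zero-width lookbehind for the input record separator C<$/>.

```perl
use strict ;
use warnings ;

my $text = "one\ntwo\nthree" ;

# split after each input record separator without consuming it
my @lines = split m|(?<=$/)|, $text ;

# show each line in brackets so the kept newlines are visible
print map "[$_]", @lines ;
print "\n" ;
```

Because the lookbehind matches a position and not the newline itself, each element still ends with its line ending, just as if the file had been read line by line.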
Packit be8974
=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put out
to pasture. Perl 6 will allow you to set a 'slurp' property on file handles
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still take care not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte-sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.

=head2 Acknowledgements
Packit be8974
Packit be8974
Packit be8974
Packit be8974