Packit	be8974	`=head1 Perl Slurp Ease`
Packit	be8974
Packit	be8974	`=head2 Introduction`
Packit	be8974
Packit	be8974
Packit	be8974	`One of the common Perl idioms is processing text files line by line:`
Packit	be8974
Packit	be8974	`while( <FH> ) {`
Packit	be8974	`do something with $_`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`This idiom has several variants, but the key point is that it reads in`
Packit	be8974	`only one line from the file in each loop iteration. This has several`
Packit	be8974	`advantages, including limiting memory use to one line, the ability to`
Packit	be8974	`handle any size file (including data piped in via STDIN), and it is`
Packit	be8974	`easily taught to and understood by Perl newbies. In fact newbies are the`
Packit	be8974	`ones who do silly things like this:`
Packit	be8974
Packit	be8974	`while( <FH> ) {`
Packit	be8974	`push @lines, $_ ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`foreach ( @lines ) {`
Packit	be8974	`do something with $_`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`Line by line processing is fine, but it isn't the only way to deal with`
Packit	be8974	`reading files. The other common style is reading the entire file into a`
Packit	be8974	`scalar or array, and that is commonly known as slurping. Now, slurping has`
Packit	be8974	`somewhat of a poor reputation, and this article is an attempt at`
Packit	be8974	`rehabilitating it. Slurping files has advantages and limitations, and is`
Packit	be8974	`not something you should just do when line by line processing is fine.`
Packit	be8974	`It is best when you need the entire file in memory for processing all at`
Packit	be8974	`once. Slurping with in memory processing can be faster and lead to`
Packit	be8974	`simpler code than line by line if done properly.`
Packit	be8974
Packit	be8974	`The biggest issue to watch for with slurping is file size. Slurping very`
Packit	be8974	`large files or unknown amounts of data from STDIN can be disastrous to`
Packit	be8974	`your memory usage and cause swap disk thrashing. You can slurp STDIN if`
Packit	be8974	`you know that you can handle the maximum size input without`
Packit	be8974	`detrimentally affecting your memory usage. So I advocate slurping only`
Packit	be8974	`disk files and only when you know their size is reasonable and you have`
Packit	be8974	`a real reason to process the file as a whole. Note that reasonable size`
Packit	be8974	`these days is larger than the bad old days of limited RAM. Slurping in a`
Packit	be8974	`megabyte is not an issue on most systems. But most of the`
Packit	be8974	`files I tend to slurp in are much smaller than that. Typical files that`
Packit	be8974	`work well with slurping are configuration files, (mini-)language scripts,`
Packit	be8974	`some data (especially binary) files, and other files of known sizes`
Packit	be8974	`which need fast processing.`
Packit	be8974
Packit	be8974	`Another major win for slurping over line by line is speed. Perl's IO`
Packit	be8974	`system (like many others) is slow. Calling C<< <> >> for each line`
Packit	be8974	`requires a check for the end of line, checks for EOF, copying a line,`
Packit	be8974	`munging the internal handle structure, etc. Plenty of work for each line`
Packit	be8974	`read in. Whereas slurping, if done correctly, will usually involve only`
Packit	be8974	`one I/O call and no extra data copying. The same is true for writing`
Packit	be8974	`files to disk, and we will cover that as well (even though the term`
Packit	be8974	`slurping is traditionally a read operation, I use the term "slurp" for`
Packit	be8974	`the concept of doing I/O with an entire file in one operation).`
Packit	be8974
Packit	be8974	`Finally, when you have slurped the entire file into memory, you can do`
Packit	be8974	`operations on the data that are not possible or easily done with line by`
Packit	be8974	`line processing. These include global search/replace (without regard for`
Packit	be8974	`newlines), grabbing all matches with one call of C<//g>, complex parsing`
Packit	be8974	`(which in many cases must ignore newlines), processing *ML (where line`
Packit	be8974	`endings are just white space) and performing complex transformations`
Packit	be8974	`such as template expansion.`
Packit	be8974
Packit	be8974	`=head2 Global Operations`
Packit	be8974
Packit	be8974	`Here are some simple global operations that can be done quickly and`
Packit	be8974	`easily on an entire file that has been slurped in. They could also be`
Packit	be8974	`done with line by line processing but that would be slower and require`
Packit	be8974	`more code.`
Packit	be8974
Packit	be8974	`A common problem is reading in a file with key/value pairs. There are`
Packit	be8974	`modules which do this but who needs them for simple formats? Just slurp`
Packit	be8974	`in the file and do a single parse to grab all the key/value pairs.`
Packit	be8974
Packit	be8974	`my $text = read_file( $file ) ;`
Packit	be8974	`my %config = $text =~ /^(\w+)=(.+)$/mg ;`
Packit	be8974
Packit	be8974	`That matches a key which starts a line (anywhere inside the string`
Packit	be8974	`because of the C</m> modifier), the '=' char and the text to the end of the`
Packit	be8974	`line (again, C</m> makes that work). In fact the ending C<$> is not even needed`
Packit	be8974	`since C<.> will not normally match a newline. Since the key and value are`
Packit	be8974	`grabbed and the C<m//> is in list context with the C</g> modifier, it will`
Packit	be8974	`grab all key/value pairs and return them. The C<%config> hash will be`
Packit	be8974	`assigned this list and now you have the file fully parsed into a hash.`
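Here is a minimal sketch of that parse which runs without a file on disk. The config text is a made-up stand-in for what C<read_file()> would return, and the key names are assumptions for illustration only.

```perl
use strict ;
use warnings ;

# Hypothetical config text standing in for a slurped file.
my $text = <<'END' ;
name=slurp
# a comment line; it has no key=value shape, so the regex skips it
mode=fast
size=121000
END

# In list context with /g, the match returns every captured key and value.
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```

Lines that do not start with a word followed by '=' simply never match, so comments and blank lines fall out for free.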
Packit	be8974
Packit	be8974	`Various projects I have worked on needed some simple templating and I`
Packit	be8974	`wasn't in the mood to use a full module (please, no flames about your`
Packit	be8974	`favorite template module :-). So I rolled my own by slurping in the`
Packit	be8974	`template file, setting up a template hash and doing this one line:`
Packit	be8974
Packit	be8974	`$text =~ s/<%(.+?)%>/$template{$1}/g ;`
Packit	be8974
Packit	be8974	`That only works if the entire file was slurped in. With a little`
Packit	be8974	`extra work it can handle chunks of text to be expanded:`
Packit	be8974
Packit	be8974	`$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;`
Packit	be8974
Packit	be8974	`Just supply a C<template> sub to expand the text between the markers and`
Packit	be8974	`you have yourself a simple system with minimal code. Note that this will`
Packit	be8974	`work and grab over multiple lines due to the C</s> modifier. This is`
Packit	be8974	`something that is much trickier with line by line processing.`
Packit	be8974
Packit	be8974	`Note that this is a very simple templating system, and it can't directly`
Packit	be8974	`handle nested tags and other complex features. But even if you use one`
Packit	be8974	`of the myriad of template modules on the CPAN, you will gain by having`
Packit	be8974	`speedier ways to read and write files.`
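To make the two substitutions concrete, here is a small sketch that runs both on an in-memory string. The template text, the hash contents and the body of the C<template> sub are all hypothetical; only the two regexes come from the text above.

```perl
use strict ;
use warnings ;

# Hypothetical template data; the marker names are made up for this sketch.
my %template = ( name => 'World', module => 'File::Slurp' ) ;

my $text = "Hello <%name%>, welcome to <%module%>.\n" .
    "<%LIST_START%>one\ntwo\n<%LIST_END%>" ;

# Hypothetical chunk expander: labels the chunk and upper-cases its text.
sub template {
    my( $name, $chunk ) = @_ ;
    return "[$name]\n" . uc $chunk ;
}

# Expand chunks first so the simple tag regex never sees START/END markers.
$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template( $1, $2 ) /sge ;
$text =~ s/<%(.+?)%>/$template{$1}/g ;

print $text ;
```

Running the chunk expansion before the simple tag expansion is a choice made for this sketch; with the opposite order the simple regex would try to look up C<LIST_START> in the hash.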
Packit	be8974
Packit	be8974	`Slurping in a file into an array also offers some useful advantages.`
Packit	be8974	`One simple example is reading in a flat database where each record has`
Packit	be8974	`fields separated by a character such as C<:>:`
Packit	be8974
Packit	be8974	`my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;`
Packit	be8974
Packit	be8974	`Random access to any line of the slurped file is another advantage. Also`
Packit	be8974	`a line index could be built to speed up searching the array of lines.`
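As a sketch of the line index idea, the following builds a lookup from login name to array position. The records are invented passwd-style lines standing in for the slurped file, and the C<%index> hash is an assumption of mine, not something the module provides.

```perl
use strict ;
use warnings ;

# Hypothetical passwd-style records standing in for the slurped lines.
my @lines = (
    "root:x:0:0:root:/root:/bin/sh\n",
    "daemon:x:1:1:daemon:/usr/sbin:/bin/false\n",
    "uri:x:1000:1000:Uri:/home/uri:/bin/bash\n",
) ;

# One array ref of fields per record, with the trailing newline removed.
my @pw_fields = map { chomp( my $line = $_ ) ; [ split /:/, $line ] } @lines ;

# Index from login name to position in the array, for direct lookups.
my %index ;
$index{ $pw_fields[$_][0] } = $_ for 0 .. $#pw_fields ;

my $rec = $pw_fields[ $index{'uri'} ] ;
print "uri's shell is $rec->[6]\n" ;
```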
Packit	be8974
Packit	be8974
Packit	be8974	`=head2 Traditional Slurping`
Packit	be8974
Packit	be8974	`Perl has always supported slurping files with minimal code. Slurping of`
Packit	be8974	`a file to a list of lines is trivial: just call the C<< <> >> operator`
Packit	be8974	`in a list context:`
Packit	be8974
Packit	be8974	`my @lines = <FH> ;`
Packit	be8974
Packit	be8974	`and slurping to a scalar isn't much more work. Just set the built in`
Packit	be8974	`variable C<$/> (the input record separator) to the undefined value and read`
Packit	be8974	`in the file with C<< <> >>:`
Packit	be8974
Packit	be8974	`{`
Packit	be8974	`local( $/, *FH ) ;`
Packit	be8974	`open( FH, $file ) or die "sudden flaming death\n" ;`
Packit	be8974	`$text = <FH> ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`Notice the use of C<local()>. It sets C<$/> to C<undef> for you and when`
Packit	be8974	`the scope exits it will revert C<$/> back to its previous value (most`
Packit	be8974	`likely "\n").`
Packit	be8974
Packit	be8974	`Here is a Perl idiom that allows the C<$text> variable to be declared`
Packit	be8974	`where it is assigned, with no need for an enclosing block. The C<do> block`
Packit	be8974	`will execute C<< <FH> >> in a scalar context and slurp in the file named by`
Packit	be8974	`C<$file>:`
Packit	be8974
Packit	be8974	`local( *FH ) ;`
Packit	be8974	`open( FH, $file ) or die "sudden flaming death\n" ;`
Packit	be8974	`my $text = do { local( $/ ) ; <FH> } ;`
Packit	be8974
Packit	be8974	`Both of those slurps used localized filehandles to be compatible with`
Packit	be8974	`5.005. Here they are with 5.6.0 lexical autovivified handles:`
Packit	be8974
Packit	be8974	`{`
Packit	be8974	`local( $/ ) ;`
Packit	be8974	`open( my $fh, $file ) or die "sudden flaming death\n" ;`
Packit	be8974	`$text = <$fh> ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`open( my $fh, $file ) or die "sudden flaming death\n" ;`
Packit	be8974	`my $text = do { local( $/ ) ; <$fh> } ;`
Packit	be8974
Packit	be8974	`And this is a variant of that idiom that removes the need for the open`
Packit	be8974	`call:`
Packit	be8974
Packit	be8974	`my $text = do { local( @ARGV, $/ ) = $file ; <> } ;`
Packit	be8974
Packit	be8974	`The filename in C<$file> is assigned to a localized C<@ARGV> and the`
Packit	be8974	`null filehandle is used which reads the data from the files in C<@ARGV>.`
Packit	be8974
Packit	be8974	`Instead of assigning to a scalar, all the above slurps can assign to an`
Packit	be8974	`array and it will get the file but split into lines (using C<$/> as the`
Packit	be8974	`end of line marker).`
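The traditional slurps above can be wrapped into one sub that follows the caller's context. This is only a sketch: the name C<slurp_traditional> is hypothetical, and it uses the three-argument form of C<open>, which is my choice for the example rather than anything the idioms above require.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical helper name; not part of any module discussed here.
sub slurp_traditional {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "can't open $file: $!" ;

    # List context: <$fh> splits on the current value of $/.
    return <$fh> if wantarray ;

    # Scalar context: undefine $/ for this scope and read everything.
    local $/ ;
    return scalar <$fh> ;
}

# Make a small scratch file to read back.
my( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "line one\nline two\n" ;
close $tmp_fh or die "can't close $tmp_name: $!" ;

my $text  = slurp_traditional( $tmp_name ) ;
my @lines = slurp_traditional( $tmp_name ) ;

printf "%d bytes, %d lines\n", length $text, scalar @lines ;
```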
Packit	be8974
Packit	be8974	`There is one common variant of those slurps which is very slow and not`
Packit	be8974	`good code. You see it around, and it is almost always cargo cult code:`
Packit	be8974
Packit	be8974	`my $text = join( '', <FH> ) ;`
Packit	be8974
Packit	be8974	`That needlessly splits the input file into lines (C<join> provides a`
Packit	be8974	`list context to C<< <FH> >>) and then joins up those lines again. The`
Packit	be8974	`original coder of this idiom obviously never read I<perlvar> and learned`
Packit	be8974	`how to use C<$/> to allow scalar slurping.`
Packit	be8974
Packit	be8974	`=head2 Write Slurping`
Packit	be8974
Packit	be8974	`While reading in entire files at one time is common, writing out entire`
Packit	be8974	`files is also done. We call it "slurping" when we read in files, but`
Packit	be8974	`there is no commonly accepted term for the write operation. I asked some`
Packit	be8974	`Perl colleagues and got two interesting nominations. Peter Scott said to`
Packit	be8974	`call it "burping" (rhymes with "slurping" and suggests movement in`
Packit	be8974	`the opposite direction). Others suggested "spewing" which has a`
Packit	be8974	`stronger visual image :-) Tell me your favorite or suggest your own. I`
Packit	be8974	`will use both in this section so you can see how they work for you.`
Packit	be8974
Packit	be8974	`Spewing a file is a much simpler operation than slurping. You don't have`
Packit	be8974	`context issues to worry about and there is no efficiency problem with`
Packit	be8974	`returning a buffer. Here is a simple burp subroutine:`
Packit	be8974
Packit	be8974	`sub burp {`
Packit	be8974	`my( $file_name ) = shift ;`
Packit	be8974	`open( my $fh, ">$file_name" ) ||`
Packit	be8974	`die "can't create $file_name $!" ;`
Packit	be8974	`print $fh @_ ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`Note that it doesn't copy the input text but passes @_ directly to`
Packit	be8974	`print. We will look at faster variations of that later on.`
Packit	be8974
Packit	be8974	`=head2 Slurp on the CPAN`
Packit	be8974
Packit	be8974	`As you would expect there are modules in the CPAN that will slurp files`
Packit	be8974	`for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on`
Packit	be8974	`CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).`
Packit	be8974
Packit	be8974	`Here is the code from Slurp.pm:`
Packit	be8974
Packit	be8974	`sub slurp {`
Packit	be8974	`local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );`
Packit	be8974	`return <ARGV>;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`sub to_array {`
Packit	be8974	`my @array = slurp( @_ );`
Packit	be8974	`return wantarray ? @array : \@array;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`sub to_scalar {`
Packit	be8974	`my $scalar = slurp( @_ );`
Packit	be8974	`return $scalar;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`The subroutine C<slurp()> uses the magic undefined value of C<$/> and`
Packit	be8974	`the magic file handle C<ARGV> to support slurping into a scalar or`
Packit	be8974	`array. It also provides two wrapper subs that allow the caller to`
Packit	be8974	`control the context of the slurp. And the C<to_array()> subroutine will`
Packit	be8974	`return the list of slurped lines or an anonymous array of them according`
Packit	be8974	`to its caller's context by checking C<wantarray>. It has 'slurp' in`
Packit	be8974	`C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.`
Packit	be8974
Packit	be8974
Packit	be8974	`namespace.>`
Packit	be8974
Packit	be8974	`The original File::Slurp.pm has this code:`
Packit	be8974
Packit	be8974	`sub read_file`
Packit	be8974	`{`
Packit	be8974	`my ($file) = @_;`
Packit	be8974
Packit	be8974	`local($/) = wantarray ? $/ : undef;`
Packit	be8974	`local(*F);`
Packit	be8974	`my $r;`
Packit	be8974	`my (@r);`
Packit	be8974
Packit	be8974	`open(F, "<$file") || croak "open $file: $!";`
Packit	be8974	`@r = <F>;`
Packit	be8974	`close(F) || croak "close $file: $!";`
Packit	be8974
Packit	be8974	`return $r[0] unless wantarray;`
Packit	be8974	`return @r;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`This module provides several subroutines including C<read_file()> (more`
Packit	be8974	`on the others later). C<read_file()> behaves similarly to`
Packit	be8974	`C<Slurp::slurp()> in that it will slurp a list of lines or a single`
Packit	be8974	`scalar depending on the caller's context. It also uses the magic`
Packit	be8974	`undefined value of C<$/> for scalar slurping but it uses an explicit`
Packit	be8974	`open call rather than using a localized C<@ARGV> as the other module`
Packit	be8974	`did. Also it doesn't provide a way to get an anonymous array of the`
Packit	be8974	`lines but that can easily be rectified by calling it inside an anonymous`
Packit	be8974	`array constructor C<[]>.`
Packit	be8974
Packit	be8974	`Both of these modules make it easier for Perl coders to slurp in`
Packit	be8974	`files. They both use the magic C<$/> to slurp in scalar mode and the`
Packit	be8974	`natural behavior of C<< <> >> in list context to slurp as lines. But`
Packit	be8974	`neither is optimized for speed nor can they handle C<binmode()> to`
Packit	be8974	`support binary or unicode files. See below for more on slurp features`
Packit	be8974	`and speedups.`
Packit	be8974
Packit	be8974	`=head2 Slurping API Design`
Packit	be8974
Packit	be8974	`The slurp modules on CPAN have a very simple API and don't support`
Packit	be8974	`C<binmode()>. This section will cover various API design issues such as`
Packit	be8974	`efficient return by reference, C<binmode()> and calling variations.`
Packit	be8974
Packit	be8974	`Let's start with the call variations. Slurped files can be returned in`
Packit	be8974	`four formats: as a single scalar, as a reference to a scalar, as a list`
Packit	be8974	`of lines or as an anonymous array of lines. But the caller can only`
Packit	be8974	`provide two contexts: scalar or list. So we have to either provide an`
Packit	be8974	`API with more than one subroutine (as Slurp.pm did) or just provide one`
Packit	be8974	`subroutine which only returns a scalar or a list (not an anonymous`
Packit	be8974	`array) as File::Slurp does.`
Packit	be8974
Packit	be8974	`I have used my own C<read_file()> subroutine for years and it has the`
Packit	be8974	`same API as File::Slurp: a single subroutine that returns a scalar or a`
Packit	be8974	`list of lines depending on context. But I recognize the interest of`
Packit	be8974	`those that want an anonymous array for line slurping. For one thing, it`
Packit	be8974	`is easier to pass around to other subs and for another, it eliminates`
Packit	be8974	`the extra copying of the lines via C<return>. So my module provides only`
Packit	be8974	`one slurp subroutine that returns the file data based on context and any`
Packit	be8974	`format options passed in. There is no need for a specific`
Packit	be8974	`slurp-in-as-a-scalar or list subroutine as the general C<read_file()>`
Packit	be8974	`sub will do that by default in the appropriate context. If you want`
Packit	be8974	`C<read_file()> to return a scalar reference or anonymous array of lines,`
Packit	be8974	`you can request those formats with options. You can even pass in a`
Packit	be8974	`reference to a scalar (e.g. a previously allocated buffer) and have that`
Packit	be8974	`filled with the slurped data (and that is one of the fastest slurp`
Packit	be8974	`modes; see the benchmark section for more on that). If you want to`
Packit	be8974	`slurp a scalar into an array, just select the desired array element and`
Packit	be8974	`that will provide scalar context to the C<read_file()> subroutine.`
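The calling variations described above can be sketched in a few lines. This is not the module's code: the sub name C<sketch_read_file> and the option names C<buf_ref>, C<scalar_ref> and C<array_ref> are assumptions made for the example, and the slurp itself uses the plain C<$/> idiom rather than anything fast.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical sketch of a context- and option-driven slurp sub.
sub sketch_read_file {
    my( $file_name, %args ) = @_ ;

    # Fill the caller's buffer if one was passed in, else a private one.
    my $buf ;
    my $buf_ref = $args{buf_ref} || \$buf ;

    open( my $fh, '<', $file_name ) or die "can't open $file_name: $!" ;
    ${$buf_ref} = do { local $/ ; <$fh> } ;
    ${$buf_ref} = '' unless defined ${$buf_ref} ;

    return if $args{buf_ref} ;                  # data is in caller's buffer
    return $buf_ref if $args{scalar_ref} ;      # no copy of the data
    return [ split /^/, ${$buf_ref} ] if $args{array_ref} ;
    return split /^/, ${$buf_ref} if wantarray ;
    return ${$buf_ref} ;
}

# Make a small scratch file to read back.
my( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh "a\nb\nc\n" ;
close $tmp_fh or die "can't close $tmp_name: $!" ;

my $text     = sketch_read_file( $tmp_name ) ;
my @lines    = sketch_read_file( $tmp_name ) ;
my $text_ref = sketch_read_file( $tmp_name, scalar_ref => 1 ) ;
my $line_ref = sketch_read_file( $tmp_name, array_ref => 1 ) ;

my $buffer ;
sketch_read_file( $tmp_name, buf_ref => \$buffer ) ;

print scalar( @lines ), " lines, ", length( $buffer ), " bytes\n" ;
```

One sub covers all four return formats, with context choosing between the two defaults and options selecting the reference forms.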
Packit	be8974
Packit	be8974	`The next area to cover is what to name the slurp sub. I will go with`
Packit	be8974	`C<read_file()>. It is descriptive, keeps compatibility with the`
Packit	be8974	`current simple API, and doesn't use the 'slurp' nickname (though that nickname`
Packit	be8974	`is in the module name). Also I decided to keep the File::Slurp`
Packit	be8974	`namespace which was graciously handed over to me by its current owner,`
Packit	be8974	`David Muir.`
Packit	be8974
Packit	be8974	`Another critical area when designing APIs is how to pass in`
Packit	be8974	`arguments. The C<read_file()> subroutine takes one required argument`
Packit	be8974	`which is the file name. To support C<binmode()> we need another optional`
Packit	be8974	`argument. A third optional argument is needed to support returning a`
Packit	be8974	`slurped scalar by reference. My first thought was to design the API with`
Packit	be8974	`3 positional arguments - file name, buffer reference and binmode. But if`
Packit	be8974	`you want to set the binmode and not pass in a buffer reference, you have`
Packit	be8974	`to fill the second argument with C<undef> and that is ugly. So I decided`
Packit	be8974	`to make the filename argument positional and the other two named. The`
Packit	be8974	`subroutine starts off like this:`
Packit	be8974
Packit	be8974	`sub read_file {`
Packit	be8974
Packit	be8974	`my( $file_name, %args ) = @_ ;`
Packit	be8974
Packit	be8974	`my $buf ;`
Packit	be8974	`my $buf_ref = $args{'buf'} || \$buf ;`
Packit	be8974
Packit	be8974	`The other sub (C<read_file_lines()>) will only take an optional binmode`
Packit	be8974	`(so you can read files with binary delimiters). It doesn't need a buffer`
Packit	be8974	`reference argument since it can return an anonymous array if called`
Packit	be8974	`in a scalar context. So this subroutine could use positional arguments,`
Packit	be8974	`but to keep its API similar to the API of C<read_file()>, it will also`
Packit	be8974	`use pass by name for the optional arguments. This also means that new`
Packit	be8974	`optional arguments can be added later without breaking any legacy`
Packit	be8974	`code. A bonus with keeping the API the same for both subs will be seen`
Packit	be8974	`in how the two subs are optimized to work together.`
Packit	be8974
Packit	be8974	`Write slurping (or spewing or burping :-)) needs to have its API`
Packit	be8974	`designed as well. The biggest issue is that it needs to support not only`
Packit	be8974	`optional arguments but also a list of data to be written. Perl`
Packit	be8974	`6 will be able to handle that with optional named arguments and a final`
Packit	be8974	`slurp argument. Since this is Perl 5 we have to do it using some`
Packit	be8974	`cleverness. The first argument is the file name and it will be`
Packit	be8974	`positional as with the C<read_file> subroutine. But how can we pass in`
Packit	be8974	`the optional arguments and also a list of data? The solution lies in the`
Packit	be8974	`fact that the data list should never contain a reference.`
Packit	be8974	`Burping/spewing works only on plain data. So if the next argument is a`
Packit	be8974	`hash reference, we can assume it contains the optional arguments and`
Packit	be8974	`the rest of the arguments is the data list. So the C<write_file()>`
Packit	be8974	`subroutine will start off like this:`
Packit	be8974
Packit	be8974	`sub write_file {`
Packit	be8974
Packit	be8974	`my $file_name = shift ;`
Packit	be8974
Packit	be8974	`my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;`
Packit	be8974
Packit	be8974	`Whether or not optional arguments are passed in, we leave the data list`
Packit	be8974	`in C<@_> to minimize any more copying. You call C<write_file()> like this:`
Packit	be8974
Packit	be8974	`write_file( 'foo', { binmode => ':raw' }, @data ) ;`
Packit	be8974	`write_file( 'junk', { append => 1 }, @more_junk ) ;`
Packit	be8974	`write_file( 'bar', @spew ) ;`
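Here is a runnable sketch of that argument handling. The sub name C<sketch_write_file> is hypothetical, only the C<append> option is honored, and the output goes through plain C<print> rather than whatever the module does internally.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

# Hypothetical sketch of the calling convention: file name first, then an
# optional hash ref of options, then the data list left in @_.
sub sketch_write_file {
    my $file_name = shift ;
    my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

    my $mode = $args->{append} ? '>>' : '>' ;
    open( my $fh, $mode, $file_name ) or die "can't open $file_name: $!" ;

    # Print straight from @_ so the data list is not copied.
    print {$fh} @_ ;
    close $fh or die "can't close $file_name: $!" ;
    return 1 ;
}

my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/junk" ;

sketch_write_file( $file, "first\n" ) ;
sketch_write_file( $file, { append => 1 }, "second\n", "third\n" ) ;

open( my $in, '<', $file ) or die "can't open $file: $!" ;
my $contents = do { local $/ ; <$in> } ;
print $contents ;
```

Because the data list must be plain data, a leading hash reference can only be the options, which is what makes the single C<ref> test sufficient.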
Packit	be8974
Packit	be8974	`=head2 Fast Slurping`
Packit	be8974
Packit	be8974	`Somewhere along the line, I learned about a way to slurp files faster`
Packit	be8974	`than by setting C<$/> to C<undef>. The method is very simple: you do a single`
Packit	be8974	`read call with the size of the file (which the -s operator provides).`
Packit	be8974	`This bypasses the I/O loop inside perl that checks for EOF and does all`
Packit	be8974	`sorts of processing. I then decided to experiment and found that`
Packit	be8974	`sysread is even faster as you would expect. sysread bypasses all of`
Packit	be8974	`Perl's stdio and reads the file from the kernel buffers directly into a`
Packit	be8974	`Perl scalar. This is why the slurp code in File::Slurp uses`
Packit	be8974	`sysopen/sysread/syswrite. All the rest of the code is just to support`
Packit	be8974	`the various options and data passing techniques.`
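A bare-bones version of the technique looks something like the following. It is a sketch under assumptions, not the module's code: the sub name is hypothetical, there is no option handling, and the loop is there only because C<sysread> may return fewer bytes than requested.

```perl
use strict ;
use warnings ;
use Fcntl qw( O_RDONLY ) ;
use File::Temp qw( tempfile ) ;

# Hypothetical sketch of slurping with sysopen/sysread and the -s size.
sub sketch_sysread_file {
    my( $file_name ) = @_ ;

    sysopen( my $fh, $file_name, O_RDONLY )
        or die "can't open $file_name: $!" ;

    my $size_left = -s $fh ;
    my $buf = '' ;

    # Ask for the whole file at once; loop in case of a short read.
    while ( $size_left > 0 ) {
        my $read_cnt = sysread( $fh, $buf, $size_left, length $buf ) ;
        die "read error in $file_name: $!" unless defined $read_cnt ;
        last if $read_cnt == 0 ;
        $size_left -= $read_cnt ;
    }

    return $buf ;
}

# Scratch file with the same shape as the long benchmark file.
my( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 ) ;
print $tmp_fh ( 'abc' x 40 . "\n" ) x 1000 ;
close $tmp_fh or die "can't close $tmp_name: $!" ;

my $text = sketch_sysread_file( $tmp_name ) ;
print length( $text ), " bytes slurped\n" ;
```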
Packit	be8974
Packit	be8974
Packit	be8974	`=head2 Benchmarks`
Packit	be8974
Packit	be8974	`Benchmarks can be enlightening, informative, frustrating and`
Packit	be8974	`deceiving. It would make no sense to create a new and more complex slurp`
Packit	be8974	`module unless it also gained significantly in speed. So I created a`
Packit	be8974	`benchmark script which compares various slurp methods with differing`
Packit	be8974	`file sizes and calling contexts. This script can be run from the main`
Packit	be8974	`directory of the tarball like this:`
Packit	be8974
Packit	be8974	`perl -Ilib extras/slurp_bench.pl`
Packit	be8974
Packit	be8974	`If you pass in an argument on the command line, it will be passed to`
Packit	be8974	`timethese() and it will control the duration. It defaults to -2 which`
Packit	be8974	`makes each benchmark run to at least 2 seconds of cpu time.`
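For readers who have not used the Benchmark module, here is a tiny sketch of how a count is handed to C<timethese()>. It is not the benchmark script: the two entries, their names and the in-memory data are assumptions for illustration, and the default count is a small positive number so the sketch finishes quickly.

```perl
use strict ;
use warnings ;
use Benchmark qw( timethese ) ;

# Hypothetical stand-in data read through an in-memory filehandle.
my $data = join( '', ( 'abc' x 30 . "\n" ) x 100 ) ;

# A positive count runs each entry that many times. A negative count such
# as -2 instead runs each entry for at least 2 CPU seconds.
my $count = @ARGV ? shift @ARGV : 500 ;

my $results = timethese( $count, {
    join_lines => sub {
        open( my $fh, '<', \$data ) or die "can't open in-memory file: $!" ;
        my $text = join( '', <$fh> ) ;
    },
    undef_rs => sub {
        open( my $fh, '<', \$data ) or die "can't open in-memory file: $!" ;
        my $text = do { local $/ ; <$fh> } ;
    },
} ) ;
```

With so few iterations Benchmark may warn that the counts are unreliable, which is the reason a CPU-time value is the better default for real measurements.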
Packit	be8974
Packit	be8974	`The following numbers are from a run I did on my 300MHz sparc. You will`
Packit	be8974	`most likely get much faster counts on your boxes but the relative speeds`
Packit	be8974	`shouldn't change by much. If you see major differences on your`
Packit	be8974	`benchmarks, please send me the results and your Perl and OS`
Packit	be8974	`versions. Also you can play with the benchmark script and add more slurp`
Packit	be8974	`variations or data files.`
Packit	be8974
Packit	be8974	`The rest of this section will be discussing the results of the`
Packit	be8974	`benchmarks. You can refer to extras/slurp_bench.pl to see the code for`
Packit	be8974	`the individual benchmarks. If the benchmark name starts with cpan_, it`
Packit	be8974	`is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are`
Packit	be8974	`from the new File::Slurp.pm. Those that start with file_contents_ are`
Packit	be8974	`from a client's code base. The rest are variations I created to`
Packit	be8974	`highlight certain aspects of the benchmarks.`
Packit	be8974
Packit	be8974	`The short and long file data is made like this:`
Packit	be8974
Packit	be8974	`my @lines = ( 'abc' x 30 . "\n") x 100 ;`
Packit	be8974	`my $text = join( '', @lines ) ;`
Packit	be8974
Packit	be8974	`@lines = ( 'abc' x 40 . "\n") x 1000 ;`
Packit	be8974	`$text = join( '', @lines ) ;`
Packit	be8974
Packit	be8974	`So the short file is 9,100 bytes and the long file is 121,000`
Packit	be8974	`bytes.`
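Those sizes are easy to check, since each short line is 3 * 30 + 1 = 91 bytes and each long line is 3 * 40 + 1 = 121 bytes. The variable names below are mine, chosen so both strings can exist at once:

```perl
use strict ;
use warnings ;

# 'x' binds tighter than '.', so each element is the repeated string plus "\n".
my $short_text = join( '', ( 'abc' x 30 . "\n" ) x 100 ) ;
my $long_text  = join( '', ( 'abc' x 40 . "\n" ) x 1000 ) ;

printf "short: %d bytes\n", length $short_text ;    # 91 bytes * 100 lines
printf "long:  %d bytes\n", length $long_text ;     # 121 bytes * 1000 lines
```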
Packit	be8974
Packit	be8974	`=head3 Scalar Slurp of Short File`
Packit	be8974
Packit	be8974	`file_contents 651/s`
Packit	be8974	`file_contents_no_OO 828/s`
Packit	be8974	`cpan_read_file 1866/s`
Packit	be8974	`cpan_slurp 1934/s`
Packit	be8974	`read_file 2079/s`
Packit	be8974	`new 2270/s`
Packit	be8974	`new_buf_ref 2403/s`
Packit	be8974	`new_scalar_ref 2415/s`
Packit	be8974	`sysread_file 2572/s`
Packit	be8974
Packit	be8974	`=head3 Scalar Slurp of Long File`
Packit	be8974
Packit	be8974	`file_contents_no_OO 82.9/s`
Packit	be8974	`file_contents 85.4/s`
Packit	be8974	`cpan_read_file 250/s`
Packit	be8974	`cpan_slurp 257/s`
Packit	be8974	`read_file 323/s`
Packit	be8974	`new 468/s`
Packit	be8974	`sysread_file 489/s`
Packit	be8974	`new_scalar_ref 766/s`
Packit	be8974	`new_buf_ref 767/s`
Packit	be8974
Packit	be8974	`The primary inference you get from looking at the numbers above is that`
Packit	be8974	`when slurping a file into a scalar, the longer the file, the more time`
Packit	be8974	`you save by returning the result via a scalar reference. The time for`
Packit	be8974	`the extra buffer copy can add up. The new module came out on top overall`
Packit	be8974	`except for the very simple sysread_file entry which was added to`
Packit	be8974	`highlight the overhead of the more flexible new module which isn't that`
Packit	be8974	`much. The file_contents entries are always the worst since they do a`
Packit	be8974	`list slurp and then a join, which is a classic newbie and cargo culted`
Packit	be8974	`style which is extremely slow. Also the OO code in file_contents slows`
Packit	be8974	`it down even more (I added the file_contents_no_OO entry to show this).`
Packit	be8974	`The two CPAN modules are decent with small files but they are laggards`
Packit	be8974	`compared to the new module when the file gets much larger.`
Packit	be8974
Packit	be8974	`=head3 List Slurp of Short File`
Packit	be8974
Packit	be8974	`cpan_read_file 589/s`
Packit	be8974	`cpan_slurp_to_array 620/s`
Packit	be8974	`read_file 824/s`
Packit	be8974	`new_array_ref 824/s`
Packit	be8974	`sysread_file 828/s`
Packit	be8974	`new 829/s`
Packit	be8974	`new_in_anon_array 833/s`
Packit	be8974	`cpan_slurp_to_array_ref 836/s`
Packit	be8974
Packit	be8974	`=head3 List Slurp of Long File`
Packit	be8974
Packit	be8974	`cpan_read_file 62.4/s`
Packit	be8974	`cpan_slurp_to_array 62.7/s`
Packit	be8974	`read_file 92.9/s`
Packit	be8974	`sysread_file 94.8/s`
Packit	be8974	`new_array_ref 95.5/s`
Packit	be8974	`new 96.2/s`
Packit	be8974	`cpan_slurp_to_array_ref 96.3/s`
Packit	be8974	`new_in_anon_array 97.2/s`
Packit	be8974
Packit	be8974	`This is perhaps the most interesting result of this benchmark. Five`
Packit	be8974	`different entries have effectively tied for the lead. The logical`
Packit	be8974	`conclusion is that splitting the input into lines is the bounding`
Packit	be8974	`operation, no matter how the file gets slurped. This is the only`
Packit	be8974	`benchmark where the new module isn't the clear winner (in the long file`
Packit	be8974	`entries - it is no worse than a close second in the short file`
Packit	be8974	`entries).`
Packit	be8974
Packit	be8974
Packit	be8974	`Note: In the benchmark information for all the spew entries, the extra`
Packit	be8974	`number at the end of each line is how many wallclock seconds the whole`
Packit	be8974	`entry took. The benchmarks were run for at least 2 CPU seconds per`
Packit	be8974	`entry. The unusually large wallclock times will be discussed below.`
Packit	be8974
Packit	be8974	`=head3 Scalar Spew of Short File`
Packit	be8974
Packit	be8974	`cpan_write_file 1035/s 38`
Packit	be8974	`print_file 1055/s 41`
Packit	be8974	`syswrite_file 1135/s 44`
Packit	be8974	`new 1519/s 2`
Packit	be8974	`print_join_file 1766/s 2`
Packit	be8974	`new_ref 1900/s 2`
Packit	be8974	`syswrite_file2 2138/s 2`
Packit	be8974
Packit	be8974	`=head3 Scalar Spew of Long File`
Packit	be8974
Packit	be8974	`cpan_write_file 164/s 20`
Packit	be8974	`print_file 211/s 26`
Packit	be8974	`syswrite_file 236/s 25`
Packit	be8974	`print_join_file 277/s 2`
Packit	be8974	`new 295/s 2`
Packit	be8974	`syswrite_file2 428/s 2`
Packit	be8974	`new_ref 608/s 2`
Packit	be8974
Packit	be8974	`In the scalar spew entries, the new module API wins when it is passed a`
Packit	be8974	`reference to the scalar buffer. The C<syswrite_file2> entry beats it`
Packit	be8974	`with the shorter file due to its simpler code. The old CPAN module is`
Packit	be8974	`the slowest due to its extra copying of the data and its use of print.`
Packit	be8974
Packit	be8974	`=head3 List Spew of Short File`
Packit	be8974
Packit	be8974	`cpan_write_file 794/s 29`
Packit	be8974	`syswrite_file 1000/s 38`
Packit	be8974	`print_file 1013/s 42`
Packit	be8974	`new 1399/s 2`
Packit	be8974	`print_join_file 1557/s 2`
Packit	be8974
Packit	be8974	`=head3 List Spew of Long File`
Packit	be8974
Packit	be8974	`cpan_write_file 112/s 12`
Packit	be8974	`print_file 179/s 21`
Packit	be8974	`syswrite_file 181/s 19`
Packit	be8974	`print_join_file 205/s 2`
Packit	be8974	`new 228/s 2`
Packit	be8974
Packit	be8974	`Again, the simple C<print_join_file> entry beats the new module when`
Packit	be8974	`spewing a short list of lines to a file. But it loses to the new module`
Packit	be8974	`when the file size gets longer. The old CPAN module lags behind the`
Packit	be8974	`others since it first makes an extra copy of the lines and then it calls`
Packit	be8974	`C<print> on the output list and that is much slower than passing to`
Packit	be8974	`C<print> a single scalar generated by join. The C<print_file> entry`
Packit	be8974	`shows the advantage of directly printing C<@_> and the`
Packit	be8974	`C<print_join_file> adds the join optimization.`
Packit	be8974
Packit	be8974	`Now about those long wallclock times. If you look carefully at the`
Packit	be8974	`benchmark code of all the spew entries, you will find that some always`
Packit	be8974	`write to new files and some overwrite existing files. When I asked David`
Packit	be8974	`Muir why the old File::Slurp module had an C<overwrite> subroutine, he`
Packit	be8974	`answered that by overwriting a file, you always guarantee something`
Packit	be8974	`readable is in the file. If you create a new file, there is a moment`
Packit	be8974	`when the new file is created but has no data in it. I feel this is not a`
Packit	be8974	`good enough answer. Even when overwriting, you can write a shorter file`
Packit	be8974	`than the existing file and then you have to truncate the file to the new`
Packit	be8974	`size. There is a small race window there where another process can slurp`
Packit	be8974	`in the file with the new data followed by leftover junk from the`
Packit	be8974	`previous version of the file. This reinforces the point that the only`
Packit	be8974	`way to ensure consistent file data is the proper use of file locks.`
Packit	be8974
Packit	be8974	`But what about those long times? Well it is all about the difference`
Packit	be8974	`between creating files and overwriting existing ones. The former has to`
Packit	be8974	`allocate new inodes (or the equivalent on other file systems) and the`
Packit	be8974	`latter can reuse the existing inode. This means the overwrite will save on`
Packit	be8974	`disk seeks as well as on cpu time. In fact when running this benchmark,`
Packit	be8974	`I could hear my disk going crazy allocating inodes during the spew`
Packit	be8974	`operations. This speedup in both cpu and wallclock is why the new module`
Packit	be8974	`always does overwriting when spewing files. It also does the proper`
Packit	be8974	`truncate (and this is checked in the tests by spewing shorter files`
Packit	be8974	`after longer ones had previously been written). The C<overwrite>`
Packit	be8974	`subroutine is just a typeglob alias to C<write_file> and is there for`
Packit	be8974	`backwards compatibility with the old File::Slurp module.`
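
The overwrite-then-truncate approach and the typeglob alias can be sketched as follows. This is an illustration under assumptions, not the module's actual code: C<sketch_write_file> and C<sketch_overwrite> are hypothetical names, and error handling is reduced to C<die>.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;

# Hypothetical helper: overwrite a file in place, then truncate it.
sub sketch_write_file {

	my( $file_name, @data ) = @_ ;

	my $buf = join '', @data ;

	# O_CREAT without O_TRUNC: an existing file (and its inode) is reused.
	sysopen( my $fh, $file_name, O_WRONLY | O_CREAT ) or
		die "Can't open $file_name: $!" ;

	my $size_left = length( $buf ) ;
	my $offset = 0 ;

	while( $size_left > 0 ) {

		my $write_cnt = syswrite( $fh, $buf, $size_left, $offset ) ;
		die "write error in file $file_name: $!" unless $write_cnt ;

		$size_left -= $write_cnt ;
		$offset += $write_cnt ;
	}

	# Cut off any leftover bytes from a longer previous version.
	truncate( $fh, $offset ) or die "Can't truncate $file_name: $!" ;
	close( $fh ) or die "Can't close $file_name: $!" ;
	return 1 ;
}

# A typeglob alias: both names now refer to the same subroutine.
*sketch_overwrite = \&sketch_write_file ;
```

Spewing a shorter file after a longer one is the case the truncate call exists for; without it the tail of the old data would remain in the file.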
Packit	be8974
Packit	be8974	`=head3 Benchmark Conclusion`
Packit	be8974
Packit	be8974	`Other than a few cases where a simpler entry beat it out, the new`
Packit	be8974	`File::Slurp module is either the speed leader or among the leaders. Its`
Packit	be8974	`special APIs for passing buffers by reference prove to be very useful`
Packit	be8974	`speedups. Also it uses all the other optimizations including using`
Packit	be8974	`C<sysread/syswrite> and joining output lines. I expect many projects`
Packit	be8974	`that extensively use slurping will notice the speed improvements,`
Packit	be8974	`especially if they rewrite their code to take advantage of the new API`
Packit	be8974	`features. Even if they don't touch their code and use the simple API`
Packit	be8974	`they will get a significant speedup.`
Packit	be8974
Packit	be8974	`=head2 Error Handling`
Packit	be8974
Packit	be8974	`Slurp subroutines are subject to conditions such as not being able to`
Packit	be8974	`open the file, or I/O errors. How these errors are handled, and what the`
Packit	be8974	`caller will see, are important aspects of the design of an API. The`
Packit	be8974	`classic error handling for slurping has been to call C<die()> or even`
Packit	be8974	`better, C<croak()>. But sometimes you want the slurp to either`
Packit	be8974	`C<warn()>/C<carp()> or allow your code to handle the error. Sure, this`
Packit	be8974	`can be done by wrapping the slurp in an C<eval> block to catch a fatal`
Packit	be8974	`error, but not everyone wants all that extra code. So I have added`
Packit	be8974	`another option to all the subroutines which selects the error`
Packit	be8974	`handling. If the 'err_mode' option is 'croak' (which is also the`
Packit	be8974	`default), the called subroutine will croak. If set to 'carp' then carp`
Packit	be8974	`will be called. Set to any other string (use 'quiet' when you want to`
Packit	be8974	`be explicit) and no error handler is called. Then the caller can use the`
Packit	be8974	`error status from the call.`
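
One way such an option could be dispatched is sketched below. The helper names C<sketch_error> and C<sketch_slurp> are hypothetical, and this is not the module's internal code; it only illustrates the croak, carp and quiet behaviors described above for a scalar-context caller.

```perl
use strict ;
use warnings ;
use Carp ;

# Hypothetical helper: act on the 'err_mode' option and hand back
# a false status for the caller to return.
sub sketch_error {

	my( $args, $err_msg ) = @_ ;

	# 'croak' is the default mode.
	my $err_mode = $args->{'err_mode'} || 'croak' ;

	croak $err_msg if $err_mode eq 'croak' ;
	carp $err_msg if $err_mode eq 'carp' ;

	# Any other mode (e.g. 'quiet') reports nothing.
	return ;
}

# Hypothetical slurp that uses the helper when the open fails.
sub sketch_slurp {

	my( $file_name, %args ) = @_ ;

	open( my $fh, '<', $file_name ) or
		return sketch_error( \%args, "Can't open $file_name: $!" ) ;

	local $/ ;
	my $data = <$fh> ;
	return $data ;
}
```

With C<< err_mode => 'quiet' >> a scalar-context call on an unreadable file simply yields an undefined value, which the caller can test for.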
Packit	be8974
Packit	be8974	`C<write_file()> doesn't use the return value for data so it can return a`
Packit	be8974	`false status value in-band to mark an error. C<read_file()> does use its`
Packit	be8974	`return value for data, but we can still make it pass back the error`
Packit	be8974	`status. A successful read in any scalar mode will return either a`
Packit	be8974	`defined data string or a reference to a scalar or array. So a bare`
Packit	be8974	`return would work here. But if you slurp in lines by calling it in a`
Packit	be8974	`list context, a bare C<return> will return an empty list, which is the`
Packit	be8974	`same value it would get from an existing but empty file. So now,`
Packit	be8974	`C<read_file()> will do something I normally strongly advocate against,`
Packit	be8974	`i.e., returning an explicit C<undef> value. In the scalar context this`
Packit	be8974	`still returns an error, and in list context, the returned first value`
Packit	be8974	`will be C<undef>, and that is not legal data for the first element. So`
Packit	be8974	`the list context also gets an error status it can detect:`
Packit	be8974
Packit	be8974	`my @lines = read_file( $file_name, err_mode => 'quiet' ) ;`
Packit	be8974	`your_handle_error( "$file_name can't be read\n" ) unless`
Packit	be8974	`@lines && defined $lines[0] ;`
Packit	be8974
Packit	be8974
Packit	be8974	`=head2 File::FastSlurp`
Packit	be8974
Packit	be8974	`use Carp ;`
Packit	be8974	`use Fcntl qw( :DEFAULT ) ;`
Packit	be8974
Packit	be8974	`sub read_file {`
Packit	be8974
Packit	be8974	`my( $file_name, %args ) = @_ ;`
Packit	be8974
Packit	be8974	`my $buf ;`
Packit	be8974	`my $buf_ref = $args{'buf_ref'} || \$buf ;`
Packit	be8974
Packit	be8974	`my $mode = O_RDONLY ;`
Packit	be8974	`$mode |= O_BINARY if $args{'binmode'} ;`
Packit	be8974
Packit	be8974	`local( *FH ) ;`
Packit	be8974	`sysopen( FH, $file_name, $mode ) or do {`
Packit	be8974	`carp "Can't open $file_name: $!" ;`
Packit	be8974	`return ;`
Packit	be8974	`} ;`
Packit	be8974
Packit	be8974	`my $size_left = -s FH ;`
Packit	be8974
Packit	be8974	`while( $size_left > 0 ) {`
Packit	be8974
Packit	be8974	`my $read_cnt = sysread( FH, ${$buf_ref},`
Packit	be8974	`$size_left, length ${$buf_ref} ) ;`
Packit	be8974
Packit	be8974	`unless( $read_cnt ) {`
Packit	be8974
Packit	be8974	`carp "read error in file $file_name: $!" ;`
Packit	be8974	`last ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`$size_left -= $read_cnt ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`# handle void context (return scalar by buffer reference)`
Packit	be8974
Packit	be8974	`return unless defined wantarray ;`
Packit	be8974
Packit	be8974	`# handle list context`
Packit	be8974
Packit	be8974	`return split m|(?<=$/)|g, ${$buf_ref} if wantarray ;`
Packit	be8974
Packit	be8974	`# handle scalar context`
Packit	be8974
Packit	be8974	`return ${$buf_ref} ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`sub write_file {`
Packit	be8974
Packit	be8974	`my $file_name = shift ;`
Packit	be8974
Packit	be8974	`my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;`
Packit	be8974	`my $buf = join '', @_ ;`
Packit	be8974
Packit	be8974
Packit	be8974	`my $mode = O_WRONLY | O_CREAT ;`
Packit	be8974	`$mode |= O_BINARY if $args->{'binmode'} ;`
Packit	be8974	`$mode |= O_APPEND if $args->{'append'} ;`
Packit	be8974
Packit	be8974	`local( *FH ) ;`
Packit	be8974	`sysopen( FH, $file_name, $mode ) or do {`
Packit	be8974	`carp "Can't open $file_name: $!" ;`
Packit	be8974	`return ;`
Packit	be8974	`} ;`
Packit	be8974
Packit	be8974	`my $size_left = length( $buf ) ;`
Packit	be8974	`my $offset = 0 ;`
Packit	be8974
Packit	be8974	`while( $size_left > 0 ) {`
Packit	be8974
Packit	be8974	`my $write_cnt = syswrite( FH, $buf,`
Packit	be8974	`$size_left, $offset ) ;`
Packit	be8974
Packit	be8974	`unless( $write_cnt ) {`
Packit	be8974
Packit	be8974	`carp "write error in file $file_name: $!" ;`
Packit	be8974	`last ;`
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`$size_left -= $write_cnt ;`
Packit	be8974	`$offset += $write_cnt ;`
Packit	be8974	`}`
Packit	be8974
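Packit	be8974	`# drop leftover bytes when overwriting a longer file`
Packit	be8974	`truncate( FH, $offset ) unless $args->{'append'} ;`
Packit	be8974
Packit	be8974	`return ;`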
Packit	be8974	`}`
Packit	be8974
Packit	be8974	`=head2 Slurping in Perl 6`
Packit	be8974
Packit	be8974	`As usual with Perl 6, much of the work in this article will be put to`
Packit	be8974	`pasture. Perl 6 will allow you to set a 'slurp' property on file handles`
Packit	be8974	`and when you read from such a handle, the file is slurped. List and`
Packit	be8974	`scalar context will still be supported so you can slurp into lines or a`
Packit	be8974	`scalar. I would expect that support for slurping in Perl 6 will be`
Packit	be8974	`optimized and bypass the stdio subsystem since it can use the slurp`
Packit	be8974	`property to trigger a call to special code. Otherwise some enterprising`
Packit	be8974	`individual will just create a File::FastSlurp module for Perl 6. The`
Packit	be8974	`code in the Perl 5 module could easily be modified to Perl 6 syntax and`
Packit	be8974	`semantics. Any volunteers?`
Packit	be8974
Packit	be8974	`=head2 In Summary`
Packit	be8974
Packit	be8974	`We have compared classic line by line processing with munging a whole`
Packit	be8974	`file in memory. Slurping files can speed up your programs and simplify`
Packit	be8974	`your code if done properly. You must still take care not to slurp`
Packit	be8974	`humongous files (logs, DNA sequences, etc.) or STDIN where you don't`
Packit	be8974	`know how much data you will read in. But slurping megabyte-sized files`
Packit	be8974	`is not a major issue on today's systems with the typical amount of RAM`
Packit	be8974	`installed. When Perl was first being used in depth (Perl 4), slurping`
Packit	be8974	`was limited by the smaller RAM size of 10 years ago. Now, you should be`
Packit	be8974	`able to slurp almost any reasonably sized file, whether it contains`
Packit	be8974	`configuration, source code, data, etc.`
Packit	be8974
Packit	be8974	`=head2 Acknowledgements`
Packit	be8974
Packit	be8974
Packit	be8974
Packit	be8974
Packit	be8974
