Blame tune/README

Packit 5c3484
Copyright 2000-2002, 2004 Free Software Foundation, Inc.
Packit 5c3484
Packit 5c3484
This file is part of the GNU MP Library.
Packit 5c3484
Packit 5c3484
The GNU MP Library is free software; you can redistribute it and/or modify
Packit 5c3484
it under the terms of either:
Packit 5c3484
Packit 5c3484
  * the GNU Lesser General Public License as published by the Free
Packit 5c3484
    Software Foundation; either version 3 of the License, or (at your
Packit 5c3484
    option) any later version.
Packit 5c3484
Packit 5c3484
or
Packit 5c3484
Packit 5c3484
  * the GNU General Public License as published by the Free Software
Packit 5c3484
    Foundation; either version 2 of the License, or (at your option) any
Packit 5c3484
    later version.
Packit 5c3484
Packit 5c3484
or both in parallel, as here.
Packit 5c3484
Packit 5c3484
The GNU MP Library is distributed in the hope that it will be useful, but
Packit 5c3484
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
Packit 5c3484
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
Packit 5c3484
for more details.
Packit 5c3484
Packit 5c3484
You should have received copies of the GNU General Public License and the
Packit 5c3484
GNU Lesser General Public License along with the GNU MP Library.  If not,
Packit 5c3484
see https://www.gnu.org/licenses/.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
               GMP SPEED MEASURING AND PARAMETER TUNING
Packit 5c3484
Packit 5c3484
Packit 5c3484
The programs in this directory are for knowledgeable users who want to
Packit 5c3484
measure GMP routines on their machine, and perhaps tweak some settings or
Packit 5c3484
identify things that can be improved.
Packit 5c3484
Packit 5c3484
The programs here are tools, not ready to run solutions.  Nothing is built
Packit 5c3484
in a normal "make all", but various Makefile targets described below exist.
Packit 5c3484
Packit 5c3484
Relatively few systems and CPUs have been tested, so be sure to verify that
Packit 5c3484
results are sensible before relying on them.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
MISCELLANEOUS NOTES
Packit 5c3484
Packit 5c3484
--enable-assert
Packit 5c3484
Packit 5c3484
    Don't configure with --enable-assert, since the extra code added by
Packit 5c3484
    assertion checking may influence measurements.
Packit 5c3484
Packit 5c3484
Direct mapped caches
Packit 5c3484
Packit 5c3484
    Some effort has been made to accommodate CPUs with direct mapped caches,
Packit 5c3484
    by putting data blocks more or less contiguously on the stack.  But this
Packit 5c3484
    will depend on TMP_ALLOC using alloca, and even then it may or may not
Packit 5c3484
    be enough.
Packit 5c3484
Packit 5c3484
FreeBSD 4.2 i486 getrusage
Packit 5c3484
Packit 5c3484
    This getrusage seems to be a bit doubtful, it looks like it's
Packit 5c3484
    microsecond accurate, but sometimes ru_utime remains unchanged after a
Packit 5c3484
    time of many microseconds has elapsed.  It'd be good to detect this in
Packit 5c3484
    the time.c initializations, but for now the suggestion is to pretend it
Packit 5c3484
    doesn't exist.
Packit 5c3484
Packit 5c3484
        ./configure ac_cv_func_getrusage=no
Packit 5c3484
Packit 5c3484
NetBSD 1.4.1 m68k macintosh time base
Packit 5c3484
Packit 5c3484
    On this system it's been found getrusage often goes backwards, making it
Packit 5c3484
    unusable (time.c getrusage_backwards_p detects this).  gettimeofday
Packit 5c3484
    sometimes doesn't update atomically when it crosses a 1 second boundary.
Packit 5c3484
    Not sure what to do about this.  Expect possible intermittent failures.
Packit 5c3484
Packit 5c3484
SCO OpenUNIX 8 /etc/hw
Packit 5c3484
Packit 5c3484
    /etc/hw takes about a second to return the cpu frequency, which suggests
Packit 5c3484
    perhaps it's measuring each time it runs.  If this is annoying when
Packit 5c3484
    running the speed program repeatedly then set a GMP_CPU_FREQUENCY
Packit 5c3484
    environment variable (see TIME BASE section below).
Packit 5c3484
Packit 5c3484
Timing on GNU/Linux
Packit 5c3484
Packit 5c3484
    On Linux, timing currently uses the cycle counter. This is unreliable,
Packit 5c3484
    since the counter is not saved and restored at context switches (unlike
Packit 5c3484
    FreeBSD and Solaris where the cycle counter is "virtualized").
Packit 5c3484
Packit 5c3484
    Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
Packit 5c3484
    CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
Packit 5c3484
    with glibc, one has to link with -lrt (which also drags in the pthreads
Packit 5c3484
    threading library). configure.in must be hacked to detect this and
Packit 5c3484
    arrange proper linking. Something like
Packit 5c3484
Packit 5c3484
      old_LIBS="$LIBS"
Packit 5c3484
      AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
Packit 5c3484
      TUNE_LIBS="$LIBS"
Packit 5c3484
      LIBS="$old_LIBS"
Packit 5c3484
Packit 5c3484
      AC_SUBST(TUNE_LIBS)
Packit 5c3484
Packit 5c3484
    might work.
Packit 5c3484
Packit 5c3484
Low resolution timebase
Packit 5c3484
Packit 5c3484
    Parameter tuning can be very time consuming if the only timebase
Packit 5c3484
    available is a 10 millisecond clock tick, to the point of being
Packit 5c3484
    unusable.  This is currently the case on VAX and ARM systems.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
PARAMETER TUNING
Packit 5c3484
Packit 5c3484
The "tuneup" program runs some tests designed to find the best settings for
Packit 5c3484
various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
Packit 5c3484
into gmp-mparam.h.  The program is built and run with
Packit 5c3484
Packit 5c3484
        make tune
Packit 5c3484
Packit 5c3484
If the thresholds indicated are grossly different from the values in the
Packit 5c3484
selected gmp-mparam.h then there may be a performance boost in applicable
Packit 5c3484
size ranges by changing gmp-mparam.h accordingly.
Packit 5c3484
Packit 5c3484
Be sure to do a full reconfigure and rebuild to get any newly set thresholds
Packit 5c3484
to take effect.  A partial rebuild is enough sometimes, but a fresh
Packit 5c3484
configure and make is certain to be correct.
Packit 5c3484
Packit 5c3484
If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
Packit 5c3484
the mpn subdirectories then the values from "make tune" should be similar.
Packit 5c3484
But check that the configured CPU is right and there are no machine specific
Packit 5c3484
effects causing a difference.
Packit 5c3484
Packit 5c3484
It's hoped the compiler and options used won't have too much effect on
Packit 5c3484
thresholds, since for most CPUs they ultimately come down to comparisons
Packit 5c3484
between assembler subroutines.  Missing out on the longlong.h macros by not
Packit 5c3484
using gcc will probably have an effect.
Packit 5c3484
Packit 5c3484
Some thresholds produced by the tune program are merely single values chosen
Packit 5c3484
from what's a range of sizes where two algorithms are pretty much the same
Packit 5c3484
speed.  When this happens the program is likely to give somewhat different
Packit 5c3484
values on successive runs.  This is noticeable on the toom3 thresholds for
Packit 5c3484
instance.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
SPEED PROGRAM
Packit 5c3484
Packit 5c3484
The "speed" program can be used for measuring and comparing various
Packit 5c3484
routines, and producing tables of data or gnuplot graphs.  Compile it with
Packit 5c3484
Packit 5c3484
	make speed
Packit 5c3484
Packit 5c3484
(Or on DOS systems "make speed.exe".)
Packit 5c3484
Packit 5c3484
Here are some examples of how to use it.  Check the code for all the
Packit 5c3484
options.
Packit 5c3484
Packit 5c3484
Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
Packit 5c3484
(whichever is greater).
Packit 5c3484
Packit 5c3484
        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
Packit 5c3484
	gnuplot foo.gnuplot
Packit 5c3484
Packit 5c3484
Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
Packit 5c3484
showing under mpn_lshift the difference between it and mpn_add_n.
Packit 5c3484
Packit 5c3484
	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
Packit 5c3484
Packit 5c3484
Using option -c for times in cycles is interesting but normally only
Packit 5c3484
necessary when looking carefully at assembler subroutines.  You might think
Packit 5c3484
it would always give an integer value, but this doesn't happen in practice,
Packit 5c3484
probably due to overheads in the time measurements.
Packit 5c3484
Packit 5c3484
In the free-form output the "#" symbol against a measurement means the
Packit 5c3484
corresponding routine is fastest at that size.  This is a convenient visual
Packit 5c3484
cue when comparing different routines.  The graph data files <name>.data
Packit 5c3484
don't get this since it would upset gnuplot or other data viewers.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
TIME BASE
Packit 5c3484
Packit 5c3484
The time measuring method is determined in time.c, based on what the
Packit 5c3484
configured host has available.  A cycle counter is preferred, possibly
Packit 5c3484
supplemented by another method if the counter has a limited range.  A
Packit 5c3484
microsecond accurate getrusage() or gettimeofday() will work quite well too.
Packit 5c3484
Packit 5c3484
The cycle counters (except possibly on alpha) and gettimeofday() will depend
Packit 5c3484
on the machine being otherwise idle, or rather on other jobs not stealing
Packit 5c3484
CPU time from the measuring program.  Short routines (those that complete
Packit 5c3484
within a timeslice) should work even on a busy machine.
Packit 5c3484
Packit 5c3484
Some trouble is taken by speed_measure() in common.c to avoid ill effects
Packit 5c3484
from sporadic interrupts, or other intermittent things (like cron waking up
Packit 5c3484
every minute).  But generally an idle machine will be necessary to be
Packit 5c3484
certain of consistent results.
Packit 5c3484
Packit 5c3484
The CPU frequency is needed to convert between cycles and seconds, or for
Packit 5c3484
when a cycle counter is supplemented by getrusage() etc.  The speed program
Packit 5c3484
will convert as necessary according to the output format requested.  The
Packit 5c3484
tune program will work with either cycles or seconds.
Packit 5c3484
Packit 5c3484
freq.c knows how to get the frequency on some systems, or can measure a
Packit 5c3484
cycle counter against gettimeofday() or getrusage(), but when that fails, or
Packit 5c3484
needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
Packit 5c3484
used (in Hertz).  For example in "bash" on a 650 MHz machine,
Packit 5c3484
Packit 5c3484
	export GMP_CPU_FREQUENCY=650e6
Packit 5c3484
Packit 5c3484
A high precision time base makes it possible to get accurate measurements in
Packit 5c3484
a shorter time.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
EXAMPLE COMPARISONS - VARIOUS
Packit 5c3484
Packit 5c3484
Here are some ideas for things that can be done with the speed program.
Packit 5c3484
Packit 5c3484
There's always going to be a certain amount of overhead in the time
Packit 5c3484
measurements, due to reading the time base, and in the loop that runs a
Packit 5c3484
routine enough times to get a reading of the desired precision.  Noop
Packit 5c3484
functions taking various arguments are available to measure this.  The
Packit 5c3484
"overhead" printed by the speed program each time in its intro is the "noop"
Packit 5c3484
routine, but note that this is just for information, it isn't deducted from
Packit 5c3484
the times printed or anything.
Packit 5c3484
Packit 5c3484
	./speed -s 1 noop noop_wxs noop_wxys
Packit 5c3484
Packit 5c3484
To see how many cycles per limb a routine is taking, look at the time
Packit 5c3484
increase when the size increments, using option -D.  This avoids fixed
Packit 5c3484
overheads in the measuring.  Also, remember many of the assembler routines
Packit 5c3484
have unrolled loops, so it might be necessary to compare times at, say, 16,
Packit 5c3484
32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
Packit 5c3484
finishing off.
Packit 5c3484
Packit 5c3484
        ./speed -s 16-64 -t 16 -C -D mpn_add_n
Packit 5c3484
Packit 5c3484
The -C option on its own gives cycles per limb, but is really only useful at
Packit 5c3484
big sizes where fixed overheads are small compared to the code doing the
Packit 5c3484
real work.  Remember of course memory caching and/or page swapping will
Packit 5c3484
affect results at large sizes.
Packit 5c3484
Packit 5c3484
        ./speed -s 500000 -C mpn_add_n
Packit 5c3484
Packit 5c3484
Once a calculation stops fitting in the CPU data cache, it's going to start
Packit 5c3484
taking longer.  Exactly where this happens depends on the cache priming in
Packit 5c3484
the measuring routines, and on what sort of "least recently used" the
Packit 5c3484
hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
Packit 5c3484
32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
Packit 5c3484
limbs.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
Packit 5c3484
	gnuplot foo.gnuplot
Packit 5c3484
Packit 5c3484
When a routine has an unrolled loop for, say, multiples of 8 limbs and then
Packit 5c3484
an ordinary loop for the remainder, it can happen that it's actually faster
Packit 5c3484
to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
Packit 5c3484
draws a graph of mpn_sub_n, to see whether times smoothly increase with
Packit 5c3484
size.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-100 -c -P foo mpn_sub_n
Packit 5c3484
	gnuplot foo.gnuplot
Packit 5c3484
Packit 5c3484
If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
Packit 5c3484
ought to be faster (or at least not slower) than shifting by, say, 2 bits.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
Packit 5c3484
Packit 5c3484
An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
Packit 5c3484
if the lshift isn't faster there's an obvious improvement that's possible.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
Packit 5c3484
Packit 5c3484
On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
Packit 5c3484
destination is one of the sources is faster than a separate destination.
Packit 5c3484
Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
Packit 5c3484
mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
Packit 5c3484
Packit 5c3484
The gmp manual points out that divisions by powers of two should be done
Packit 5c3484
using a right shift because it'll be significantly faster than an actual
Packit 5c3484
division.  The following shows by what factor mpn_rshift is faster than
Packit 5c3484
mpn_divrem_1, using division by 32 as an example.
Packit 5c3484
Packit 5c3484
        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
EXAMPLE COMPARISONS - MULTIPLICATION
Packit 5c3484
Packit 5c3484
mul_basecase takes a ".<r>" parameter. If positive, it gives the second
Packit 5c3484
(smaller) operand size.  For example to show speeds for 3x3 up to 20x3 in
Packit 5c3484
cycles,
Packit 5c3484
Packit 5c3484
        ./speed -s 3-20 -c mpn_mul_basecase.3
Packit 5c3484
Packit 5c3484
A negative ".<-r>" parameter fixes the size of the product to the absolute
Packit 5c3484
value r.  For example to show speeds for 10x10 up to 19x1 in cycles,
Packit 5c3484
Packit 5c3484
        ./speed -s 10-19 -c mpn_mul_basecase.-20
Packit 5c3484
Packit 5c3484
mul_basecase with no parameter does an NxN multiply, so for example to show
Packit 5c3484
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
Packit 5c3484
Packit 5c3484
        ./speed -s 1-20 -c mpn_mul_basecase
Packit 5c3484
Packit 5c3484
sqr_basecase is implemented by a "triangular" method on most CPUs, making it
Packit 5c3484
up to twice as fast as mul_basecase.  In practice loop overheads and the
Packit 5c3484
products on the diagonal mean it falls short of this.  Here's an example
Packit 5c3484
running the two and showing by what factor an NxN mul_basecase is slower
Packit 5c3484
than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
Packit 5c3484
below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
Packit 5c3484
Packit 5c3484
        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
Packit 5c3484
Packit 5c3484
The technique described above with -CD for showing the time difference in
Packit 5c3484
cycles per limb between two size operations can be done on an NxN
Packit 5c3484
mul_basecase using -E to change the basis for the size increment to N*N.
Packit 5c3484
For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
Packit 5c3484
doing 256 limbs.  The following therefore shows the per crossproduct speed
Packit 5c3484
of mul_basecase and sqr_basecase at around 20x20 limbs.
Packit 5c3484
Packit 5c3484
        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
Packit 5c3484
Packit 5c3484
Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
Packit 5c3484
interesting to compare it to mul_basecase as if it was.  For sqr_basecase
Packit 5c3484
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
Packit 5c3484
is the triangular products sqr_basecase does.  For example,
Packit 5c3484
Packit 5c3484
        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
Packit 5c3484
Packit 5c3484
Both -E and -F are preliminary and might change.  A consistent approach to
Packit 5c3484
using them when claiming certain per crossproduct or per triangularproduct
Packit 5c3484
speeds hasn't really been established, but the increment between speeds in
Packit 5c3484
the range karatsuba will call seems sensible, that being k to k/2.  For
Packit 5c3484
instance, if the karatsuba threshold was 20 for the multiply and 30 for the
Packit 5c3484
square,
Packit 5c3484
Packit 5c3484
        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
Packit 5c3484
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
EXAMPLE COMPARISONS - MALLOC
Packit 5c3484
Packit 5c3484
The gmp manual recommends application programs avoid excessive initializing
Packit 5c3484
and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
Packit 5c3484
variable will at a minimum go through an init, a realloc for its first
Packit 5c3484
store, and finally a clear.  Quite how long that takes depends on the C
Packit 5c3484
library.  The following compares an mpz_init/realloc/clear to a 10 limb
Packit 5c3484
mpz_add.  Don't be surprised if the mallocing is quite slow.
Packit 5c3484
Packit 5c3484
        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
Packit 5c3484
Packit 5c3484
On some systems malloc and free are much slower when dynamic linked.  The
Packit 5c3484
speed-dynamic program can be used to see this.  For example the following
Packit 5c3484
measures malloc/free, first static then dynamic.
Packit 5c3484
Packit 5c3484
        ./speed -s 10 -c malloc_free
Packit 5c3484
        ./speed-dynamic -s 10 -c malloc_free
Packit 5c3484
Packit 5c3484
Of course a real world program has big problems if it's doing so many
Packit 5c3484
mallocs and frees that it gets slowed down by a dynamic linked malloc.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
EXAMPLE COMPARISONS - STRING CONVERSIONS
Packit 5c3484
Packit 5c3484
mpn_get_str does a binary to string conversion.  The base is specified with
Packit 5c3484
a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
Packit 5c3484
than general bases.  The following compares decimal and hex for instance.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
Packit 5c3484
Packit 5c3484
Smaller bases need more divisions to split a given size number, and so are
Packit 5c3484
slower.  The following compares base 3 and base 9.  On small operands 9 will
Packit 5c3484
be nearly twice as fast, though at bigger sizes this reduces since in the
Packit 5c3484
current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
Packit 5c3484
limbs) and those divisions come to dominate.
Packit 5c3484
Packit 5c3484
        ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
Packit 5c3484
Packit 5c3484
mpn_set_str does a string to binary conversion.  The base is specified with
Packit 5c3484
a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
Packit 5c3484
general bases on large conversions.
Packit 5c3484
Packit 5c3484
	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
Packit 5c3484
Packit 5c3484
mpn_set_str also has some special case code for decimal which is a bit
Packit 5c3484
faster than the general case, basically by giving the compiler a chance to
Packit 5c3484
optimize some multiplications by 10.
Packit 5c3484
Packit 5c3484
	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
EXAMPLE COMPARISONS - GCDs
Packit 5c3484
Packit 5c3484
mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
Packit 5c3484
x and y are single limbs.  This isn't tuned currently, but a value can be
Packit 5c3484
established by a measurement like
Packit 5c3484
Packit 5c3484
	./speed -s 10-32 mpn_gcd_1.10
Packit 5c3484
Packit 5c3484
This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
Packit 5c3484
threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
Packit 5c3484
is done by nibbling away at the 32-bit operands bit-by-bit.  When the
Packit 5c3484
threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
Packit 5c3484
10x10 bit operation.
Packit 5c3484
Packit 5c3484
The threshold in mpn/generic/gcd_1.c or the various assembler
Packit 5c3484
implementations can be tweaked up or down until there's no more speedups on
Packit 5c3484
interesting combinations of sizes.  Note that this affects only a 1x1 limb
Packit 5c3484
operation and so isn't very important.  (An Nx1 limb operation always does
Packit 5c3484
an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
SPEED PROGRAM EXTENSIONS
Packit 5c3484
Packit 5c3484
Potentially lots of things could be made available in the program, but it's
Packit 5c3484
been left at only the things that have actually been wanted and are likely
Packit 5c3484
to be reasonably useful in the future.
Packit 5c3484
Packit 5c3484
Extensions should be fairly easy to make though.  speed-ext.c is an example,
Packit 5c3484
in a style that should suit one-off tests, or new code fragments under
Packit 5c3484
development.
Packit 5c3484
Packit 5c3484
many.pl is a script for generating a new speed program supplemented with
Packit 5c3484
alternate versions of the standard routines.  It can be used for measuring
Packit 5c3484
experimental code, or for comparing different implementations that exist
Packit 5c3484
within a CPU family.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
THRESHOLD EXAMINING
Packit 5c3484
Packit 5c3484
The speed program can be used to examine the speeds of different algorithms
Packit 5c3484
to check the tune program has done the right thing.  For example to examine
Packit 5c3484
the karatsuba multiply threshold,
Packit 5c3484
Packit 5c3484
	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
Packit 5c3484
Packit 5c3484
When examining the toom3 threshold, remember it depends on the karatsuba
Packit 5c3484
threshold, so the right karatsuba threshold needs to be compiled into the
Packit 5c3484
library first.  The tune program uses specially recompiled versions of
Packit 5c3484
mpn/mul_n.c etc for this reason, but the speed program simply uses the
Packit 5c3484
normal libgmp.la.
Packit 5c3484
Packit 5c3484
Note further that the various routines may recurse into themselves on sizes
Packit 5c3484
far enough above applicable thresholds.  For example, mpn_kara_mul_n will
Packit 5c3484
recurse into itself on sizes greater than twice the compiled-in
Packit 5c3484
MUL_TOOM22_THRESHOLD.
Packit 5c3484
Packit 5c3484
When doing the above comparison between mul_basecase and kara_mul_n what's
Packit 5c3484
probably of interest is mul_basecase versus a kara_mul_n that does one level
Packit 5c3484
of Karatsuba then calls to mul_basecase, but this only happens on sizes less
Packit 5c3484
than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
Packit 5c3484
setting can be compiled-in to avoid the problem if necessary.  The same
Packit 5c3484
applies to toom3 and DC, though in a trickier fashion.
Packit 5c3484
Packit 5c3484
There are some upper limits on some of the thresholds, arising from arrays
Packit 5c3484
dimensioned according to a threshold (mpn_mul_n), or asm code with certain
Packit 5c3484
sized displacements (some x86 versions of sqr_basecase).  So putting huge
Packit 5c3484
values for the thresholds, even just for testing, may fail.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
FUTURE
Packit 5c3484
Packit 5c3484
Make a program to check the time base is working properly, for small and
Packit 5c3484
large measurements.  Make it able to test each available method, including
Packit 5c3484
perhaps the apparent resolution of each.
Packit 5c3484
Packit 5c3484
Make a general mechanism for specifying operand overlap, and a syntax like
Packit 5c3484
maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
Packit 5c3484
sort of thing with the "r" parameter currently.
Packit 5c3484
Packit 5c3484
Packit 5c3484
Packit 5c3484
----------------
Packit 5c3484
Local variables:
Packit 5c3484
mode: text
Packit 5c3484
fill-column: 76
Packit 5c3484
End: