Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, and the sse2
subdirectory has routines using SSE2 instructions.  All P4s have these; the
separate directories exist only so that configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_l/rshift               1.75



The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.
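As a worked illustration of how to read the table, the figures can be
turned into rough cycle estimates by multiplying by the operand size.  The
sketch below is plain C and not part of GMP; the names est_cycles_linear
and est_cycles_mul_basecase are made up for this example, and the results
are only as good as the approximate figures in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of GMP: turn the table figures into
   rough cycle counts.  Call overhead and cache effects are ignored.  */

/* Linear routines (add_n, mul_1, lshift, ...): cycles/limb times limbs.  */
double
est_cycles_linear (double cycles_per_limb, size_t n)
{
  assert (cycles_per_limb > 0.0);
  return cycles_per_limb * (double) n;
}

/* mul_basecase: one crossproduct per pair of limbs, so un*vn of them.  */
double
est_cycles_mul_basecase (double cycles_per_crossproduct,
                         size_t un, size_t vn)
{
  assert (cycles_per_crossproduct > 0.0);
  return cycles_per_crossproduct * (double) un * (double) vn;
}
```

For instance est_cycles_linear (4.0, 1000) models a 1000-limb mpn_add_n
with a separate destination at about 4000 cycles, and est_cycles_linear
(6.0, 1000) models the same call done in-place at about 6000 cycles.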

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.
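To see why every addmul and submul call is affected, and not only calls
where the caller passes overlapping operands, it helps to look at what the
operation computes.  The following is a portable C model of an addmul_1
style loop on 32-bit limbs.  It is a sketch for illustration only, not the
assembly in this directory, and the name ref_addmul_1 is made up here.
Each iteration loads rp[i] and then stores to the same address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not GMP code: {rp,n} += {up,n} * v on
   32-bit limbs, returning the carry-out limb.  */
uint32_t
ref_addmul_1 (uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
  uint64_t carry = 0;
  size_t i;

  assert (n > 0);
  for (i = 0; i < n; i++)
    {
      /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1.  */
      uint64_t acc = (uint64_t) up[i] * v + rp[i] + carry;
      rp[i] = (uint32_t) acc;   /* store to the location just loaded */
      carry = acc >> 32;
    }
  return (uint32_t) carry;
}
```

A plain mul_1 or add_n only has this load-then-store pattern when the
caller passes the destination as one of the sources, which is the
"in-place" column of the table.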



NOTES

The Pentium-4 pipeline, "NetBurst", provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bit
values within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.
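The carry handling described here can be modelled in portable C.  Each
32-bit limb is zero-extended to 64 bits, a 64-bit add stands in for paddq,
and a right shift by 32 stands in for psrlq to bring the carry down for
the next limb.  This is a sketch of the idea only: ref_add_n is a made-up
name, and the real code uses MMX instructions rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not GMP code: {rp,n} = {ap,n} + {bp,n} on 32-bit
   limbs, returning the carry-out (0 or 1) without using a carry flag.  */
uint32_t
ref_add_n (uint32_t *rp, const uint32_t *ap, const uint32_t *bp, size_t n)
{
  uint64_t acc = 0;             /* plays the part of a 64-bit mmx register */
  size_t i;

  assert (n > 0);
  for (i = 0; i < n; i++)
    {
      /* Sum of the two limbs; independent of the carry chain.  */
      uint64_t sum = (uint64_t) ap[i] + bp[i];

      acc += sum;               /* like paddq: any carry lands in bit 32 */
      rp[i] = (uint32_t) acc;   /* store the low 32 bits */
      acc >>= 32;               /* like psrlq $32: carry moves to bit 0 */
    }
  return (uint32_t) acc;
}
```

The only loop-carried dependency in this model is the add into acc
followed by the shift, which corresponds to the paddq/psrlq pair whose
latency is quoted above.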

incl and decl should be avoided; use add $1 and sub $1 instead.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.
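For reference, the double-width shift that shldl provides is what a
multi-limb left shift needs at each position: the bits of one limb shifted
up, combined with the top bits of the limb below it.  The portable C model
below shows that computation on 32-bit limbs.  The name ref_lshift is made
up for this example, and the model says nothing about which instructions
the assembly here uses to avoid the slow ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not GMP code: shift {up,n} left by cnt bits into
   {rp,n} and return the bits shifted out of the top limb.  */
uint32_t
ref_lshift (uint32_t *rp, const uint32_t *up, size_t n, unsigned cnt)
{
  uint32_t out;
  size_t i;

  assert (n > 0 && cnt >= 1 && cnt <= 31);
  out = up[n - 1] >> (32 - cnt);
  /* Work from the high limb down, so rp may equal up.  */
  for (i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (32 - cnt));
  rp[0] = up[0] << cnt;
  return out;
}
```

Each rp[i] in the loop is one shld-style result, so a straightforward
translation would execute one such instruction per limb.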

movq mmx -> mmx does have a 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycles latency can be used instead.
The movq, however, executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

The xmm registers haven't found a use so far, but not much effort has been
expended.  A configure test for whether the operating system supports
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

        http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

        http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: