Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need special
handling by unwinding.  It also says whether $27 is really needed for computing
the gp.  The `.mask M' pseudo op says which registers are saved on the stack,
and at what offset in the frame.

Cray T3 code is very very different...

"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
/ "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
them to "$6" or "$f6" where necessary.

"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
required.  The X() macro accommodates this difference.

"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
necessary.

"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
the Unicos assembler.  The full "ornot" must be used.

"unop" is not available in Unicos.  We have an m4 define expand it to the usual
"ldq_u r31,0(r30)", and in fact use that define on all systems since it comes
out the same.

"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
available in older alpha assemblers (including gas prior to 2.12), according to
the GCC manual, so the assembler macro forms must be used (eg. ldgp).



RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to the
   off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   of one every 21 cycles.
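
As a rough sketch of the arithmetic in point 3 above (not part of the GMP
sources; both function names are made up for illustration): the issue interval
is the documented latency minus the pipelined cycles.  Assuming a loop needs
one mulq and one umulh per limb, as a mul_1-style loop would, that interval
puts a floor of 42 cycles/limb under such a loop on EV4.

```c
#include <assert.h>

/* Hypothetical helper: cycles between successive multiply issues, i.e. the
   documented latency minus the cycles that overlap with the next multiply. */
static int mul_issue_interval(int latency, int pipelined_cycles)
{
    assert(pipelined_cycles >= 0 && latency > pipelined_cycles);
    return latency - pipelined_cycles;
}

/* Hypothetical helper: lower bound in cycles/limb for a loop that issues
   muls_per_limb multiply instructions per limb, ignoring everything else. */
static int mul_bound_cycles_per_limb(int issue_interval, int muls_per_limb)
{
    assert(issue_interval > 0 && muls_per_limb > 0);
    return issue_interval * muls_per_limb;
}
```

With the EV4 figures quoted above, mul_issue_interval(23, 2) gives 21, and
mul_bound_cycles_per_limb(21, 2) gives 42.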

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no load can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
   are memory operations.  This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

               ldq  ldq
                 \  /\
          (or)   addq |
           |\   /   \ |
           | addq  cmpult
            \  |     |
             cmpult  |
                 \  /
                  or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   might waste a cycle on EV4.  The total depth remains unaffected, since cmov
   has a latency of 2 cycles.

     addq
     /   \
   addq  cmpult
     |      \
   cmpult -> cmovne

   Montgomery has a slightly different way of computing carry that requires one
   fewer instruction, but has depth 4 (instead of the current 3).  Since the
   code is currently instruction issue bound, Montgomery's idea should save us
   1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
   cycles/limb.  Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every 2nd multiply and
   integer operations for the others.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
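
The dependency graph for mpn_add_n in point 3 above can be written out in C as
a minimal sketch.  This is not the GMP implementation: sketch_add_n is a
made-up name, limbs are assumed to be 64 bits, and there is no unrolling or
scheduling.  Each statement is labelled with the Alpha instruction it stands
for, so the three-operation path from carry-in to carry-out (addq, cmpult, or)
is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of the addq/cmpult/or carry chain.
   Adds the n-limb numbers {up,n} and {vp,n}, stores the low n limbs of the
   sum at rp, and returns the carry-out (0 or 1). */
static uint64_t sketch_add_n(uint64_t *rp, const uint64_t *up,
                             const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;                    /* carry-in, always 0 or 1 */
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i];    /* addq:   sum of the two loads    */
        uint64_t c1 = s < up[i];        /* cmpult: carry from that sum     */
        uint64_t r  = s + cy;           /* addq:   add carry-in            */
        uint64_t c2 = r < cy;           /* cmpult: carry from carry-in add */
        cy = c1 | c2;                   /* or:     carry-out               */
        rp[i] = r;
    }
    return cy;
}
```

Only the second addq, the second cmpult and the or depend on the incoming
carry; the first addq and cmpult can be computed ahead of time for every limb.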

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One integer
multiply instruction can issue each cycle.  To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable.  Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in to a vector point until we can generate carry-out.
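
One way to picture "minimizing the depth of recurrences" for carry propagation
is the C sketch below.  It is an illustration only, not the scheme used by the
GMP EV6 code; the function name, the BLOCK constant and the generate/propagate
formulation are assumptions made for this example.  Everything that does not
need the incoming carry is computed first, so the loop-carried path per limb
shrinks to an and followed by an or.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 4 };     /* hypothetical unroll factor */

/* Hypothetical illustration: n-limb addition where only "g | (p & cy)" sits
   on the carry recurrence.  n must be a multiple of BLOCK.  Returns the
   carry-out (0 or 1). */
static uint64_t sketch_add_n_short_chain(uint64_t *rp, const uint64_t *up,
                                         const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    assert(n % BLOCK == 0);
    for (size_t i = 0; i < n; i += BLOCK) {
        uint64_t s[BLOCK], g[BLOCK], p[BLOCK];

        /* Phase 1: independent of carry-in, so all limbs can overlap. */
        for (int k = 0; k < BLOCK; k++) {
            s[k] = up[i + k] + vp[i + k];
            g[k] = s[k] < up[i + k];        /* limb generates a carry     */
            p[k] = s[k] == UINT64_MAX;      /* limb propagates a carry-in */
        }

        /* Phase 2: the recurrence proper.  The result limb consumes cy but
           does not feed the next carry, so it is off the critical path. */
        for (int k = 0; k < BLOCK; k++) {
            rp[i + k] = s[k] + cy;
            cy = g[k] | (p[k] & cy);
        }
    }
    return cy;
}
```

A limb cannot both generate and propagate (a sum that wrapped is at most
2^64-2), so the or never combines two set bits.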

Most integer instructions can execute in any of the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV
should always be placed as the last instruction of an aligned 4 instruction
block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.



REFERENCES

"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
EC-QD2KC-TE.

"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
order number EC-QP99C-TE.

"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
Compaq, September 2000, order number DS-0028B-TE.

"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
EC-RJ66A-TE.

All of the above are available online from

  http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  ftp://ftp.compaq.com/pub/products/alphaCPUdocs

"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
number AA-PS31D-TE.

"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
March 1996, part number AA-PY8AC-TE.

The above are available online,

  http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
online documentation" from the main www.hp.com page.)



----------------
Local variables:
mode: text
fill-column: 79
End: