Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.





This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by unwinding. It also says whether $27 is really needed for
computing the gp. The `.mask M' pseudo op says which registers are saved on
the stack, and at what offset in the frame.

Cray T3 code is very very different...

"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
/ "f6" is required. We use the "r6" / "f6" forms, and have m4 defines expand
them to "$6" or "$f6" where necessary.
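
As an illustration of the register-name mapping described above, the
following Python sketch rewrites the portable "r6" / "f6" forms into the
"$6" / "$f6" forms. The real build does this with m4 defines; the function
name to_dollar_syntax and the regular expression are assumptions made for
this example only.

```python
import re

# Matches a whole-word integer (rN) or floating-point (fN) register name.
# Hypothetical helper for illustration; GMP itself uses m4 defines for this.
_REG = re.compile(r"\b([rf])(\d{1,2})\b")


def to_dollar_syntax(line):
    """Rewrite "r6" -> "$6" and "f6" -> "$f6" in one line of assembly."""

    def repl(match):
        kind, num = match.group(1), int(match.group(2))
        if num > 31:          # not a valid Alpha register number, leave as-is
            return match.group(0)
        return f"${num}" if kind == "r" else f"$f{num}"

    return _REG.sub(repl, line)


if __name__ == "__main__":
    print(to_dollar_syntax("ldq_u r31,0(r30)"))   # ldq_u $31,0($30)
    print(to_dollar_syntax("cvtqt f6,f7"))        # cvtqt $f6,$f7
```

Hex constants such as 0xf6 are left alone because the pattern only matches
register names that stand as whole words.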

"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
required. The X() macro accommodates this difference.

"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
accept either. We use cvttqc and have an m4 define expand to cvttq/c where
necessary.

"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
the Unicos assembler. The full "ornot" must be used.

"unop" is not available in Unicos. We make an m4 define to the usual "ldq_u
r31,0(r30)", and in fact use that define on all systems since it comes out the
same.

"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
available in older alpha assemblers (including gas prior to 2.12), according to
the GCC manual, so the assembler macro forms must be used (e.g. ldgp).



RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth. The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to the
   off-chip L2 in as much as 15 cycles on most systems. This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined. Thus, multiply instructions can be issued at a rate
   of one every 21 cycles.
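
To make item 3 concrete: a full 64x64-bit product needs both mulq (low half)
and umulh (high half), so a loop in the style of mpn_mul_1 issues two
multiplies per limb. The Python sketch below turns the issue interval quoted
above into a lower bound on cycles per limb. The function name, and the
assumption that nothing but multiplier throughput limits the loop, are made
up for this illustration.

```python
def min_multiply_cycles_per_limb(issue_interval, multiplies_per_limb=2):
    """Lower bound on cycles/limb set by multiplier throughput alone.

    Assumes (for illustration) that multiplies cannot overlap beyond the
    stated issue interval and that nothing else limits the loop.
    """
    return issue_interval * multiplies_per_limb


# EV4 figures from item 3: 23 cycles of latency, 2 of them pipelined.
EV4_LATENCY = 23
EV4_PIPELINED = 2
ev4_issue_interval = EV4_LATENCY - EV4_PIPELINED   # one multiply per 21 cycles

if __name__ == "__main__":
    # mulq + umulh per limb
    print(min_multiply_cycles_per_limb(ev4_issue_interval))   # 42
```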

EV5

1. The memory bandwidth of this chip is good, both for loads and stores. The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no load can issue.

2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
   umulh has a latency of 14 cycles and an issue rate of one every 10 cycles.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations. This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

               ldq  ldq
                 \  /\
          (or)   addq |
           |\   /   \ |
           | addq  cmpult
            \  |     |
             cmpult  |
                 \  /
                  or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs. We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   might waste a cycle on EV4. The total depth remains unaffected, since cmov
   has a latency of 2 cycles.

          addq
          /   \
        addq  cmpult
          |     \
        cmpult -> cmovne

   Montgomery has a slightly different way of computing carry that requires
   one fewer instruction, but has depth 4 (instead of the current 3). Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb. Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every second multiply and
   integer operations for the others.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way. See
   powerpc64/README for some more details.
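
The carry recurrence in item 3 can be written out limb by limb. The sketch
below models one 64-bit limb step with the same five operations as the first
dependency graph (addq, cmpult, addq, cmpult, or), using Python integers
masked to 64 bits. The function names and the list-of-limbs representation
are illustrative assumptions, not GMP's actual interface.

```python
MASK64 = (1 << 64) - 1  # one Alpha limb is 64 bits


def add_limb(a, b, carry_in):
    """One limb of an add-with-carry, mirroring addq/cmpult/addq/cmpult/or."""
    s0 = (a + b) & MASK64           # addq:   sum of the two loaded limbs
    c0 = int(s0 < b)                # cmpult: did a + b wrap around?
    s1 = (s0 + carry_in) & MASK64   # addq:   add the incoming carry
    c1 = int(s1 < carry_in)         # cmpult: did adding the carry wrap?
    return s1, c0 | c1              # or:     at most one of c0, c1 is set


def add_n(ap, bp):
    """Add two equal-length limb lists (least significant limb first).

    Returns (result_limbs, carry_out), in the spirit of mpn_add_n.
    """
    assert len(ap) == len(bp)
    result, carry = [], 0
    for a, b in zip(ap, bp):
        s, carry = add_limb(a, b, carry)
        result.append(s)
    return result, carry


if __name__ == "__main__":
    limbs, carry = add_n([MASK64, MASK64], [1, 0])
    print(limbs, carry)   # [0, 0] 1
```

Only the second addq, the second cmpult and the or lie on the path from
carry_in to the returned carry; the first addq and cmpult depend on the
loaded limbs alone and can be computed ahead of the carry.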

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle. In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage. One integer
multiply instruction can issue each cycle. To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable. Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in to a vector point until we can generate carry-out.

Most integer instructions can execute in any of the L0, U0, L1, or U1
pipelines. Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV
splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV
should always be placed as the last instruction of an aligned 4-instruction
block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster. Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.
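
The cost of cluster crossings on a dependent chain can be estimated with a
small model. The Python sketch below assumes every operation in the chain has
a base latency of one cycle and charges the one extra cycle described above
whenever a consumer runs on the opposite cluster from its producer. The
function name and the fixed pipeline assignments are assumptions for
illustration; as noted, the real assignment is decided dynamically by the
hardware.

```python
# Pipelines grouped by cluster, as named in the text above.
CLUSTER_OF = {"L0": 0, "U0": 0, "L1": 1, "U1": 1}


def chain_latency(pipes, base_latency=1, cross_penalty=1):
    """Latency of a chain of dependent ops executed on the given pipelines.

    pipes[i] is the pipeline that executes the i-th operation; each operation
    consumes the result of the one before it.
    """
    total = 0
    previous_cluster = None
    for pipe in pipes:
        cluster = CLUSTER_OF[pipe]
        total += base_latency
        if previous_cluster is not None and cluster != previous_cluster:
            total += cross_penalty   # result had to cross clusters
        previous_cluster = cluster
    return total


if __name__ == "__main__":
    print(chain_latency(["L0", "U0", "L0"]))   # 3: stays in one cluster
    print(chain_latency(["L0", "U1", "L0"]))   # 5: crosses twice
```

Under these assumptions a three-operation carry chain costs 3 cycles when it
stays within one cluster and up to 5 when every hop crosses clusters.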



REFERENCES

"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
EC-QD2KC-TE.

"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
order number EC-QP99C-TE.

"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
Compaq, September 2000, order number DS-0028B-TE.

"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
EC-RJ66A-TE.

All of the above are available online from

  http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  ftp://ftp.compaq.com/pub/products/alphaCPUdocs

"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
number AA-PS31D-TE.

"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
March 1996, part number AA-PY8AC-TE.

The above are available online,

  http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
online documentation" from the main www.hp.com page.)



----------------
Local variables:
mode: text
fill-column: 79
End: