Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.



IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

  mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2. Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' wherever there are WAW
or RAW dependencies, with some notable exceptions. Such "breaks" are
typically at the end of a bundle, but can be put between operations within
some bundle types too.

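To make the bundle layout concrete, here is a minimal Python sketch of how a
128-bit bundle splits into a 5-bit template and three 41-bit instruction
slots (5 + 3*41 = 128). The field positions follow the architecture manual;
the function names are made up for illustration and are not part of GMP. The
template selects the execution unit type for each slot and also records where
stops fall, which is why a `;;' can only appear inside certain bundle types.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle is exactly 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def pack_bundle(template, slots):
    """Inverse of unpack_bundle (hypothetical helper, for illustration)."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle
```
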
The Itanium 1 and Itanium 2 implementations can, under ideal conditions,
execute two bundles (6 instructions) per cycle. The Itanium 1 allows 4 of
these instructions to be integer operations, while the Itanium 2 allows all
6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since many
issue slots are taken up by instructions with zero predicates, and since
many extra instructions are needed to set things up. These features are
clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions. We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever execution might run across the padding. That
macro suppresses any .align if the problem is detected by configure. Lack
of alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas. Use separate
".global foo" and "foo:" instead.

.global is the standard global directive. gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships. gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ...", because the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and earlier versions of the HP assembler seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion. See comments in ia64-defs.m4.


REGISTER USAGE

  Special:
    r0: constant 0
    r1: global pointer (gp)
    r8: return value
    r12: stack pointer (sp)
    r13: thread pointer (tp)
  Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
  Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working from 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

  ldf8 [up]
  xma.l
  xma.hu
  stf8 [wrp]

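As a rough illustration of the xma accumulation and of the blocked idea, the
Python sketch below models a 64-bit limb multiply. It assumes xma.l yields
the low 64 bits of a*b+c and xma.hu the high 64 bits of the same unsigned
value. The function names and the stitching step are illustrative only and
are not taken from the GMP sources.

```python
MASK = (1 << 64) - 1


def xma_l(a, b, c):
    """Low 64 bits of a*b + c (assumed xma.l semantics)."""
    return (a * b + c) & MASK


def xma_hu(a, b, c):
    """High 64 bits of the unsigned a*b + c (assumed xma.hu semantics)."""
    return (a * b + c) >> 64


def mul_1(up, v):
    """Multiply limb list up (least significant first) by v.

    Returns (result_limbs, carry_out). The carry limb is fed back in as
    the xma addend, so each step depends on the previous xma.hu.
    """
    out, carry = [], 0
    for u in up:
        out.append(xma_l(u, v, carry))
        carry = xma_hu(u, v, carry)
    return out, carry


def mul_1_blocked(up, v, nblocks=4):
    """Same result, computed as nblocks independent carry chains."""
    n = len(up)
    size = max(1, -(-n // nblocks))          # ceil(n / nblocks)
    # Each block starts with carry-in 0, so the chains do not depend on
    # each other and could be interleaved to hide the xma latency.
    parts = [mul_1(up[i:i + size], v) for i in range(0, n, size)]
    out, carry = [], 0
    for limbs, hi in parts:
        # Stitch: ripple the previous block's carry-out through this block.
        for limb in limbs:
            s = limb + carry
            out.append(s & MASK)
            carry = s >> 64
        carry += hi                           # cannot exceed 64 bits
    return out, carry
```

In this model the four chains still run one after another; the point is only
that their xma sequences are independent until the final stitching pass.
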
================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1. We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

  ld8 [rp]
  ldf8 [up]
  xma.l
  xma.hu
  getf.sig
  add+add+cmp+cmp
  st8 [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l, and with the
4 cycle latency of xma, this means we need at least 3 blocks. Using ldfp8
we could approach 1.583 c/l.

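The figures above can be reproduced with a little arithmetic. The sketch
below is one reading of them, not a statement of how they were derived: it
assumes the 6 instructions/cycle issue limit is the only constraint, that
each block carries a 4-cycle xma recurrence per limb, and that ldfp8 fetches
two limbs per instruction, so the ldf8 cost drops to half an instruction per
limb.

```python
from fractions import Fraction
from math import ceil

ISSUE_WIDTH = 6    # two 3-instruction bundles per cycle
XMA_LATENCY = 4    # cycles, as quoted in the text

# ld8, ldf8, xma.l, xma.hu, getf.sig, add, add, cmp, cmp, st8
INSNS_PER_LIMB = 10

# Issue-limited throughput for one limb.
issue_bound = Fraction(INSNS_PER_LIMB, ISSUE_WIDTH)

# ASSUMPTION: one chain delivers a limb every XMA_LATENCY cycles, so enough
# independent blocks are needed to cover that latency at the issue bound.
blocks_needed = ceil(XMA_LATENCY / issue_bound)

# ASSUMPTION: ldfp8 replaces two ldf8 with one instruction.
ldfp8_bound = (INSNS_PER_LIMB - Fraction(1, 2)) / ISSUE_WIDTH

print(float(issue_bound), blocks_needed, float(ldfp8_bound))
```
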
================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2. Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop. For the Itanium
1, we need to take about 8 limbs at a time for full speed. For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use in the other routines is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots. When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.

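The text does not spell out the two carry sequences, so the following Python
model is only one plausible reading. In the 4-operation form both candidate
sums (with and without carry-in) and both carry-outs are computed up front,
and the incoming carry merely selects between them, which keeps the carry
recurrence short. In the 3-operation form the carry-out is detected by an
unsigned compare and then added into a higher word. All names here are
hypothetical.

```python
MASK = (1 << 64) - 1


def add_carry_select(a, b, cin):
    """4-op form: two adds and two compares, selected by the carry-in."""
    s0 = (a + b) & MASK          # add, assuming carry-in = 0
    c0 = s0 < a                  # cmp: wrapped iff sum is below an operand
    s1 = (a + b + 1) & MASK      # add, assuming carry-in = 1
    c1 = s1 <= a                 # cmp: wrapped iff sum is not above operand
    return (s1, int(c1)) if cin else (s0, int(c0))


def add_cmp_add(lo_a, lo_b, hi):
    """3-op form: add, compare, then add the carry into the high word."""
    s = (lo_a + lo_b) & MASK     # add
    carry = int(s < lo_a)        # cmp
    return s, (hi + carry) & MASK  # add
```
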
  /* First load the 8 values from v */
        ldfp8           v0, v1 = [r35], 16;;
        ldfp8           v2, v3 = [r35], 16;;
        ldfp8           v4, v5 = [r35], 16;;
        ldfp8           v6, v7 = [r35], 16;;

  /* In the inner loop, get a new U limb and store a result limb. */
        mov             lc = un
  Loop: ldf8            u0 = [r33], 8
        ld8             r0 = [r32]
        xma.l           lp0 = v0, u0, hp0
        xma.hu          hp0 = v0, u0, hp0
        xma.l           lp1 = v1, u0, hp1
        xma.hu          hp1 = v1, u0, hp1
        xma.l           lp2 = v2, u0, hp2
        xma.hu          hp2 = v2, u0, hp2
        xma.l           lp3 = v3, u0, hp3
        xma.hu          hp3 = v3, u0, hp3
        xma.l           lp4 = v4, u0, hp4
        xma.hu          hp4 = v4, u0, hp4
        xma.l           lp5 = v5, u0, hp5
        xma.hu          hp5 = v5, u0, hp5
        xma.l           lp6 = v6, u0, hp6
        xma.hu          hp6 = v6, u0, hp6
        xma.l           lp7 = v7, u0, hp7
        xma.hu          hp7 = v7, u0, hp7
        getf.sig        l0 = lp0
        getf.sig        l1 = lp1
        getf.sig        l2 = lp2
        getf.sig        l3 = lp3
        getf.sig        l4 = lp4
        getf.sig        l5 = lp5
        getf.sig        l6 = lp6
        add+cmp+add     xx, l0, r0
        add+cmp+add     acc0, acc1, l1
        add+cmp+add     acc1, acc2, l2
        add+cmp+add     acc2, acc3, l3
        add+cmp+add     acc3, acc4, l4
        add+cmp+add     acc4, acc5, l5
        add+cmp+add     acc5, acc6, l6
        getf.sig        acc6 = lp7
        st8             [r32] = xx, 8
        br.cloop        Loop

        49 insn at max 6 insn/cycle:            8.167 cycles/limb8
        11 memops at max 2 memops/cycle:        5.5 cycles/limb8
        16 fpops at max 2 fpops/cycle:          8 cycles/limb8
        21 intops at max 4 intops/cycle:        5.25 cycles/limb8
        11+21 memops+intops at max 4/cycle:     8 cycles/limb8

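The bounds listed above are simple ratios of instruction counts to per-cycle
limits. A small Python check, using only the counts and limits quoted in the
list (the dictionary keys are made-up labels), shows that total issue width
is the tightest constraint for this loop:

```python
from fractions import Fraction

# (instruction count per loop iteration, max issued per cycle)
resources = {
    "issue": (49, 6),
    "memops": (11, 2),
    "fpops": (16, 2),
    "intops": (21, 4),
    "memops+intops": (11 + 21, 4),
}

# Lower bound on cycles per iteration imposed by each resource.
bounds = {name: Fraction(count, limit)
          for name, (count, limit) in resources.items()}

# The loop cannot run faster than its most constrained resource.
binding = max(bounds, key=bounds.get)

for name, bound in bounds.items():
    print(f"{name:14s} {float(bound):.3f}")
print("binding:", binding)
```
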
================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency. We need 63 loops since
this instruction only accepts an immediate count. That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).

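For reference, a minimal Python model of a left shift built on shrp. It
assumes shrp concatenates its two sources into a 128-bit value, shifts right
by the immediate count, and keeps the low 64 bits. A left shift by cnt then
becomes shrp with count 64-cnt on each pair of adjacent limbs. The limb order
and function names are illustrative, and the real routine's traversal order
is not modelled.

```python
MASK = (1 << 64) - 1


def shrp(hi, lo, count):
    """Low 64 bits of the 128-bit value hi:lo shifted right by count."""
    return (((hi << 64) | lo) >> count) & MASK


def lshift(up, cnt):
    """Shift limb list up (least significant first) left by cnt bits.

    Requires 1 <= cnt <= 63, matching the 63 possible immediate counts.
    Returns (result_limbs, bits_shifted_out_of_the_top_limb).
    """
    if not 1 <= cnt <= 63:
        raise ValueError("cnt must be in 1..63")
    out, prev = [], 0
    for u in up:
        # Each result limb combines the current limb with the one below it.
        out.append(shrp(u, prev, 64 - cnt))
        prev = u
    return out, prev >> (64 - cnt)
```
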
================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2. But that is just for L1
cache hits. The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great. It might be best to actually use modulo-scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed. Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel documents 245317-004, 245318-004, 245319-004, October 2002. Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001. Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002. Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001. Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm