Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.
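
The bundle layout can be sketched in a few lines of Python.  This is an
illustrative model only, not GMP code.  The names unpack_bundle and
pack_bundle are invented here, and the field positions (a 5-bit template in
the low bits, followed by three 41-bit instruction slots) come from the Intel
architecture manual rather than from this README.

```python
TEMPLATE_BITS = 5                      # low bits of the bundle
SLOT_BITS = 41                         # each of the three instruction slots
SLOT_MASK = (1 << SLOT_BITS) - 1


def unpack_bundle(bundle):
    """Split a 128-bit bundle (a Python int) into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle is exactly 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def pack_bundle(template, slots):
    """Inverse of unpack_bundle: build the 128-bit value from its fields."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need three 41-bit slots")
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle
```

The template selects which execution unit types the three slots use and
where stops fall, which is why a `;;' inside a bundle is only possible for
some bundle types.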

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles (six instructions) per cycle.  The Itanium 1 allows 4 of
these instructions to do integer operations, while the Itanium 2 allows all
6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instruction.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instruction.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed.  That macro
suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ...", since the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l (cycles per limb) on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working from 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8  [wrp]
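
The idea can be modelled in Python, which has arbitrary-precision integers.
This is only a sketch of the arithmetic, not of the scheduling: xma_l and
xma_hu mimic the low and high 64 bits of a*b+c, mul_1_chain is the plain
serial loop, and mul_1_blocked runs independent chains on separate parts of
the operand.  The function names, and the way block carries are folded in
afterwards, are assumptions made for illustration.

```python
MASK = (1 << 64) - 1


def xma_l(a, b, c):
    """Low 64 bits of a*b + c, as xma.l produces."""
    return (a * b + c) & MASK


def xma_hu(a, b, c):
    """High 64 bits of the unsigned a*b + c, as xma.hu produces."""
    return (a * b + c) >> 64


def mul_1_chain(up, v, carry=0):
    """Serial loop: each limb's high half feeds the next xma as addend."""
    out = []
    for u in up:
        out.append(xma_l(u, v, carry))
        carry = xma_hu(u, v, carry)
    return out, carry


def mul_1_blocked(up, v, nblocks=4):
    """Run independent chains per block, then fold carries in (assumed fix-up)."""
    n = len(up)
    size = max(1, -(-n // nblocks))          # ceil(n / nblocks), at least 1
    out, carry = [], 0
    for start in range(0, n, size):
        limbs, block_carry = mul_1_chain(up[start:start + size], v)
        # Add the carry left over from the previous block, rippling as needed.
        for i in range(len(limbs)):
            if carry == 0:
                break
            s = limbs[i] + carry
            limbs[i] = s & MASK
            carry = s >> 64
        carry += block_carry
        out.extend(limbs)
    return out, carry
```

Within a block the chain is limited by the xma latency.  Separate blocks
have no dependency on each other until the fix-up, which is what would let
several chains be interleaved in one loop.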

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8  [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l (10
instructions at 6 per cycle), and with the 4 cycle latency of xma, this
means we need at least 3 blocks.  Using ldfp8 we could approach 1.583 c/l
(9.5 instructions per limb).
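
As a reference for what the instruction sequence computes, here is a small
Python model of one limb step repeated over the operand.  It is a sketch
under assumptions: the name addmul_1 and the in-place list interface are
chosen for illustration, and the comments map each line only loosely onto
the instructions listed above.

```python
MASK = (1 << 64) - 1


def addmul_1(rp, up, v):
    """rp[0:n] += up[0:n] * v in place; return the carry-out limb."""
    hp = 0   # high product limb, carried along on the fp side
    cy = 0   # carry bit on the integer side
    for i, u in enumerate(up):          # ldf8 [up]
        prod = u * v + hp
        lo = prod & MASK                # xma.l, then getf.sig to an integer reg
        hp = prod >> 64                 # xma.hu
        s = rp[i] + lo + cy             # ld8 [rp], then the add part
        rp[i] = s & MASK                # st8 [wrp]
        cy = s >> 64                    # carry-out, found with cmp on hardware
    return hp + cy
```

For example, with rp = [MASK, MASK], up = [MASK, MASK] and v = MASK, the
call leaves rp == [0, MASK] and returns MASK.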

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use in the other code is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.
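
The two carry idioms can be illustrated in Python.  This is one plausible
reading of the shorthand, not a transcription of the actual code: the
4-slot form is modelled as two predicated adds (with and without +1) plus
two predicated unsigned compares, of which only the pair selected by the
incoming carry takes effect, and the 3-op form as add, compare, then add the
carry into the next word.  The function names and the bits parameter are
invented for the example.

```python
def add_carry_4slot(a, b, cin, bits=64):
    """Sum and carry-out of a + b + cin using the add/add/cmp/cmp pattern."""
    mask = (1 << bits) - 1
    s0 = (a + b) & mask          # add, effective when cin == 0
    s1 = (a + b + 1) & mask      # add with +1, effective when cin == 1
    c0 = s0 < a                  # unsigned "less than" detects carry for s0
    c1 = s1 <= a                 # unsigned "less or equal" detects carry for s1
    return (s1, int(c1)) if cin else (s0, int(c0))


def add_carry_3op(a, b, nxt, bits=64):
    """add + cmp + add: sum a + b, then fold the carry into the next word."""
    mask = (1 << bits) - 1
    s = (a + b) & mask           # add
    c = int(s < a)               # cmp recovers the carry from the wrapped sum
    return s, nxt + c            # add; nxt is assumed to have room for the carry
```

In the first form the carry-out depends on the carry-in only through the
choice between two compares, which is what keeps the recurrence short.  In
the second form the compare has to wait for the add and the final add has to
wait for the compare, giving a longer chain but fewer issue slots.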

/* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
	mov		lc = un
Loop:	ldf8		u0 = [r33], 8
	ld8		r0 = [r32]
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7
	getf.sig	l0 = lp0
	getf.sig	l1 = lp1
	getf.sig	l2 = lp2
	getf.sig	l3 = lp3
	getf.sig	l4 = lp4
	getf.sig	l5 = lp5
	getf.sig	l6 = lp6
	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
	getf.sig	acc6 = lp7
	st8		[r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle:	8 cycles/limb8
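
The table above is a set of per-resource lower bounds, each the operation
count divided by the issue limit.  The Python sketch below recomputes them
from a tally of the loop body.  The tally treats getf.sig as a memory-unit
operation, which is inferred from the count of 11 memops rather than stated
in the text, and the names loop_ops, limits and cycle_bounds are invented
for the example.

```python
from fractions import Fraction

# (count, resource class) for each kind of operation in the loop body.
loop_ops = {
    "ldf8": (1, "mem"),
    "ld8": (1, "mem"),
    "st8": (1, "mem"),
    "getf.sig": (8, "mem"),       # assumed to issue on a memory unit
    "xma.l/xma.hu": (16, "fp"),
    "add/cmp/add": (7 * 3, "int"),
    "br.cloop": (1, "br"),
}

# Issue limits per cycle, as given in the table.
limits = {"insn": 6, "mem": 2, "fp": 2, "int": 4, "mem+int": 4}


def cycle_bounds(ops, limits):
    """Return {resource: minimum cycles per loop iteration} as exact fractions."""
    totals = {"insn": 0, "mem": 0, "fp": 0, "int": 0}
    for count, kind in ops.values():
        totals["insn"] += count
        if kind in totals:
            totals[kind] += count
    totals["mem+int"] = totals["mem"] + totals["int"]
    return {name: Fraction(totals[name], limits[name]) for name in limits}


bounds = cycle_bounds(loop_ops, limits)
bottleneck = max(bounds, key=bounds.get)
```

Here the largest bound is the total instruction count, 49/6 cycles per
iteration, so overall issue width rather than any single unit type is the
limiting factor in this tally.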

================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction accepts only an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
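
To show why shrp fits this job, here is a Python model of the instruction
and of a left shift built from it.  It is a sketch only: shrp is modelled as
taking the low 64 bits of the 128-bit concatenation hi:lo shifted right by
count, and the lshift function name and its list-based interface are
assumptions for illustration.

```python
MASK = (1 << 64) - 1


def shrp(hi, lo, count):
    """Low 64 bits of the 128-bit value hi:lo shifted right by count."""
    if not 0 <= count <= 63:
        raise ValueError("count must be in 0..63")
    return (((hi << 64) | lo) >> count) & MASK


def lshift(up, cnt):
    """Shift the limb list up (least significant first) left by cnt bits.

    Returns (limbs, out) where out holds the bits shifted out at the top.
    """
    if not up or not 1 <= cnt <= 63:
        raise ValueError("need at least one limb and 1 <= cnt <= 63")
    rc = 64 - cnt                 # a left shift by cnt is a funnel shift by 64-cnt
    out = up[-1] >> rc
    limbs = [shrp(up[i], up[i - 1] if i else 0, rc) for i in range(len(up))]
    return limbs, out
```

Each result limb needs exactly one shrp on two adjacent source limbs.  In
the model count is an ordinary argument, whereas the real instruction needs
it as an immediate, hence one loop per shift count.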

================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2.  But that holds only for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel documents 245317-004, 245318-004, 245319-004, October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm