Copyright 1997, 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a separate
  bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany     iany     mem      iany  iany     mem   iany  iany     mem
insn2   iaddlog  mem      iany     mem   iaddlog  iany  mem   iaddlog  iany
insn3   mem      iaddlog  iaddlog  fa    fa       fa    fm    fm       fm
insn4   fa/fm    fa/fm    fa/fm    fm    fm       fm    fa    fa       fa
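As an illustration, the issue restrictions above can be written down as a
small checker.  This is only a sketch in Python and is not part of GMP.  The
class names "shift" and "iaddlog" together stand for IANY, the "+cc" suffix
for an instruction that sets the condition codes is a made-up notation, and
the advice about taken branches in slot 4 is not modelled, since it reads as a
recommendation rather than a hard restriction.

```python
INTEGER = ("shift", "iaddlog")           # together these make up IANY
SINGLE = ("fa", "fm", "branch", "mem")   # at most one of each per bundle


def bundle_ok(bundle):
    """Check a bundle, given in slot order, against the restrictions above.

    `bundle` is a list of class names; append "+cc" (hypothetical notation)
    to an integer class if the instruction sets the condition codes.
    """
    if not 1 <= len(bundle) <= 4:
        return False
    kinds = [tok.partition("+")[0] for tok in bundle]
    sets_cc = [tok.endswith("+cc") for tok in bundle]
    if any(k not in INTEGER + SINGLE for k in kinds):
        raise ValueError("unknown instruction class")

    ints = [n for n, k in enumerate(kinds) if k in INTEGER]
    mems = [n for n, k in enumerate(kinds) if k == "mem"]

    # At most two IANY instructions, and at most one of them a shift.
    if len(ints) > 2 or kinds.count("shift") > 1:
        return False
    # A shift sharing the bundle with another integer op must come first.
    if len(ints) == 2 and kinds[ints[1]] == "shift":
        return False
    # One FA, one FM, one branch, one MEM.
    if any(kinds.count(k) > 1 for k in SINGLE):
        return False
    # IANY/IADDLOG/MEM must be insn 1, 2 or 3.
    if any(n > 2 for n in ints + mems):
        return False
    # Of two integer ops, only the second may set the condition codes.
    if len(ints) == 2 and sets_cc[ints[0]]:
        return False
    return True
```

For example, bundle_ok(["shift", "iaddlog", "mem", "fa"]) accepts the first
column of the table, while swapping the first two entries is rejected because
the shift no longer precedes the other integer instruction.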

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.
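To see why floating-point operations can stand in for an integer multiply,
here is a rough Python sketch.  This file does not describe how the operands
are split, so the split used here (u into two 32-bit halves, v into four
16-bit pieces) is an assumption.  It is chosen so that every product stays
below 2^48 and every sum of two products below 2^49, which a double represents
exactly.  The function name is hypothetical, and this is not a description of
the actual assembly code.

```python
def mul_64x64_via_doubles(u, v):
    """Return u*v for 0 <= u, v < 2**64 using only exact double products."""
    assert 0 <= u < 2**64 and 0 <= v < 2**64
    u_parts = [(u >> s) & 0xFFFFFFFF for s in (0, 32)]      # two 32-bit halves
    v_parts = [(v >> s) & 0xFFFF for s in (0, 16, 32, 48)]  # four 16-bit pieces

    # Accumulate the partial products by bit weight, in floating point.
    sums = {}
    for i, up in enumerate(u_parts):
        for j, vp in enumerate(v_parts):
            product = float(up) * float(vp)   # below 2**48, hence exact
            weight = 32 * i + 16 * j
            # At most two products share a weight, so each sum is below 2**49.
            sums[weight] = sums.get(weight, 0.0) + product

    # Convert back to integers and recombine with shifts and adds.
    return sum(int(s) << w for w, s in sums.items())
```

The eight multiplies are independent of each other, so on a pipelined
floating-point unit they can be issued back to back, which is the property the
integer multiply lacks.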

Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.
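The effect of these latencies on scheduling can be illustrated with a toy
model.  The Python sketch below counts the cycles of a simple round-robin
schedule on one fully pipelined unit that accepts one operation per cycle.  It
ignores every other resource and stall, and the function name and the model
itself are illustrative assumptions, not measurements.

```python
def schedule_cycles(chains, ops_per_chain, latency):
    """Cycles to finish `chains` independent dependency chains, each of
    `ops_per_chain` operations, interleaved round-robin on one fully
    pipelined unit with the given result latency."""
    if chains == 0 or ops_per_chain == 0:
        return 0
    # Successive operations of one chain must be at least `latency` cycles
    # apart, and the chains take turns, one issue per cycle.
    stride = max(chains, latency)
    last_issue = (ops_per_chain - 1) * stride + (chains - 1)
    return last_issue + latency
```

With twelve operations, one dependent chain at latency 3 takes
schedule_cycles(1, 12, 3) = 36 cycles, while three interleaved chains take
schedule_cycles(3, 4, 3) = 14.  The same three-chain arrangement at latency 4
takes 18 cycles, against 15 for four chains of three operations, which is why
code scheduled for a 3-cycle latency loses ground on a 4-cycle unit.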

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency.  It is still not pipelined, though, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3.  For UltraSPARC-1/2,
  the IEU0 functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss to L2 very often.  The load-use of the std/ldx pairs via the stack
  is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled.  (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

* US1/US2 cache conflict resolving.  The direct-mapped L1 data cache of US1/US2
  is a problem for mul_1, addmul_1 (and a prospective submul_1).  We should
  allocate a larger cache area, and put the stack temp area in a place that
  doesn't cause cache conflicts.
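The cache conflict problem in the last item can be made concrete with a short
Python sketch.  In a direct-mapped cache each memory line has exactly one slot
it can occupy, so two byte ranges evict each other whenever they map to the
same slot.  The cache size (16 KB) and line size (32 bytes) used below are
assumed values, not taken from this file, and the function names are
hypothetical.

```python
CACHE_BYTES = 16 * 1024   # assumed L1 data cache size
LINE_BYTES = 32           # assumed cache line size


def line_set(start, length, cache_bytes=CACHE_BYTES, line_bytes=LINE_BYTES):
    """Cache line indices touched by the byte range [start, start+length)."""
    if length <= 0:
        return set()
    first = start // line_bytes
    last = (start + length - 1) // line_bytes
    n_lines = cache_bytes // line_bytes
    # Direct mapped: memory line m can only live in cache line m mod n_lines.
    return {m % n_lines for m in range(first, last + 1)}


def temp_area_conflicts(temp, operands):
    """True if the temp range shares a cache line index with any operand.

    `temp` and each entry of `operands` are (start, length) pairs in bytes.
    """
    temp_lines = line_set(*temp)
    return any(temp_lines & line_set(*op) for op in operands)
```

For instance, with 100-limb operands at 0x10000 and 0x20000, a 64-byte temp
area at 0x7C000 maps onto the same cache lines as the start of both operands,
whereas moving it 0x400 bytes further on avoids the overlap.  A real fix would
have to pick the offset from the actual operand addresses at run time.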