dnl Itanium-2 mpn_modexact_1c_odd -- mpn by 1 exact remainder. dnl Contributed to the GNU project by Kevin Ryde. dnl Copyright 2003-2005 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of either: dnl dnl * the GNU Lesser General Public License as published by the Free dnl Software Foundation; either version 3 of the License, or (at your dnl option) any later version. dnl dnl or dnl dnl * the GNU General Public License as published by the Free Software dnl Foundation; either version 2 of the License, or (at your option) any dnl later version. dnl dnl or both in parallel, as here. dnl dnl The GNU MP Library is distributed in the hope that it will be useful, but dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License dnl for more details. dnl dnl You should have received copies of the GNU General Public License and the dnl GNU Lesser General Public License along with the GNU MP Library. If not, dnl see https://www.gnu.org/licenses/. include(`../config.m4') C cycles/limb C Itanium: 15 C Itanium 2: 8 dnl Usage: ABI32(`code') dnl dnl Emit the given code only under HAVE_ABI_32. dnl define(ABI32, m4_assert_onearg() `ifdef(`HAVE_ABI_32',`$1')') C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size, C mp_limb_t divisor, mp_limb_t carry); C C The modexact algorithm is usually conceived as a dependent chain C C l = src[i] - c C q = low(l * inverse) C c = high(q*divisor) + (src[i]=2), and the C calculation of q by the initial different scheme. C C C Entry Sequence: C C In the entry sequence, the critical path is the calculation of the C inverse, so this is begun first and optimized. Apart from that, ar.lc is C established nice and early so the br.cloop's should predict perfectly. C And the load for the low limbs src[0] and src[1] can be initiated long C ahead of where they're needed. C C C Inverse Calculation: C C The initial 8-bit inverse is calculated using a table lookup. If it hits C L1 (which is likely if we're called several times) then it should take a C total 4 cycles, otherwise hopefully L2 for 9 cycles. This is considered C the best approach, on balance. It could be done bitwise, but that would C probably be about 14 cycles (2 per bit beyond the first couple). Or it C could be taken from 4 bits to 8 with xmpy doubling as used beyond 8 bits, C but that would be about 11 cycles. C C The table is not the same as binvert_limb_table, instead it's 256 bytes, C designed to be indexed by the low byte of the divisor. The divisor is C always odd, so the relevant data is every second byte in the table. The C padding lets us use zxt1 instead of extr.u, the latter would cost an extra C cycle because it must go down I0, and we're using the first I0 slot to get C ip. The extra 128 bytes of padding should be insignificant compared to C typical ia64 code bloat. C C Having the table in .text allows us to use IP-relative addressing, C avoiding a fetch from ltoff. .rodata is apparently not suitable for use C IP-relative, it gets a linker relocation overflow on GNU/Linux. C C C Load Scheduling: C C In the main loop, the data loads are scheduled for an L2 hit, which means C 6 cycles for the data ready to use. In fact we end up 7 cycles ahead. In C any case that scheduling is achieved simply by doing the load (and xmpy.l C for "si") in the immediately preceding iteration. C C The main loop requires size >= 2, and we handle size==1 by an initial C br.cloop to enter the loop only if size>1. Since ar.lc is established C early, this should predict perfectly. C C C Not done: C C Consideration was given to using a plain "(src[0]-c) % divisor" for C size==1, but cycle counting suggests about 50 for the sort of approach C taken by gcc __umodsi3, versus about 47 for the modexact. (Both assuming C L1 hits for their respective fetching.) C C Consideration was given to a test for high 1 ;; C size==1, finish up now xma.hu f9 = f10, f6, f9 C c = high(q * divisor + c) mov ar.lc = r2 C I0 ;; getf.sig r8 = f9 C M2 return c br.ret.sptk.many b0 .Ltop: C r2 saved ar.lc C f6 divisor C f7 inverse C f8 -inverse C f9 carry C f10 src[i] * inverse C f11 scratch src[i+1] add r16 = 160, r32 ldf8 f11 = [r32], 8 C src[i+1] ;; C 2 cycles lfetch [r16] xma.l f10 = f9, f8, f10 C q = c * -inverse + si ;; C 3 cycles .Lentry: xma.hu f9 = f10, f6, f9 C c = high(q * divisor + c) xmpy.l f10 = f11, f7 C si = src[i] * inverse br.cloop.sptk.few.clr .Ltop ;; xma.l f10 = f9, f8, f10 C q = c * -inverse + si mov ar.lc = r2 C I0 ;; xma.hu f9 = f10, f6, f9 C c = high(q * divisor + c) ;; getf.sig r8 = f9 C M2 return c br.ret.sptk.many b0 EPILOGUE()