Copyright 2000-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduce speed on Cray systems.  For
systems with cfp floating point, the main obstacle is the forming of
128-bit products.  For IEEE systems, addition, and in particular
computing the carry, is the main issue.  There are no vectorizing
unsigned-less-than instructions, and the sequence that implements that
operation is very long.
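
To illustrate why the carry is costly without an unsigned compare, here
is a minimal C sketch of one well-known way to obtain the carry out of a
64-bit add from logical operations and a shift only.  It is not code
from this directory, and the names limb_t, add_carry_bits and
add_carry_cmp are made up for the example.

```c
#include <stdint.h>

typedef uint64_t limb_t;        /* hypothetical 64-bit limb type */

/* Carry out of u + v, computed without an unsigned compare.  A carry
   occurs when both top bits are set, or when at least one top bit is
   set and the top bit of the sum is clear.  */
static limb_t
add_carry_bits (limb_t u, limb_t v)
{
  limb_t s = u + v;             /* wraps modulo 2^64 */
  return ((u & v) | ((u | v) & ~s)) >> 63;
}

/* The same carry written with the unsigned-less-than compare that,
   according to the text above, has no vector instruction.  */
static limb_t
add_carry_cmp (limb_t u, limb_t v)
{
  return (limb_t) (u + v < u);
}
```

The compare-free form costs several logical operations plus a shift per
limb, which gives a feel for the overhead described above.  How the Cray
compilers actually expand the compare is not shown here.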

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.
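
The reason shifting is easy can be seen in a plain C sketch of a
multi-limb left shift.  The names limb_t and sketch_lshift are
hypothetical, and the mapping onto the vector shift instructions
mentioned above is left to the compiler and not shown.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* hypothetical 64-bit limb type */

/* Shift {up, n} left by cnt bits (0 < cnt < 64, n >= 1), store the low
   n limbs at rp and return the bits shifted out of the top limb.
   Limbs are least significant first.  In this sketch rp must not
   overlap up.  */
static limb_t
sketch_lshift (limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
  size_t i;

  rp[0] = up[0] << cnt;
  /* Each result limb depends on two source limbs only, so the
     iterations are independent and there is no carry chain.  */
  for (i = 1; i < n; i++)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  return up[n - 1] >> (64 - cnt);
}
```

Unlike addition, no result limb depends on another result limb, which
is what makes the loop straightforward to vectorize.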

For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
into different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)
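
The table rows are consistent with a simple model, sketched below in C.
The model is an assumption read off the numbers, not a derivation given
in this file: a product of a ubits-wide part and a vbits-wide part has
ubits + vbits bits, sums of such products must fit in 48 bits (the value
48 is inferred from the tables and matches the width of the Cray
floating-point coefficient), and two products are accumulated per
cycle.  All names are hypothetical.

```c
/* Assumed accumulator width in bits, inferred from the tables.  */
#define ACC_BITS 48

/* Largest vn such that vn products of ubits + vbits bits each can be
   summed without exceeding ACC_BITS bits.  */
static unsigned long
max_allowed_vn (unsigned ubits, unsigned vbits)
{
  return 1UL << (ACC_BITS - ubits - vbits);
}

/* Every part of U is multiplied by every part of V.  */
static unsigned
number_of_multiplies (unsigned uparts, unsigned vparts)
{
  return uparts * vparts;
}

/* Peak cycles per limb, assuming two products accumulated per cycle.  */
static double
peak_cycles_per_limb (unsigned uparts, unsigned vparts)
{
  return number_of_multiplies (uparts, vparts) / 2.0;
}
```

For example, max_allowed_vn (32, 13) gives 8 and peak_cycles_per_limb
(3, 3) gives 4.5, matching the corresponding table entries.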

IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1: add limbs independently, recording each carry out.  */
    #pragma _CRI ivdep
    for (i = 0; i < n; i++)
      { s = up[i] + vp[i];
        rp[i] = s;
        cy[i + 1] = s < up[i]; }
    /* Pass 2: add each recorded carry to the next limb.  */
    more_carries = 0;
    #pragma _CRI ivdep
    for (i = 1; i < n; i++)
      { s = rp[i] + cy[i];
        rp[i] = s;
        more_carries += s < cy[i]; }
    /* Pass 3 (rare): ripple the carries generated by pass 2.  */
    cys = 0;
    if (more_carries)
      {
        cys = rp[1] < cy[1];
        for (i = 2; i < n; i++)
          { c2 = rp[i] < cy[i];   /* carry out of limb i in pass 2 */
            rp[i] += cys;
            cys = c2 | (rp[i] < cys); }
      }
    return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.
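
The three-operand idea can be sketched in portable C as follows.  This
is only an illustration of the description above, not an existing GMP
function: the names limb_t and sketch_add3_n, the caller-provided
scratch array cy and the 64-bit limb size are assumptions, and the Cray
vectorization pragmas are omitted.  The final ripple step tracks the
carry out of each limb from the preceding pass explicitly.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* hypothetical 64-bit limb type */

/* rp = up + vp + wp for n >= 1 limbs, least significant limb first.
   cy is scratch space of n + 1 entries.  Returns the carry out, which
   is 0, 1 or 2.  */
static limb_t
sketch_add3_n (limb_t *rp, const limb_t *up, const limb_t *vp,
               const limb_t *wp, size_t n, limb_t *cy)
{
  size_t i;
  limb_t more_carries = 0, cys = 0;

  /* Pass 1: add operands 1 and 2, generating cy[].  */
  for (i = 0; i < n; i++)
    {
      limb_t s = up[i] + vp[i];
      rp[i] = s;
      cy[i + 1] = s < up[i];
    }
  /* Pass 2: add operand 3, accumulating carry into cy[].  */
  for (i = 0; i < n; i++)
    {
      limb_t s = rp[i] + wp[i];
      rp[i] = s;
      cy[i + 1] += s < wp[i];
    }
  /* Pass 3: add each deferred carry (0, 1 or 2) to the next limb.  */
  for (i = 1; i < n; i++)
    {
      limb_t s = rp[i] + cy[i];
      rp[i] = s;
      more_carries += s < cy[i];
    }
  /* Pass 4 (rare): ripple the carries generated by pass 3.  */
  if (more_carries)
    {
      cys = rp[1] < cy[1];
      for (i = 2; i < n; i++)
        {
          limb_t c3 = rp[i] < cy[i];    /* carry out of limb i in pass 3 */
          rp[i] += cys;
          cys = c3 | (rp[i] < cys);
        }
    }
  return cys + cy[n];
}
```

Passes 1 to 3 have no dependency between iterations, so only the rarely
executed pass 4 is inherently serial.  Dropping pass 2 gives the
two-operand case.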

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
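
A minimal C sketch of why fewer bits per limb help addition: with 62
data bits in a 64-bit word, the sum of two limbs plus a carry cannot
wrap, so the carry is read off with a shift instead of an unsigned
compare.  The names limb_t, LIMB_BITS, LIMB_MASK and sketch_add_n_62 are
hypothetical.  The loop below is a plain serial one and makes no attempt
to reproduce the cycle counts quoted above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* hypothetical 64-bit storage word */

#define LIMB_BITS 62            /* data bits stored per limb */
#define LIMB_MASK ((((limb_t) 1) << LIMB_BITS) - 1)

/* rp = up + vp for n limbs of LIMB_BITS bits each (top two bits of
   every input word are zero), least significant limb first.  Returns
   the carry out (0 or 1).  */
static limb_t
sketch_add_n_62 (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      limb_t s = up[i] + vp[i] + cy;    /* at most 63 bits, cannot wrap */
      rp[i] = s & LIMB_MASK;
      cy = s >> LIMB_BITS;              /* carry extracted by a shift */
    }
  return cy;
}
```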