Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



There are 5 generations of 64-bit s390 processors: z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code uses a loop around MVC.  It almost surely runs very close to
optimally.  A small improvement would be to use a single MVC when the
size is exactly 256 bytes; currently we use two (an extra MVC is issued
whenever the size is a multiple of 256 bytes).


mpn_copyd

We have tried several feed-in variants here: branch tree, jump table,
and computed goto.  The fastest (on z990) turned out to be computed
goto.

An approach not yet tried is EX of LMG and STMG, modifying the register
set on the fly.  Using that trick, we could avoid separate feed-in
paths completely.
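
As a rough illustration of what a feed-in path does, here is a minimal
C sketch of a decreasing-address copy with a 4-way unrolled main loop.
It is not the assembly in this directory: the name copyd_sketch and the
limb_t typedef are made up for the example, and a switch statement
stands in for the feed-in, since computed goto is not standard C.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Copy n limbs from src to dst, highest limb first, so that an
   overlapping destination above the source is handled correctly. */
void copyd_sketch(limb_t *dst, const limb_t *src, size_t n)
{
    size_t i = n;          /* one past the next limb to copy */

    /* Feed-in: dispose of the n mod 4 leftover limbs. */
    switch (n % 4) {
    case 3: i--; dst[i] = src[i];   /* fall through */
    case 2: i--; dst[i] = src[i];   /* fall through */
    case 1: i--; dst[i] = src[i];   /* fall through */
    case 0: break;
    }

    /* Main loop: i is now a multiple of 4. */
    while (i != 0) {
        dst[i - 1] = src[i - 1];
        dst[i - 2] = src[i - 2];
        dst[i - 3] = src[i - 3];
        dst[i - 4] = src[i - 4];
        i -= 4;
    }
}
```

The switch handles the leftover limbs so that the unrolled loop only
ever sees a multiple of 4.  An assembly feed-in would typically enter
the unrolled loop at the right offset instead of running a separate
prologue, which is where the choice between branch tree, jump table,
and computed goto comes in.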


mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwidth on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled more, at
least 8x, in order to reach 2.5 c/l (cycles per limb).
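
To make the unrolling concrete, the following C sketch adds two n-limb
operands with a 4-way unrolled carry chain.  It only illustrates the
loop structure; the names add_limb, add_n_sketch and limb_t are
hypothetical, and since C has no carry flag the carries are recovered
with comparisons where assembly would use add-with-carry instructions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* *r = a + b + cy (mod 2^64); returns the carry out (0 or 1). */
static inline limb_t add_limb(limb_t *r, limb_t a, limb_t b, limb_t cy)
{
    limb_t s = a + b;
    limb_t c1 = s < a;     /* carry from a + b */
    limb_t t = s + cy;
    limb_t c2 = t < s;     /* carry from adding cy; never set with c1 */
    *r = t;
    return c1 | c2;
}

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1]; returns the final carry. */
limb_t add_n_sketch(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    /* 4-way unrolled body: one loop-control overhead per 4 limbs. */
    for (; i + 4 <= n; i += 4) {
        cy = add_limb(&rp[i],     ap[i],     bp[i],     cy);
        cy = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], cy);
        cy = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], cy);
        cy = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], cy);
    }

    /* Wind-down for the remaining 0 to 3 limbs. */
    for (; i < n; i++)
        cy = add_limb(&rp[i], ap[i], bp[i], cy);

    return cy;
}
```

Going from 4x to 8x unrolling would mean eight add_limb steps per
iteration, halving the per-limb share of the loop overhead.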


mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10, one would need to cluster at least 4 MLGR together in order to
reduce stalling.

On z196, one would surely want to use unrolling and pipelining, to
perhaps reach around 12 c/l.  A major issue here and on z10 is ALCGR's
3-cycle stall.
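
The data flow of a clustered addmul_1 can be sketched in C as below,
assuming a compiler that provides unsigned __int128 (GCC and Clang do).
Each 64x64->128 multiplication plays the role of one MLGR, and the
accumulation that follows plays the role of the carry chain.  The names
addmul_1_sketch, limb_t and dlimb_t are hypothetical, and a C compiler
is free to reschedule the operations, so this shows the intended
grouping only, not actual instruction order.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;             /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;   /* GCC/Clang extension */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t addmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Cluster: four independent 64x64->128 products. */
        dlimb_t p0 = (dlimb_t)up[i]     * v;
        dlimb_t p1 = (dlimb_t)up[i + 1] * v;
        dlimb_t p2 = (dlimb_t)up[i + 2] * v;
        dlimb_t p3 = (dlimb_t)up[i + 3] * v;

        /* Carry chain: product + rp[i] + cy always fits in 128 bits. */
        p0 += (dlimb_t)rp[i] + cy;
        rp[i] = (limb_t)p0;      cy = (limb_t)(p0 >> 64);
        p1 += (dlimb_t)rp[i + 1] + cy;
        rp[i + 1] = (limb_t)p1;  cy = (limb_t)(p1 >> 64);
        p2 += (dlimb_t)rp[i + 2] + cy;
        rp[i + 2] = (limb_t)p2;  cy = (limb_t)(p2 >> 64);
        p3 += (dlimb_t)rp[i + 3] + cy;
        rp[i + 3] = (limb_t)p3;  cy = (limb_t)(p3 >> 64);
    }

    /* Wind-down for the remaining 0 to 3 limbs. */
    for (; i < n; i++) {
        dlimb_t p = (dlimb_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)p;
        cy = (limb_t)(p >> 64);
    }
    return cy;
}
```

Dropping the rp[i] term from each accumulation gives the mul_1 data
flow; submul_1 subtracts the products instead and propagates a borrow.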


mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128-bit cross product on z990.
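
The saving comes from computing a 2-limb by 2-limb product with three
64x64->128 multiplications instead of four.  The C sketch below shows
one way to do that, using the subtractive form of Karatsuba so that the
middle multiplication also has 64-bit operands.  That variant is our
choice for the example; nothing above says which form the assembly
would use.  The names mul_2x2_karatsuba_sketch, limb_t and dlimb_t are
hypothetical, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

typedef uint64_t limb_t;             /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;   /* GCC/Clang extension */

/* r[0..3] = a[0..1] * b[0..1], least significant limb first. */
void mul_2x2_karatsuba_sketch(limb_t r[4], const limb_t a[2], const limb_t b[2])
{
    /* Three multiplications: low, high, and product of differences. */
    dlimb_t z0 = (dlimb_t)a[0] * b[0];
    dlimb_t z2 = (dlimb_t)a[1] * b[1];
    limb_t da = a[1] < a[0] ? a[0] - a[1] : a[1] - a[0];
    limb_t db = b[1] < b[0] ? b[0] - b[1] : b[1] - b[0];
    dlimb_t m = (dlimb_t)da * db;
    /* neg is set when (a1 - a0) * (b1 - b0) is negative or zero-signed. */
    int neg = (a[1] < a[0]) != (b[1] < b[0]);

    /* Middle term a0*b1 + a1*b0 = z0 + z2 - (a1 - a0)*(b1 - b0).
       It can need 129 bits, kept as mid_hi * 2^128 + mid. */
    dlimb_t mid = z0 + z2;
    limb_t mid_hi = mid < z0;            /* carry out of 128 bits */
    if (neg) {
        mid += m;
        mid_hi += mid < m;               /* carry */
    } else {
        mid_hi -= mid < m;               /* borrow */
        mid -= m;
    }

    /* Recombine: z0 + mid * 2^64 + z2 * 2^128. */
    dlimb_t t;
    r[0] = (limb_t)z0;
    t = (z0 >> 64) + (limb_t)mid;
    r[1] = (limb_t)t;
    t = (t >> 64) + (mid >> 64) + (limb_t)z2;
    r[2] = (limb_t)t;
    t = (t >> 64) + mid_hi + (z2 >> 64);
    r[3] = (limb_t)t;
}
```

The extra subtractions and the sign handling are cheap next to a slow
multiply, which is the trade the paragraph above has in mind for z900
and z990.  A full mul_2 or addmul_2 would apply a step like this to
each 2-limb unit of a longer operand and accumulate the results.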