Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



There are 5 generations of 64-bit s390 processors: z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code makes use of a loop around MVC.  It almost surely runs very
close to optimally.  A small improvement would be to use a single MVC
for a size of exactly 256 bytes; at present we use two (an extra MVC
is used when copying any multiple of 256 bytes).


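The extra-MVC remark for mpn_copyi can be made concrete by counting
instructions.  A single MVC moves at most 256 bytes.  The sketch below
only counts MVCs and copies nothing; the function names are
hypothetical, and the handling of a zero-byte copy is an assumption.

```c
#include <stddef.h>

/* Hypothetical helpers: they count MVC instructions, they do not copy. */

/* Scheme described above: one MVC per full 256-byte block plus one
   more, even when no bytes are left over.  Zero bytes is assumed to
   need no MVC at all. */
size_t mvc_count_current(size_t nbytes)
{
    if (nbytes == 0)
        return 0;
    return nbytes / 256 + 1;
}

/* Count if the trailing MVC were skipped for exact multiples of 256. */
size_t mvc_count_ideal(size_t nbytes)
{
    return (nbytes + 255) / 256;
}
```

The two counts differ only for non-zero sizes that are exact multiples
of 256 bytes, which is why the possible gain is small.

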
mpn_copyd

We have tried several feed-in variants here: branch tree, jump table,
and computed goto.  The fastest (on z990) turned out to be computed
goto.

An approach not yet tried is EX of LMG and STMG, modifying the register
set on-the-fly.  Using that trick, we could completely avoid using
separate feed-in paths.


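As an illustration of what "feed-in" means for mpn_copyd, here is a
minimal C sketch of a decrementing copy that enters an unrolled loop
at a point chosen from n mod 4.  It is not the assembly code: the
switch statement stands in for the computed goto, and the 4-way
unrolling, limb_t and copyd_sketch are assumptions made for the
example.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Copy n limbs from src to dst, highest limb first.  The switch jumps
   into the middle of the unrolled loop body, so the first (partial)
   round handles n mod 4 limbs and every later round handles 4. */
void copyd_sketch(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return;

    size_t i = n;                 /* index one past the next limb */
    size_t rounds = (n + 3) / 4;  /* loop rounds, counting the partial one */

    switch (n % 4) {
    case 0: do { --i; dst[i] = src[i];
    case 3:      --i; dst[i] = src[i];
    case 2:      --i; dst[i] = src[i];
    case 1:      --i; dst[i] = src[i];
            } while (--rounds > 0);
    }
}
```

Because the copy runs from the top down, it also works when dst
overlaps src at a higher address.

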
mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwidth on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled more, at
least 8-way, in order to reach 2.5 cycles per limb (c/l).


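For readers unfamiliar with the shape of a 4-way unrolled mpn_add_n,
the following C sketch shows the structure: four add-with-carry steps
per loop round, then a short tail loop.  It is only a model of the
technique, not the assembly code; limb_t, add_limb and add_n_sketch
are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* *r = a + b + cy, returning the carry out (0 or 1). */
static inline limb_t add_limb(limb_t *r, limb_t a, limb_t b, limb_t cy)
{
    limb_t s = a + b;
    limb_t c1 = s < a;      /* carry from a + b */
    limb_t t = s + cy;
    limb_t c2 = t < s;      /* carry from adding the incoming carry */
    *r = t;
    return c1 | c2;         /* at most one of the two can be set */
}

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1], returning the final carry. */
limb_t add_n_sketch(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    /* Main loop: four limbs per round. */
    for (; i + 4 <= n; i += 4) {
        cy = add_limb(&rp[i],     ap[i],     bp[i],     cy);
        cy = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], cy);
        cy = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], cy);
        cy = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], cy);
    }
    /* Tail: the remaining 0 to 3 limbs. */
    for (; i < n; i++)
        cy = add_limb(&rp[i], ap[i], bp[i], cy);

    return cy;
}
```

Unrolling 8-way would double the body of the main loop, spreading the
loop overhead over twice as many limbs.

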
mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10, one would need to cluster at least 4 MLGR instructions together
in order to reduce stalling.

On z196, one surely wants to use unrolling and pipelining, to perhaps
reach around 12 c/l.  A major issue here and on z10 is ALCGR's 3-cycle
stall.


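To show the work each limb of mpn_addmul_1 requires, here is a
straightforward C model of the operation: one 64x64->128 multiply
(the job MLGR does) followed by carry-propagating additions (the job
of ALCGR and friends).  This is a sketch of the operation, not the
assembly code.  It assumes a compiler with unsigned __int128 (a
GCC/Clang extension); limb_t, dlimb_t and addmul_1_sketch are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension: double limb */

/* rp[0..n-1] += ap[0..n-1] * b, returning the carry-out limb. */
limb_t addmul_1_sketch(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    limb_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        /* 128-bit product plus two limbs cannot overflow 128 bits:
           (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        dlimb_t t = (dlimb_t)ap[i] * b + rp[i] + cy;
        rp[i] = (limb_t)t;          /* low limb goes to the result */
        cy = (limb_t)(t >> 64);     /* high limb carries to the next step */
    }
    return cy;
}
```

Each iteration depends on the carry limb of the previous one, so
clustering several multiplies, as suggested for z10, means computing a
few products first and only then running the additions.

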
mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128-bit cross product, on z990.
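
The saving from Karatsuba on a 2-limb unit is that a 2x2 limb product
needs three 64x64->128 multiplications instead of four.  The C sketch
below shows one such 2x2 block using the subtractive form of the
identity, a0*b1 + a1*b0 = a0*b0 + a1*b1 + (a0-a1)*(b1-b0).  It is an
illustration only, not a proposed mul_2 implementation.  It assumes a
compiler with unsigned __int128 (a GCC/Clang extension), and limb_t,
dlimb_t and mul_2x2_karatsuba are hypothetical names.

```c
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension: double limb */

/* r[0..3] = a[0..1] * b[0..1], least significant limb first,
   using three 64x64->128 multiplications instead of four. */
void mul_2x2_karatsuba(limb_t r[4], const limb_t a[2], const limb_t b[2])
{
    dlimb_t lo = (dlimb_t)a[0] * b[0];
    dlimb_t hi = (dlimb_t)a[1] * b[1];

    /* |a0 - a1| and |b1 - b0|, with the sign of their product kept apart. */
    int neg = (a[0] < a[1]) ^ (b[1] < b[0]);
    limb_t da = a[0] < a[1] ? a[1] - a[0] : a[0] - a[1];
    limb_t db = b[1] < b[0] ? b[0] - b[1] : b[1] - b[0];
    dlimb_t m = (dlimb_t)da * db;

    /* mid = a0*b1 + a1*b0 = lo + hi +/- m.  It can need 129 bits, so
       the top bit is tracked separately in c. */
    dlimb_t mid = lo + hi;
    limb_t c = mid < lo;
    if (neg) {
        c -= mid < m;       /* borrow; the true value is never negative */
        mid -= m;
    } else {
        mid += m;
        c += mid < m;
    }

    /* Assemble lo + mid * 2^64 + hi * 2^128. */
    dlimb_t t;
    r[0] = (limb_t)lo;
    t = (dlimb_t)(limb_t)(lo >> 64) + (limb_t)mid;
    r[1] = (limb_t)t;
    t = (t >> 64) + (limb_t)hi + (limb_t)(mid >> 64);
    r[2] = (limb_t)t;
    r[3] = (limb_t)(hi >> 64) + c + (limb_t)(t >> 64);
}
```

The price of the saved multiplication is the extra subtractions, sign
handling and carry bookkeeping, which is why the trade only pays off
where MLGR is very slow.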