|
Packit |
67cb25 |
.. index:: IEEE floating point
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. _chap_ieee:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
******************************
|
|
Packit |
67cb25 |
IEEE floating-point arithmetic
|
|
Packit |
67cb25 |
******************************
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This chapter describes functions for examining the representation of
|
|
Packit |
67cb25 |
floating point numbers and controlling the floating point environment of
|
|
Packit |
67cb25 |
your program. The functions described in this chapter are declared in
|
|
Packit |
67cb25 |
the header file :file:`gsl_ieee_utils.h`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: IEEE format for floating point numbers
|
|
Packit |
67cb25 |
single: bias, IEEE format
|
|
Packit |
67cb25 |
single: exponent, IEEE format
|
|
Packit |
67cb25 |
single: sign bit, IEEE format
|
|
Packit |
67cb25 |
single: mantissa, IEEE format
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Representation of floating point numbers
|
|
Packit |
67cb25 |
========================================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The IEEE Standard for Binary Floating-Point Arithmetic defines binary
|
|
Packit |
67cb25 |
formats for single and double precision numbers. Each number is composed
|
|
Packit |
67cb25 |
of three parts: a *sign bit* (:math:`s`), an *exponent*
|
|
Packit |
67cb25 |
(:math:`E`) and a *fraction* (:math:`f`). The numerical value of the
|
|
Packit |
67cb25 |
combination :math:`(s,E,f)` is given by the following formula,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: (-1)^s (1 \cdot fffff\dots) 2^E
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
(-1)^s (1.fffff...) 2^E
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: normalized form, IEEE format
|
|
Packit |
67cb25 |
single: denormalized form, IEEE format
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The sign bit is either zero or one. The exponent ranges from a minimum value
|
|
Packit |
67cb25 |
:math:`E_{min}`
|
|
Packit |
67cb25 |
to a maximum value
|
|
Packit |
67cb25 |
:math:`E_{max}`
|
|
Packit |
67cb25 |
depending on the precision. The exponent is converted to an
|
|
Packit |
67cb25 |
unsigned number
|
|
Packit |
67cb25 |
:math:`e`, known as the *biased exponent*, for storage by adding a
|
|
Packit |
67cb25 |
*bias* parameter,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: e = E + \hbox{\it bias}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
e = E + bias
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The sequence :math:`fffff...` represents the digits of the binary
|
|
Packit |
67cb25 |
fraction :math:`f`. The binary digits are stored in *normalized
|
|
Packit |
67cb25 |
form*, by adjusting the exponent to give a leading digit of :math:`1`.
|
|
Packit |
67cb25 |
Since the leading digit is always 1 for normalized numbers it is
|
|
Packit |
67cb25 |
assumed implicitly and does not have to be stored.
|
|
Packit |
67cb25 |
Numbers smaller than
|
|
Packit |
67cb25 |
:math:`2^{E_{min}}`
|
|
Packit |
67cb25 |
are be stored in *denormalized form* with a leading zero,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: (-1)^s (0 \cdot fffff\dots) 2^{E_{min}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
(-1)^s (0.fffff...) 2^(E_min)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: zero, IEEE format
|
|
Packit |
67cb25 |
single: infinity, IEEE format
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This allows gradual underflow down to
|
|
Packit |
67cb25 |
:math:`2^{E_{min} - p}`
|
|
Packit |
67cb25 |
for :math:`p` bits of precision.
|
|
Packit |
67cb25 |
A zero is encoded with the special exponent of
|
|
Packit |
67cb25 |
:math:`2^{E_{min}-1}`
|
|
Packit |
67cb25 |
and infinities with the exponent of
|
|
Packit |
67cb25 |
:math:`2^{E_{max}+1}`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: single precision, IEEE format
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The format for single precision numbers uses 32 bits divided in the
|
|
Packit |
67cb25 |
following way::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
seeeeeeeefffffffffffffffffffffff
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
s = sign bit, 1 bit
|
|
Packit |
67cb25 |
e = exponent, 8 bits (E_min=-126, E_max=127, bias=127)
|
|
Packit |
67cb25 |
f = fraction, 23 bits
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: double precision, IEEE format
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The format for double precision numbers uses 64 bits divided in the
|
|
Packit |
67cb25 |
following way::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
s = sign bit, 1 bit
|
|
Packit |
67cb25 |
e = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023)
|
|
Packit |
67cb25 |
f = fraction, 52 bits
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
It is often useful to be able to investigate the behavior of a
|
|
Packit |
67cb25 |
calculation at the bit-level and the library provides functions for
|
|
Packit |
67cb25 |
printing the IEEE representations in a human-readable form.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. float vs double vs long double
|
|
Packit |
67cb25 |
.. (how many digits are available for each)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: void gsl_ieee_fprintf_float (FILE * stream, const float * x)
|
|
Packit |
67cb25 |
void gsl_ieee_fprintf_double (FILE * stream, const double * x)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions output a formatted version of the IEEE floating-point
|
|
Packit |
67cb25 |
number pointed to by :data:`x` to the stream :data:`stream`. A pointer is
|
|
Packit |
67cb25 |
used to pass the number indirectly, to avoid any undesired promotion
|
|
Packit |
67cb25 |
from :code:`float` to :code:`double`. The output takes one of the
|
|
Packit |
67cb25 |
following forms,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:code:`NaN`
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
the Not-a-Number symbol
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:code:`Inf, -Inf`
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
positive or negative infinity
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:code:`1.fffff...*2^E, -1.fffff...*2^E`
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
a normalized floating point number
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:code:`0.fffff...*2^E, -0.fffff...*2^E`
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
a denormalized floating point number
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:code:`0, -0`
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
positive or negative zero
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The output can be used directly in GNU Emacs Calc mode by preceding it
|
|
Packit |
67cb25 |
with :code:`2#` to indicate binary.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. @item [non-standard IEEE float], [non-standard IEEE double]
|
|
Packit |
67cb25 |
.. an unrecognized encoding
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: void gsl_ieee_printf_float (const float * x)
|
|
Packit |
67cb25 |
void gsl_ieee_printf_double (const double * x)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions output a formatted version of the IEEE floating-point
|
|
Packit |
67cb25 |
number pointed to by :data:`x` to the stream :code:`stdout`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The following program demonstrates the use of the functions by printing
|
|
Packit |
67cb25 |
the single and double precision representations of the fraction
|
|
Packit |
67cb25 |
:math:`1/3`. For comparison the representation of the value promoted from
|
|
Packit |
67cb25 |
single to double precision is also printed.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/ieee.c
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The binary representation of :math:`1/3` is :math:`0.01010101...`. The
|
|
Packit |
67cb25 |
output below shows that the IEEE format normalizes this fraction to give
|
|
Packit |
67cb25 |
a leading digit of 1::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
f= 1.01010101010101010101011*2^-2
|
|
Packit |
67cb25 |
fd= 1.0101010101010101010101100000000000000000000000000000*2^-2
|
|
Packit |
67cb25 |
d= 1.0101010101010101010101010101010101010101010101010101*2^-2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The output also shows that a single-precision number is promoted to
|
|
Packit |
67cb25 |
double-precision by adding zeros in the binary representation.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. importance of using 1.234L in long double calculations
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. @example
|
|
Packit |
67cb25 |
.. int main (void)
|
|
Packit |
67cb25 |
.. @{
|
|
Packit |
67cb25 |
.. long double x = 1.0, y = 1.0;
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. x = x + 0.2;
|
|
Packit |
67cb25 |
.. y = y + 0.2L;
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. printf(" d %.20Lf\n",x);
|
|
Packit |
67cb25 |
.. printf("ld %.20Lf\n",y);
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. return 1;
|
|
Packit |
67cb25 |
.. @}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. d 1.20000000000000001110
|
|
Packit |
67cb25 |
.. ld 1.20000000000000000004
|
|
Packit |
67cb25 |
.. @end example
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: IEEE exceptions
|
|
Packit |
67cb25 |
single: precision, IEEE arithmetic
|
|
Packit |
67cb25 |
single: rounding mode
|
|
Packit |
67cb25 |
single: arithmetic exceptions
|
|
Packit |
67cb25 |
single: exceptions, IEEE arithmetic
|
|
Packit |
67cb25 |
single: division by zero, IEEE exceptions
|
|
Packit |
67cb25 |
single: underflow, IEEE exceptions
|
|
Packit |
67cb25 |
single: overflow, IEEE exceptions
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Setting up your IEEE environment
|
|
Packit |
67cb25 |
================================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The IEEE standard defines several *modes* for controlling the
|
|
Packit |
67cb25 |
behavior of floating point operations. These modes specify the important
|
|
Packit |
67cb25 |
properties of computer arithmetic: the direction used for rounding (e.g.
|
|
Packit |
67cb25 |
whether numbers should be rounded up, down or to the nearest number),
|
|
Packit |
67cb25 |
the rounding precision and how the program should handle arithmetic
|
|
Packit |
67cb25 |
exceptions, such as division by zero.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Many of these features can now be controlled via standard functions such
|
|
Packit |
67cb25 |
as :func:`fpsetround`, which should be used whenever they are available.
|
|
Packit |
67cb25 |
Unfortunately in the past there has been no universal API for
|
|
Packit |
67cb25 |
controlling their behavior---each system has had its own low-level way
|
|
Packit |
67cb25 |
of accessing them. To help you write portable programs GSL allows you
|
|
Packit |
67cb25 |
to specify modes in a platform-independent way using the environment
|
|
Packit |
67cb25 |
variable :macro:`GSL_IEEE_MODE`. The library then takes care of all the
|
|
Packit |
67cb25 |
necessary machine-specific initializations for you when you call the
|
|
Packit |
67cb25 |
function :func:`gsl_ieee_env_setup`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. macro:: GSL_IEEE_MODE
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Environment variable which specifies IEEE mode.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: void gsl_ieee_env_setup ()
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function reads the environment variable :macro:`GSL_IEEE_MODE` and
|
|
Packit |
67cb25 |
attempts to set up the corresponding specified IEEE modes. The
|
|
Packit |
67cb25 |
environment variable should be a list of keywords, separated by
|
|
Packit |
67cb25 |
commas, like this::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
GSL_IEEE_MODE = "keyword, keyword, ..."
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :data:`keyword` is one of the following mode-names::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
single-precision
|
|
Packit |
67cb25 |
double-precision
|
|
Packit |
67cb25 |
extended-precision
|
|
Packit |
67cb25 |
round-to-nearest
|
|
Packit |
67cb25 |
round-down
|
|
Packit |
67cb25 |
round-up
|
|
Packit |
67cb25 |
round-to-zero
|
|
Packit |
67cb25 |
mask-all
|
|
Packit |
67cb25 |
mask-invalid
|
|
Packit |
67cb25 |
mask-denormalized
|
|
Packit |
67cb25 |
mask-division-by-zero
|
|
Packit |
67cb25 |
mask-overflow
|
|
Packit |
67cb25 |
mask-underflow
|
|
Packit |
67cb25 |
trap-inexact
|
|
Packit |
67cb25 |
trap-common
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
If :macro:`GSL_IEEE_MODE` is empty or undefined then the function returns
|
|
Packit |
67cb25 |
immediately and no attempt is made to change the system's IEEE
|
|
Packit |
67cb25 |
mode. When the modes from :macro:`GSL_IEEE_MODE` are turned on the
|
|
Packit |
67cb25 |
function prints a short message showing the new settings to remind you
|
|
Packit |
67cb25 |
that the results of the program will be affected.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
If the requested modes are not supported by the platform being used then
|
|
Packit |
67cb25 |
the function calls the error handler and returns an error code of
|
|
Packit |
67cb25 |
:macro:`GSL_EUNSUP`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
When options are specified using this method, the resulting mode is
|
|
Packit |
67cb25 |
based on a default setting of the highest available precision (double
|
|
Packit |
67cb25 |
precision or extended precision, depending on the platform) in
|
|
Packit |
67cb25 |
round-to-nearest mode, with all exceptions enabled apart from the
|
|
Packit |
67cb25 |
INEXACT exception. The INEXACT exception is generated
|
|
Packit |
67cb25 |
whenever rounding occurs, so it must generally be disabled in typical
|
|
Packit |
67cb25 |
scientific calculations. All other floating-point exceptions are
|
|
Packit |
67cb25 |
enabled by default, including underflows and the use of denormalized
|
|
Packit |
67cb25 |
numbers, for safety. They can be disabled with the individual
|
|
Packit |
67cb25 |
:code:`mask-` settings or together using :code:`mask-all`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The following adjusted combination of modes is convenient for many
|
|
Packit |
67cb25 |
purposes::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
GSL_IEEE_MODE="double-precision,"\
|
|
Packit |
67cb25 |
"mask-underflow,"\
|
|
Packit |
67cb25 |
"mask-denormalized"
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This choice ignores any errors relating to small numbers (either
|
|
Packit |
67cb25 |
denormalized, or underflowing to zero) but traps overflows, division by
|
|
Packit |
67cb25 |
zero and invalid operations.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Note that on the x86 series of processors this function sets both the
|
|
Packit |
67cb25 |
original x87 mode and the newer MXCSR mode, which controls SSE
|
|
Packit |
67cb25 |
floating-point operations. The SSE floating-point units do not have a
|
|
Packit |
67cb25 |
precision-control bit, and always work in double-precision. The
|
|
Packit |
67cb25 |
single-precision and extended-precision keywords have no effect in
|
|
Packit |
67cb25 |
this case.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
To demonstrate the effects of different rounding modes consider the
|
|
Packit |
67cb25 |
following program which computes :math:`e`, the base of natural
|
|
Packit |
67cb25 |
logarithms, by summing a rapidly-decreasing series,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
e &= {1 \over 0!} + {1 \over 1!} + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots \\
|
|
Packit |
67cb25 |
&= 2.71828182846...
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
e = 1 + 1/2! + 1/3! + 1/4! + ...
|
|
Packit |
67cb25 |
= 2.71828182846...
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/ieeeround.c
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Here are the results of running the program in :code:`round-to-nearest`
|
|
Packit |
67cb25 |
mode. This is the IEEE default so it isn't really necessary to specify
|
|
Packit |
67cb25 |
it here::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
$ GSL_IEEE_MODE="round-to-nearest" ./a.out
|
|
Packit |
67cb25 |
i= 1 sum=1.000000000000000000 error=-1.71828
|
|
Packit |
67cb25 |
i= 2 sum=2.000000000000000000 error=-0.718282
|
|
Packit |
67cb25 |
....
|
|
Packit |
67cb25 |
i=18 sum=2.718281828459045535 error=4.44089e-16
|
|
Packit |
67cb25 |
i=19 sum=2.718281828459045535 error=4.44089e-16
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
After nineteen terms the sum converges to within :math:`4 \times 10^{-16}`
|
|
Packit |
67cb25 |
of the correct value.
|
|
Packit |
67cb25 |
If we now change the rounding mode to
|
|
Packit |
67cb25 |
:code:`round-down` the final result is less accurate::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
$ GSL_IEEE_MODE="round-down" ./a.out
|
|
Packit |
67cb25 |
i= 1 sum=1.000000000000000000 error=-1.71828
|
|
Packit |
67cb25 |
....
|
|
Packit |
67cb25 |
i=19 sum=2.718281828459041094 error=-3.9968e-15
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The result is about
|
|
Packit |
67cb25 |
:math:`4 \times 10^{-15}`
|
|
Packit |
67cb25 |
below the correct value, an order of magnitude worse than the result
|
|
Packit |
67cb25 |
obtained in the :code:`round-to-nearest` mode.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
If we change to rounding mode to :code:`round-up` then the final result
|
|
Packit |
67cb25 |
is higher than the correct value (when we add each term to the sum the
|
|
Packit |
67cb25 |
final result is always rounded up, which increases the sum by at least
|
|
Packit |
67cb25 |
one tick until the added term underflows to zero). To avoid this
|
|
Packit |
67cb25 |
problem we would need to use a safer converge criterion, such as
|
|
Packit |
67cb25 |
:code:`while (fabs(sum - oldsum) > epsilon)`, with a suitably chosen
|
|
Packit |
67cb25 |
value of epsilon.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Finally we can see the effect of computing the sum using
|
|
Packit |
67cb25 |
single-precision rounding, in the default :code:`round-to-nearest`
|
|
Packit |
67cb25 |
mode. In this case the program thinks it is still using double precision
|
|
Packit |
67cb25 |
numbers but the CPU rounds the result of each floating point operation
|
|
Packit |
67cb25 |
to single-precision accuracy. This simulates the effect of writing the
|
|
Packit |
67cb25 |
program using single-precision :code:`float` variables instead of
|
|
Packit |
67cb25 |
:code:`double` variables. The iteration stops after about half the number
|
|
Packit |
67cb25 |
of iterations and the final result is much less accurate::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
$ GSL_IEEE_MODE="single-precision" ./a.out
|
|
Packit |
67cb25 |
....
|
|
Packit |
67cb25 |
i=12 sum=2.718281984329223633 error=1.5587e-07
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
with an error of
|
|
Packit |
67cb25 |
:math:`O(10^{-7})`,
|
|
Packit |
67cb25 |
which corresponds to single
|
|
Packit |
67cb25 |
precision accuracy (about 1 part in :math:`10^7`). Continuing the
|
|
Packit |
67cb25 |
iterations further does not decrease the error because all the
|
|
Packit |
67cb25 |
subsequent results are rounded to the same value.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
References and Further Reading
|
|
Packit |
67cb25 |
==============================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The reference for the IEEE standard is,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
A more pedagogical introduction to the standard can be found in the
|
|
Packit |
67cb25 |
following paper,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* David Goldberg: What Every Computer Scientist Should Know About
|
|
Packit |
67cb25 |
Floating-Point Arithmetic. *ACM Computing Surveys*, Vol.: 23, No.: 1
|
|
Packit |
67cb25 |
(March 1991), pages 5--48.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Corrigendum: *ACM Computing Surveys*, Vol.: 23, No.: 3 (September
|
|
Packit |
67cb25 |
1991), page 413. and see also the sections by B. A. Wichmann and Charles
|
|
Packit |
67cb25 |
B. Dunham in Surveyor's Forum: "What Every Computer Scientist Should
|
|
Packit |
67cb25 |
Know About Floating-Point Arithmetic". *ACM Computing Surveys*,
|
|
Packit |
67cb25 |
Vol.: 24, No.: 3 (September 1992), page 319.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
A detailed textbook on IEEE arithmetic and its practical use is
|
|
Packit |
67cb25 |
available from SIAM Press,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Michael L. Overton, *Numerical Computing with IEEE Floating Point Arithmetic*,
|
|
Packit |
67cb25 |
SIAM Press, ISBN 0898715717.
|