Blame doc/ieee754.rst

Packit 67cb25
.. index:: IEEE floating point 
Packit 67cb25
Packit 67cb25
.. _chap_ieee:
Packit 67cb25
Packit 67cb25
******************************
Packit 67cb25
IEEE floating-point arithmetic
Packit 67cb25
******************************
Packit 67cb25
Packit 67cb25
This chapter describes functions for examining the representation of
Packit 67cb25
floating point numbers and controlling the floating point environment of
Packit 67cb25
your program.  The functions described in this chapter are declared in
Packit 67cb25
the header file :file:`gsl_ieee_utils.h`.
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: IEEE format for floating point numbers
Packit 67cb25
   single: bias, IEEE format
Packit 67cb25
   single: exponent, IEEE format
Packit 67cb25
   single: sign bit, IEEE format
Packit 67cb25
   single: mantissa, IEEE format
Packit 67cb25
Packit 67cb25
Representation of floating point numbers
Packit 67cb25
========================================
Packit 67cb25
Packit 67cb25
The IEEE Standard for Binary Floating-Point Arithmetic defines binary
Packit 67cb25
formats for single and double precision numbers.  Each number is composed
Packit 67cb25
of three parts: a *sign bit* (:math:`s`), an *exponent*
Packit 67cb25
(:math:`E`) and a *fraction* (:math:`f`).  The numerical value of the
Packit 67cb25
combination :math:`(s,E,f)` is given by the following formula,
Packit 67cb25
Packit 67cb25
.. only:: not texinfo
Packit 67cb25
Packit 67cb25
   .. math:: (-1)^s (1 \cdot fffff\dots) 2^E
Packit 67cb25
Packit 67cb25
.. only:: texinfo
Packit 67cb25
Packit 67cb25
   ::
Packit 67cb25
Packit 67cb25
      (-1)^s (1.fffff...) 2^E
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: normalized form, IEEE format
Packit 67cb25
   single: denormalized form, IEEE format
Packit 67cb25
Packit 67cb25
The sign bit is either zero or one.  The exponent ranges from a minimum value
Packit 67cb25
:math:`E_{min}`
Packit 67cb25
to a maximum value
Packit 67cb25
:math:`E_{max}`
Packit 67cb25
depending on the precision.  The exponent is converted to an 
Packit 67cb25
unsigned number
Packit 67cb25
:math:`e`, known as the *biased exponent*, for storage by adding a
Packit 67cb25
*bias* parameter,
Packit 67cb25
Packit 67cb25
.. only:: not texinfo
Packit 67cb25
Packit 67cb25
   .. math:: e = E + \hbox{\it bias}
Packit 67cb25
Packit 67cb25
.. only:: texinfo
Packit 67cb25
   
Packit 67cb25
   ::
Packit 67cb25
Packit 67cb25
      e = E + bias
Packit 67cb25
Packit 67cb25
The sequence :math:`fffff...` represents the digits of the binary
Packit 67cb25
fraction :math:`f`.  The binary digits are stored in *normalized
Packit 67cb25
form*, by adjusting the exponent to give a leading digit of :math:`1`. 
Packit 67cb25
Since the leading digit is always 1 for normalized numbers it is
Packit 67cb25
assumed implicitly and does not have to be stored.
Packit 67cb25
Numbers smaller than 
Packit 67cb25
:math:`2^{E_{min}}`
Packit 67cb25
are be stored in *denormalized form* with a leading zero,
Packit 67cb25
Packit 67cb25
.. only:: not texinfo
Packit 67cb25
Packit 67cb25
   .. math:: (-1)^s (0 \cdot fffff\dots) 2^{E_{min}}
Packit 67cb25
Packit 67cb25
.. only:: texinfo
Packit 67cb25
Packit 67cb25
   ::
Packit 67cb25
Packit 67cb25
      (-1)^s (0.fffff...) 2^(E_min)
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: zero, IEEE format
Packit 67cb25
   single: infinity, IEEE format
Packit 67cb25
Packit 67cb25
This allows gradual underflow down to 
Packit 67cb25
:math:`2^{E_{min} - p}`
Packit 67cb25
for :math:`p` bits of precision. 
Packit 67cb25
A zero is encoded with the special exponent of 
Packit 67cb25
:math:`2^{E_{min}-1}`
Packit 67cb25
and infinities with the exponent of 
Packit 67cb25
:math:`2^{E_{max}+1}`.
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: single precision, IEEE format
Packit 67cb25
Packit 67cb25
The format for single precision numbers uses 32 bits divided in the
Packit 67cb25
following way::
Packit 67cb25
Packit 67cb25
  seeeeeeeefffffffffffffffffffffff
Packit 67cb25
    
Packit 67cb25
  s = sign bit, 1 bit
Packit 67cb25
  e = exponent, 8 bits  (E_min=-126, E_max=127, bias=127)
Packit 67cb25
  f = fraction, 23 bits
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: double precision, IEEE format
Packit 67cb25
Packit 67cb25
The format for double precision numbers uses 64 bits divided in the
Packit 67cb25
following way::
Packit 67cb25
Packit 67cb25
  seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
Packit 67cb25
Packit 67cb25
  s = sign bit, 1 bit
Packit 67cb25
  e = exponent, 11 bits  (E_min=-1022, E_max=1023, bias=1023)
Packit 67cb25
  f = fraction, 52 bits
Packit 67cb25
Packit 67cb25
It is often useful to be able to investigate the behavior of a
Packit 67cb25
calculation at the bit-level and the library provides functions for
Packit 67cb25
printing the IEEE representations in a human-readable form.
Packit 67cb25
Packit 67cb25
.. float vs double vs long double 
Packit 67cb25
.. (how many digits are available for each)
Packit 67cb25
Packit 67cb25
.. function:: void gsl_ieee_fprintf_float (FILE * stream, const float * x)
Packit 67cb25
              void gsl_ieee_fprintf_double (FILE * stream, const double * x)
Packit 67cb25
Packit 67cb25
   These functions output a formatted version of the IEEE floating-point
Packit 67cb25
   number pointed to by :data:`x` to the stream :data:`stream`. A pointer is
Packit 67cb25
   used to pass the number indirectly, to avoid any undesired promotion
Packit 67cb25
   from :code:`float` to :code:`double`.  The output takes one of the
Packit 67cb25
   following forms,
Packit 67cb25
Packit 67cb25
   :code:`NaN`
Packit 67cb25
Packit 67cb25
      the Not-a-Number symbol
Packit 67cb25
Packit 67cb25
   :code:`Inf, -Inf`
Packit 67cb25
Packit 67cb25
      positive or negative infinity
Packit 67cb25
Packit 67cb25
   :code:`1.fffff...*2^E, -1.fffff...*2^E`
Packit 67cb25
Packit 67cb25
      a normalized floating point number
Packit 67cb25
Packit 67cb25
   :code:`0.fffff...*2^E, -0.fffff...*2^E`
Packit 67cb25
Packit 67cb25
      a denormalized floating point number
Packit 67cb25
Packit 67cb25
   :code:`0, -0`
Packit 67cb25
Packit 67cb25
      positive or negative zero
Packit 67cb25
Packit 67cb25
   The output can be used directly in GNU Emacs Calc mode by preceding it
Packit 67cb25
   with :code:`2#` to indicate binary.
Packit 67cb25
Packit 67cb25
.. @item [non-standard IEEE float], [non-standard IEEE double]
Packit 67cb25
.. an unrecognized encoding
Packit 67cb25
Packit 67cb25
.. function:: void gsl_ieee_printf_float (const float * x)
Packit 67cb25
              void gsl_ieee_printf_double (const double * x)
Packit 67cb25
Packit 67cb25
   These functions output a formatted version of the IEEE floating-point
Packit 67cb25
   number pointed to by :data:`x` to the stream :code:`stdout`.
Packit 67cb25
Packit 67cb25
The following program demonstrates the use of the functions by printing
Packit 67cb25
the single and double precision representations of the fraction
Packit 67cb25
:math:`1/3`.  For comparison the representation of the value promoted from
Packit 67cb25
single to double precision is also printed.
Packit 67cb25
Packit 67cb25
.. include:: examples/ieee.c
Packit 67cb25
   :code:
Packit 67cb25
Packit 67cb25
The binary representation of :math:`1/3` is :math:`0.01010101...`.  The
Packit 67cb25
output below shows that the IEEE format normalizes this fraction to give
Packit 67cb25
a leading digit of 1::
Packit 67cb25
Packit 67cb25
   f= 1.01010101010101010101011*2^-2
Packit 67cb25
  fd= 1.0101010101010101010101100000000000000000000000000000*2^-2
Packit 67cb25
   d= 1.0101010101010101010101010101010101010101010101010101*2^-2
Packit 67cb25
Packit 67cb25
The output also shows that a single-precision number is promoted to
Packit 67cb25
double-precision by adding zeros in the binary representation.
Packit 67cb25
Packit 67cb25
.. importance of using 1.234L in long double calculations
Packit 67cb25
Packit 67cb25
.. @example
Packit 67cb25
.. int main (void)
Packit 67cb25
.. @{
Packit 67cb25
..   long double x = 1.0, y = 1.0;
Packit 67cb25
  
Packit 67cb25
..   x = x + 0.2;
Packit 67cb25
..   y = y + 0.2L;
Packit 67cb25
Packit 67cb25
..   printf(" d %.20Lf\n",x);
Packit 67cb25
..   printf("ld %.20Lf\n",y);
Packit 67cb25
Packit 67cb25
..   return 1;
Packit 67cb25
.. @}
Packit 67cb25
Packit 67cb25
..  d 1.20000000000000001110
Packit 67cb25
.. ld 1.20000000000000000004
Packit 67cb25
.. @end example
Packit 67cb25
Packit 67cb25
.. index::
Packit 67cb25
   single: IEEE exceptions
Packit 67cb25
   single: precision, IEEE arithmetic
Packit 67cb25
   single: rounding mode
Packit 67cb25
   single: arithmetic exceptions
Packit 67cb25
   single: exceptions, IEEE arithmetic
Packit 67cb25
   single: division by zero, IEEE exceptions
Packit 67cb25
   single: underflow, IEEE exceptions
Packit 67cb25
   single: overflow, IEEE exceptions
Packit 67cb25
Packit 67cb25
Setting up your IEEE environment
Packit 67cb25
================================
Packit 67cb25
Packit 67cb25
The IEEE standard defines several *modes* for controlling the
Packit 67cb25
behavior of floating point operations.  These modes specify the important
Packit 67cb25
properties of computer arithmetic: the direction used for rounding (e.g.
Packit 67cb25
whether numbers should be rounded up, down or to the nearest number),
Packit 67cb25
the rounding precision and how the program should handle arithmetic
Packit 67cb25
exceptions, such as division by zero.
Packit 67cb25
Packit 67cb25
Many of these features can now be controlled via standard functions such
Packit 67cb25
as :func:`fpsetround`, which should be used whenever they are available.
Packit 67cb25
Unfortunately in the past there has been no universal API for
Packit 67cb25
controlling their behavior---each system has had its own low-level way
Packit 67cb25
of accessing them.  To help you write portable programs GSL allows you
Packit 67cb25
to specify modes in a platform-independent way using the environment
Packit 67cb25
variable :macro:`GSL_IEEE_MODE`.  The library then takes care of all the
Packit 67cb25
necessary machine-specific initializations for you when you call the
Packit 67cb25
function :func:`gsl_ieee_env_setup`.
Packit 67cb25
Packit 67cb25
.. macro:: GSL_IEEE_MODE
Packit 67cb25
Packit 67cb25
   Environment variable which specifies IEEE mode.
Packit 67cb25
Packit 67cb25
.. function:: void gsl_ieee_env_setup ()
Packit 67cb25
Packit 67cb25
   This function reads the environment variable :macro:`GSL_IEEE_MODE` and
Packit 67cb25
   attempts to set up the corresponding specified IEEE modes.  The
Packit 67cb25
   environment variable should be a list of keywords, separated by
Packit 67cb25
   commas, like this::
Packit 67cb25
Packit 67cb25
      GSL_IEEE_MODE = "keyword, keyword, ..."
Packit 67cb25
Packit 67cb25
   where :data:`keyword` is one of the following mode-names::
Packit 67cb25
Packit 67cb25
     single-precision
Packit 67cb25
     double-precision
Packit 67cb25
     extended-precision
Packit 67cb25
     round-to-nearest
Packit 67cb25
     round-down
Packit 67cb25
     round-up
Packit 67cb25
     round-to-zero
Packit 67cb25
     mask-all
Packit 67cb25
     mask-invalid
Packit 67cb25
     mask-denormalized
Packit 67cb25
     mask-division-by-zero
Packit 67cb25
     mask-overflow
Packit 67cb25
     mask-underflow
Packit 67cb25
     trap-inexact
Packit 67cb25
     trap-common
Packit 67cb25
Packit 67cb25
   If :macro:`GSL_IEEE_MODE` is empty or undefined then the function returns
Packit 67cb25
   immediately and no attempt is made to change the system's IEEE
Packit 67cb25
   mode.  When the modes from :macro:`GSL_IEEE_MODE` are turned on the
Packit 67cb25
   function prints a short message showing the new settings to remind you
Packit 67cb25
   that the results of the program will be affected.
Packit 67cb25
Packit 67cb25
   If the requested modes are not supported by the platform being used then
Packit 67cb25
   the function calls the error handler and returns an error code of
Packit 67cb25
   :macro:`GSL_EUNSUP`.
Packit 67cb25
Packit 67cb25
   When options are specified using this method, the resulting mode is
Packit 67cb25
   based on a default setting of the highest available precision (double
Packit 67cb25
   precision or extended precision, depending on the platform) in
Packit 67cb25
   round-to-nearest mode, with all exceptions enabled apart from the
Packit 67cb25
   INEXACT exception.  The INEXACT exception is generated
Packit 67cb25
   whenever rounding occurs, so it must generally be disabled in typical
Packit 67cb25
   scientific calculations.  All other floating-point exceptions are
Packit 67cb25
   enabled by default, including underflows and the use of denormalized
Packit 67cb25
   numbers, for safety.  They can be disabled with the individual
Packit 67cb25
   :code:`mask-` settings or together using :code:`mask-all`.
Packit 67cb25
Packit 67cb25
   The following adjusted combination of modes is convenient for many
Packit 67cb25
   purposes::
Packit 67cb25
Packit 67cb25
      GSL_IEEE_MODE="double-precision,"\
Packit 67cb25
                      "mask-underflow,"\
Packit 67cb25
                        "mask-denormalized"
Packit 67cb25
Packit 67cb25
   This choice ignores any errors relating to small numbers (either
Packit 67cb25
   denormalized, or underflowing to zero) but traps overflows, division by
Packit 67cb25
   zero and invalid operations.
Packit 67cb25
Packit 67cb25
   Note that on the x86 series of processors this function sets both the
Packit 67cb25
   original x87 mode and the newer MXCSR mode, which controls SSE
Packit 67cb25
   floating-point operations.  The SSE floating-point units do not have a
Packit 67cb25
   precision-control bit, and always work in double-precision.  The
Packit 67cb25
   single-precision and extended-precision keywords have no effect in
Packit 67cb25
   this case.
Packit 67cb25
Packit 67cb25
To demonstrate the effects of different rounding modes consider the
Packit 67cb25
following program which computes :math:`e`, the base of natural
Packit 67cb25
logarithms, by summing a rapidly-decreasing series,
Packit 67cb25
Packit 67cb25
.. only:: not texinfo
Packit 67cb25
Packit 67cb25
   .. math::
Packit 67cb25
Packit 67cb25
      e &= {1 \over 0!} + {1 \over 1!} + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots \\
Packit 67cb25
        &= 2.71828182846...
Packit 67cb25
Packit 67cb25
.. only:: texinfo
Packit 67cb25
Packit 67cb25
   ::
Packit 67cb25
Packit 67cb25
      e = 1 + 1/2! + 1/3! + 1/4! + ... 
Packit 67cb25
        = 2.71828182846...
Packit 67cb25
Packit 67cb25
.. include:: examples/ieeeround.c
Packit 67cb25
   :code:
Packit 67cb25
Packit 67cb25
Here are the results of running the program in :code:`round-to-nearest`
Packit 67cb25
mode.  This is the IEEE default so it isn't really necessary to specify
Packit 67cb25
it here::
Packit 67cb25
Packit 67cb25
  $ GSL_IEEE_MODE="round-to-nearest" ./a.out 
Packit 67cb25
  i= 1 sum=1.000000000000000000 error=-1.71828
Packit 67cb25
  i= 2 sum=2.000000000000000000 error=-0.718282
Packit 67cb25
  ....
Packit 67cb25
  i=18 sum=2.718281828459045535 error=4.44089e-16
Packit 67cb25
  i=19 sum=2.718281828459045535 error=4.44089e-16
Packit 67cb25
Packit 67cb25
After nineteen terms the sum converges to within :math:`4 \times 10^{-16}`
Packit 67cb25
of the correct value.  
Packit 67cb25
If we now change the rounding mode to
Packit 67cb25
:code:`round-down` the final result is less accurate::
Packit 67cb25
Packit 67cb25
  $ GSL_IEEE_MODE="round-down" ./a.out 
Packit 67cb25
  i= 1 sum=1.000000000000000000 error=-1.71828
Packit 67cb25
  ....
Packit 67cb25
  i=19 sum=2.718281828459041094 error=-3.9968e-15
Packit 67cb25
Packit 67cb25
The result is about 
Packit 67cb25
:math:`4 \times 10^{-15}`
Packit 67cb25
below the correct value, an order of magnitude worse than the result
Packit 67cb25
obtained in the :code:`round-to-nearest` mode.
Packit 67cb25
Packit 67cb25
If we change to rounding mode to :code:`round-up` then the final result
Packit 67cb25
is higher than the correct value (when we add each term to the sum the
Packit 67cb25
final result is always rounded up, which increases the sum by at least
Packit 67cb25
one tick until the added term underflows to zero).  To avoid this
Packit 67cb25
problem we would need to use a safer converge criterion, such as
Packit 67cb25
:code:`while (fabs(sum - oldsum) > epsilon)`, with a suitably chosen
Packit 67cb25
value of epsilon.
Packit 67cb25
Packit 67cb25
Finally we can see the effect of computing the sum using
Packit 67cb25
single-precision rounding, in the default :code:`round-to-nearest`
Packit 67cb25
mode.  In this case the program thinks it is still using double precision
Packit 67cb25
numbers but the CPU rounds the result of each floating point operation
Packit 67cb25
to single-precision accuracy.  This simulates the effect of writing the
Packit 67cb25
program using single-precision :code:`float` variables instead of
Packit 67cb25
:code:`double` variables.  The iteration stops after about half the number
Packit 67cb25
of iterations and the final result is much less accurate::
Packit 67cb25
Packit 67cb25
  $ GSL_IEEE_MODE="single-precision" ./a.out 
Packit 67cb25
  ....
Packit 67cb25
  i=12 sum=2.718281984329223633 error=1.5587e-07
Packit 67cb25
Packit 67cb25
with an error of 
Packit 67cb25
:math:`O(10^{-7})`,
Packit 67cb25
which corresponds to single
Packit 67cb25
precision accuracy (about 1 part in :math:`10^7`).  Continuing the
Packit 67cb25
iterations further does not decrease the error because all the
Packit 67cb25
subsequent results are rounded to the same value.
Packit 67cb25
Packit 67cb25
References and Further Reading
Packit 67cb25
==============================
Packit 67cb25
Packit 67cb25
The reference for the IEEE standard is,
Packit 67cb25
Packit 67cb25
* ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
Packit 67cb25
Packit 67cb25
A more pedagogical introduction to the standard can be found in the
Packit 67cb25
following paper,
Packit 67cb25
Packit 67cb25
* David Goldberg: What Every Computer Scientist Should Know About
Packit 67cb25
  Floating-Point Arithmetic. *ACM Computing Surveys*, Vol.: 23, No.: 1
Packit 67cb25
  (March 1991), pages 5--48.
Packit 67cb25
Packit 67cb25
* Corrigendum: *ACM Computing Surveys*, Vol.: 23, No.: 3 (September
Packit 67cb25
  1991), page 413. and see also the sections by B. A. Wichmann and Charles
Packit 67cb25
  B. Dunham in Surveyor's Forum: "What Every Computer Scientist Should
Packit 67cb25
  Know About Floating-Point Arithmetic". *ACM Computing Surveys*,
Packit 67cb25
  Vol.: 24, No.: 3 (September 1992), page 319.
Packit 67cb25
Packit 67cb25
A detailed textbook on IEEE arithmetic and its practical use is
Packit 67cb25
available from SIAM Press,
Packit 67cb25
Packit 67cb25
* Michael L. Overton, *Numerical Computing with IEEE Floating Point Arithmetic*,
Packit 67cb25
  SIAM Press, ISBN 0898715717.