Tree - source-git/gsl - CentOS Git server

source-git / gsl

Blame doc/ieee754.rst

Blob History Raw

Packit	67cb25	`.. index:: IEEE floating point`
Packit	67cb25
Packit	67cb25	`.. _chap_ieee:`
Packit	67cb25
Packit	67cb25	`******************************`
Packit	67cb25	`IEEE floating-point arithmetic`
Packit	67cb25	`******************************`
Packit	67cb25
Packit	67cb25	`This chapter describes functions for examining the representation of`
Packit	67cb25	`floating point numbers and controlling the floating point environment of`
Packit	67cb25	`your program. The functions described in this chapter are declared in`
Packit	67cb25	the header file :file:`gsl_ieee_utils.h`.
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: IEEE format for floating point numbers`
Packit	67cb25	`single: bias, IEEE format`
Packit	67cb25	`single: exponent, IEEE format`
Packit	67cb25	`single: sign bit, IEEE format`
Packit	67cb25	`single: mantissa, IEEE format`
Packit	67cb25
Packit	67cb25	`Representation of floating point numbers`
Packit	67cb25	`========================================`
Packit	67cb25
Packit	67cb25	`The IEEE Standard for Binary Floating-Point Arithmetic defines binary`
Packit	67cb25	`formats for single and double precision numbers. Each number is composed`
Packit	67cb25	of three parts: a sign bit (:math:`s`), an exponent
Packit	67cb25	(:math:`E`) and a fraction (:math:`f`). The numerical value of the
Packit	67cb25	combination :math:`(s,E,f)` is given by the following formula,
Packit	67cb25
Packit	67cb25	`.. only:: not texinfo`
Packit	67cb25
Packit	67cb25	`.. math:: (-1)^s (1 \cdot fffff\dots) 2^E`
Packit	67cb25
Packit	67cb25	`.. only:: texinfo`
Packit	67cb25
Packit	67cb25	`::`
Packit	67cb25
Packit	67cb25	`(-1)^s (1.fffff...) 2^E`
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: normalized form, IEEE format`
Packit	67cb25	`single: denormalized form, IEEE format`
Packit	67cb25
Packit	67cb25	`The sign bit is either zero or one. The exponent ranges from a minimum value`
Packit	67cb25	:math:`E_{min}`
Packit	67cb25	`to a maximum value`
Packit	67cb25	:math:`E_{max}`
Packit	67cb25	`depending on the precision. The exponent is converted to an`
Packit	67cb25	`unsigned number`
Packit	67cb25	:math:`e`, known as the biased exponent, for storage by adding a
Packit	67cb25	`bias parameter,`
Packit	67cb25
Packit	67cb25	`.. only:: not texinfo`
Packit	67cb25
Packit	67cb25	`.. math:: e = E + \hbox{\it bias}`
Packit	67cb25
Packit	67cb25	`.. only:: texinfo`
Packit	67cb25
Packit	67cb25	`::`
Packit	67cb25
Packit	67cb25	`e = E + bias`
Packit	67cb25
Packit	67cb25	The sequence :math:`fffff...` represents the digits of the binary
Packit	67cb25	fraction :math:`f`. The binary digits are stored in *normalized
Packit	67cb25	form*, by adjusting the exponent to give a leading digit of :math:`1`.
Packit	67cb25	`Since the leading digit is always 1 for normalized numbers it is`
Packit	67cb25	`assumed implicitly and does not have to be stored.`
Packit	67cb25	`Numbers smaller than`
Packit	67cb25	:math:`2^{E_{min}}`
Packit	67cb25	`are be stored in denormalized form with a leading zero,`
Packit	67cb25
Packit	67cb25	`.. only:: not texinfo`
Packit	67cb25
Packit	67cb25	`.. math:: (-1)^s (0 \cdot fffff\dots) 2^{E_{min}}`
Packit	67cb25
Packit	67cb25	`.. only:: texinfo`
Packit	67cb25
Packit	67cb25	`::`
Packit	67cb25
Packit	67cb25	`(-1)^s (0.fffff...) 2^(E_min)`
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: zero, IEEE format`
Packit	67cb25	`single: infinity, IEEE format`
Packit	67cb25
Packit	67cb25	`This allows gradual underflow down to`
Packit	67cb25	:math:`2^{E_{min} - p}`
Packit	67cb25	for :math:`p` bits of precision.
Packit	67cb25	`A zero is encoded with the special exponent of`
Packit	67cb25	:math:`2^{E_{min}-1}`
Packit	67cb25	`and infinities with the exponent of`
Packit	67cb25	:math:`2^{E_{max}+1}`.
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: single precision, IEEE format`
Packit	67cb25
Packit	67cb25	`The format for single precision numbers uses 32 bits divided in the`
Packit	67cb25	`following way::`
Packit	67cb25
Packit	67cb25	`seeeeeeeefffffffffffffffffffffff`
Packit	67cb25
Packit	67cb25	`s = sign bit, 1 bit`
Packit	67cb25	`e = exponent, 8 bits (E_min=-126, E_max=127, bias=127)`
Packit	67cb25	`f = fraction, 23 bits`
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: double precision, IEEE format`
Packit	67cb25
Packit	67cb25	`The format for double precision numbers uses 64 bits divided in the`
Packit	67cb25	`following way::`
Packit	67cb25
Packit	67cb25	`seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff`
Packit	67cb25
Packit	67cb25	`s = sign bit, 1 bit`
Packit	67cb25	`e = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023)`
Packit	67cb25	`f = fraction, 52 bits`
Packit	67cb25
Packit	67cb25	`It is often useful to be able to investigate the behavior of a`
Packit	67cb25	`calculation at the bit-level and the library provides functions for`
Packit	67cb25	`printing the IEEE representations in a human-readable form.`
Packit	67cb25
Packit	67cb25	`.. float vs double vs long double`
Packit	67cb25	`.. (how many digits are available for each)`
Packit	67cb25
Packit	67cb25	`.. function:: void gsl_ieee_fprintf_float (FILE * stream, const float * x)`
Packit	67cb25	`void gsl_ieee_fprintf_double (FILE * stream, const double * x)`
Packit	67cb25
Packit	67cb25	`These functions output a formatted version of the IEEE floating-point`
Packit	67cb25	number pointed to by :data:`x` to the stream :data:`stream`. A pointer is
Packit	67cb25	`used to pass the number indirectly, to avoid any undesired promotion`
Packit	67cb25	from :code:`float` to :code:`double`. The output takes one of the
Packit	67cb25	`following forms,`
Packit	67cb25
Packit	67cb25	:code:`NaN`
Packit	67cb25
Packit	67cb25	`the Not-a-Number symbol`
Packit	67cb25
Packit	67cb25	:code:`Inf, -Inf`
Packit	67cb25
Packit	67cb25	`positive or negative infinity`
Packit	67cb25
Packit	67cb25	:code:`1.fffff...2^E, -1.fffff...2^E`
Packit	67cb25
Packit	67cb25	`a normalized floating point number`
Packit	67cb25
Packit	67cb25	:code:`0.fffff...2^E, -0.fffff...2^E`
Packit	67cb25
Packit	67cb25	`a denormalized floating point number`
Packit	67cb25
Packit	67cb25	:code:`0, -0`
Packit	67cb25
Packit	67cb25	`positive or negative zero`
Packit	67cb25
Packit	67cb25	`The output can be used directly in GNU Emacs Calc mode by preceding it`
Packit	67cb25	with :code:`2#` to indicate binary.
Packit	67cb25
Packit	67cb25	`.. @item [non-standard IEEE float], [non-standard IEEE double]`
Packit	67cb25	`.. an unrecognized encoding`
Packit	67cb25
Packit	67cb25	`.. function:: void gsl_ieee_printf_float (const float * x)`
Packit	67cb25	`void gsl_ieee_printf_double (const double * x)`
Packit	67cb25
Packit	67cb25	`These functions output a formatted version of the IEEE floating-point`
Packit	67cb25	number pointed to by :data:`x` to the stream :code:`stdout`.
Packit	67cb25
Packit	67cb25	`The following program demonstrates the use of the functions by printing`
Packit	67cb25	`the single and double precision representations of the fraction`
Packit	67cb25	:math:`1/3`. For comparison the representation of the value promoted from
Packit	67cb25	`single to double precision is also printed.`
Packit	67cb25
Packit	67cb25	`.. include:: examples/ieee.c`
Packit	67cb25	`:code:`
Packit	67cb25
Packit	67cb25	The binary representation of :math:`1/3` is :math:`0.01010101...`. The
Packit	67cb25	`output below shows that the IEEE format normalizes this fraction to give`
Packit	67cb25	`a leading digit of 1::`
Packit	67cb25
Packit	67cb25	`f= 1.01010101010101010101011*2^-2`
Packit	67cb25	`fd= 1.0101010101010101010101100000000000000000000000000000*2^-2`
Packit	67cb25	`d= 1.0101010101010101010101010101010101010101010101010101*2^-2`
Packit	67cb25
Packit	67cb25	`The output also shows that a single-precision number is promoted to`
Packit	67cb25	`double-precision by adding zeros in the binary representation.`
Packit	67cb25
Packit	67cb25	`.. importance of using 1.234L in long double calculations`
Packit	67cb25
Packit	67cb25	`.. @example`
Packit	67cb25	`.. int main (void)`
Packit	67cb25	`.. @{`
Packit	67cb25	`.. long double x = 1.0, y = 1.0;`
Packit	67cb25
Packit	67cb25	`.. x = x + 0.2;`
Packit	67cb25	`.. y = y + 0.2L;`
Packit	67cb25
Packit	67cb25	`.. printf(" d %.20Lf\n",x);`
Packit	67cb25	`.. printf("ld %.20Lf\n",y);`
Packit	67cb25
Packit	67cb25	`.. return 1;`
Packit	67cb25	`.. @}`
Packit	67cb25
Packit	67cb25	`.. d 1.20000000000000001110`
Packit	67cb25	`.. ld 1.20000000000000000004`
Packit	67cb25	`.. @end example`
Packit	67cb25
Packit	67cb25	`.. index::`
Packit	67cb25	`single: IEEE exceptions`
Packit	67cb25	`single: precision, IEEE arithmetic`
Packit	67cb25	`single: rounding mode`
Packit	67cb25	`single: arithmetic exceptions`
Packit	67cb25	`single: exceptions, IEEE arithmetic`
Packit	67cb25	`single: division by zero, IEEE exceptions`
Packit	67cb25	`single: underflow, IEEE exceptions`
Packit	67cb25	`single: overflow, IEEE exceptions`
Packit	67cb25
Packit	67cb25	`Setting up your IEEE environment`
Packit	67cb25	`================================`
Packit	67cb25
Packit	67cb25	`The IEEE standard defines several modes for controlling the`
Packit	67cb25	`behavior of floating point operations. These modes specify the important`
Packit	67cb25	`properties of computer arithmetic: the direction used for rounding (e.g.`
Packit	67cb25	`whether numbers should be rounded up, down or to the nearest number),`
Packit	67cb25	`the rounding precision and how the program should handle arithmetic`
Packit	67cb25	`exceptions, such as division by zero.`
Packit	67cb25
Packit	67cb25	`Many of these features can now be controlled via standard functions such`
Packit	67cb25	as :func:`fpsetround`, which should be used whenever they are available.
Packit	67cb25	`Unfortunately in the past there has been no universal API for`
Packit	67cb25	`controlling their behavior---each system has had its own low-level way`
Packit	67cb25	`of accessing them. To help you write portable programs GSL allows you`
Packit	67cb25	`to specify modes in a platform-independent way using the environment`
Packit	67cb25	variable :macro:`GSL_IEEE_MODE`. The library then takes care of all the
Packit	67cb25	`necessary machine-specific initializations for you when you call the`
Packit	67cb25	function :func:`gsl_ieee_env_setup`.
Packit	67cb25
Packit	67cb25	`.. macro:: GSL_IEEE_MODE`
Packit	67cb25
Packit	67cb25	`Environment variable which specifies IEEE mode.`
Packit	67cb25
Packit	67cb25	`.. function:: void gsl_ieee_env_setup ()`
Packit	67cb25
Packit	67cb25	This function reads the environment variable :macro:`GSL_IEEE_MODE` and
Packit	67cb25	`attempts to set up the corresponding specified IEEE modes. The`
Packit	67cb25	`environment variable should be a list of keywords, separated by`
Packit	67cb25	`commas, like this::`
Packit	67cb25
Packit	67cb25	`GSL_IEEE_MODE = "keyword, keyword, ..."`
Packit	67cb25
Packit	67cb25	where :data:`keyword` is one of the following mode-names::
Packit	67cb25
Packit	67cb25	`single-precision`
Packit	67cb25	`double-precision`
Packit	67cb25	`extended-precision`
Packit	67cb25	`round-to-nearest`
Packit	67cb25	`round-down`
Packit	67cb25	`round-up`
Packit	67cb25	`round-to-zero`
Packit	67cb25	`mask-all`
Packit	67cb25	`mask-invalid`
Packit	67cb25	`mask-denormalized`
Packit	67cb25	`mask-division-by-zero`
Packit	67cb25	`mask-overflow`
Packit	67cb25	`mask-underflow`
Packit	67cb25	`trap-inexact`
Packit	67cb25	`trap-common`
Packit	67cb25
Packit	67cb25	If :macro:`GSL_IEEE_MODE` is empty or undefined then the function returns
Packit	67cb25	`immediately and no attempt is made to change the system's IEEE`
Packit	67cb25	mode. When the modes from :macro:`GSL_IEEE_MODE` are turned on the
Packit	67cb25	`function prints a short message showing the new settings to remind you`
Packit	67cb25	`that the results of the program will be affected.`
Packit	67cb25
Packit	67cb25	`If the requested modes are not supported by the platform being used then`
Packit	67cb25	`the function calls the error handler and returns an error code of`
Packit	67cb25	:macro:`GSL_EUNSUP`.
Packit	67cb25
Packit	67cb25	`When options are specified using this method, the resulting mode is`
Packit	67cb25	`based on a default setting of the highest available precision (double`
Packit	67cb25	`precision or extended precision, depending on the platform) in`
Packit	67cb25	`round-to-nearest mode, with all exceptions enabled apart from the`
Packit	67cb25	`INEXACT exception. The INEXACT exception is generated`
Packit	67cb25	`whenever rounding occurs, so it must generally be disabled in typical`
Packit	67cb25	`scientific calculations. All other floating-point exceptions are`
Packit	67cb25	`enabled by default, including underflows and the use of denormalized`
Packit	67cb25	`numbers, for safety. They can be disabled with the individual`
Packit	67cb25	:code:`mask-` settings or together using :code:`mask-all`.
Packit	67cb25
Packit	67cb25	`The following adjusted combination of modes is convenient for many`
Packit	67cb25	`purposes::`
Packit	67cb25
Packit	67cb25	`GSL_IEEE_MODE="double-precision,"\`
Packit	67cb25	`"mask-underflow,"\`
Packit	67cb25	`"mask-denormalized"`
Packit	67cb25
Packit	67cb25	`This choice ignores any errors relating to small numbers (either`
Packit	67cb25	`denormalized, or underflowing to zero) but traps overflows, division by`
Packit	67cb25	`zero and invalid operations.`
Packit	67cb25
Packit	67cb25	`Note that on the x86 series of processors this function sets both the`
Packit	67cb25	`original x87 mode and the newer MXCSR mode, which controls SSE`
Packit	67cb25	`floating-point operations. The SSE floating-point units do not have a`
Packit	67cb25	`precision-control bit, and always work in double-precision. The`
Packit	67cb25	`single-precision and extended-precision keywords have no effect in`
Packit	67cb25	`this case.`
Packit	67cb25
Packit	67cb25	`To demonstrate the effects of different rounding modes consider the`
Packit	67cb25	following program which computes :math:`e`, the base of natural
Packit	67cb25	`logarithms, by summing a rapidly-decreasing series,`
Packit	67cb25
Packit	67cb25	`.. only:: not texinfo`
Packit	67cb25
Packit	67cb25	`.. math::`
Packit	67cb25
Packit	67cb25	`e &= {1 \over 0!} + {1 \over 1!} + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots \\`
Packit	67cb25	`&= 2.71828182846...`
Packit	67cb25
Packit	67cb25	`.. only:: texinfo`
Packit	67cb25
Packit	67cb25	`::`
Packit	67cb25
Packit	67cb25	`e = 1 + 1/2! + 1/3! + 1/4! + ...`
Packit	67cb25	`= 2.71828182846...`
Packit	67cb25
Packit	67cb25	`.. include:: examples/ieeeround.c`
Packit	67cb25	`:code:`
Packit	67cb25
Packit	67cb25	Here are the results of running the program in :code:`round-to-nearest`
Packit	67cb25	`mode. This is the IEEE default so it isn't really necessary to specify`
Packit	67cb25	`it here::`
Packit	67cb25
Packit	67cb25	`$ GSL_IEEE_MODE="round-to-nearest" ./a.out`
Packit	67cb25	`i= 1 sum=1.000000000000000000 error=-1.71828`
Packit	67cb25	`i= 2 sum=2.000000000000000000 error=-0.718282`
Packit	67cb25	`....`
Packit	67cb25	`i=18 sum=2.718281828459045535 error=4.44089e-16`
Packit	67cb25	`i=19 sum=2.718281828459045535 error=4.44089e-16`
Packit	67cb25
Packit	67cb25	After nineteen terms the sum converges to within :math:`4 \times 10^{-16}`
Packit	67cb25	`of the correct value.`
Packit	67cb25	`If we now change the rounding mode to`
Packit	67cb25	:code:`round-down` the final result is less accurate::
Packit	67cb25
Packit	67cb25	`$ GSL_IEEE_MODE="round-down" ./a.out`
Packit	67cb25	`i= 1 sum=1.000000000000000000 error=-1.71828`
Packit	67cb25	`....`
Packit	67cb25	`i=19 sum=2.718281828459041094 error=-3.9968e-15`
Packit	67cb25
Packit	67cb25	`The result is about`
Packit	67cb25	:math:`4 \times 10^{-15}`
Packit	67cb25	`below the correct value, an order of magnitude worse than the result`
Packit	67cb25	obtained in the :code:`round-to-nearest` mode.
Packit	67cb25
Packit	67cb25	If we change to rounding mode to :code:`round-up` then the final result
Packit	67cb25	`is higher than the correct value (when we add each term to the sum the`
Packit	67cb25	`final result is always rounded up, which increases the sum by at least`
Packit	67cb25	`one tick until the added term underflows to zero). To avoid this`
Packit	67cb25	`problem we would need to use a safer converge criterion, such as`
Packit	67cb25	:code:`while (fabs(sum - oldsum) > epsilon)`, with a suitably chosen
Packit	67cb25	`value of epsilon.`
Packit	67cb25
Packit	67cb25	`Finally we can see the effect of computing the sum using`
Packit	67cb25	single-precision rounding, in the default :code:`round-to-nearest`
Packit	67cb25	`mode. In this case the program thinks it is still using double precision`
Packit	67cb25	`numbers but the CPU rounds the result of each floating point operation`
Packit	67cb25	`to single-precision accuracy. This simulates the effect of writing the`
Packit	67cb25	program using single-precision :code:`float` variables instead of
Packit	67cb25	:code:`double` variables. The iteration stops after about half the number
Packit	67cb25	`of iterations and the final result is much less accurate::`
Packit	67cb25
Packit	67cb25	`$ GSL_IEEE_MODE="single-precision" ./a.out`
Packit	67cb25	`....`
Packit	67cb25	`i=12 sum=2.718281984329223633 error=1.5587e-07`
Packit	67cb25
Packit	67cb25	`with an error of`
Packit	67cb25	:math:`O(10^{-7})`,
Packit	67cb25	`which corresponds to single`
Packit	67cb25	precision accuracy (about 1 part in :math:`10^7`). Continuing the
Packit	67cb25	`iterations further does not decrease the error because all the`
Packit	67cb25	`subsequent results are rounded to the same value.`
Packit	67cb25
Packit	67cb25	`References and Further Reading`
Packit	67cb25	`==============================`
Packit	67cb25
Packit	67cb25	`The reference for the IEEE standard is,`
Packit	67cb25
Packit	67cb25	`* ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.`
Packit	67cb25
Packit	67cb25	`A more pedagogical introduction to the standard can be found in the`
Packit	67cb25	`following paper,`
Packit	67cb25
Packit	67cb25	`* David Goldberg: What Every Computer Scientist Should Know About`
Packit	67cb25	`Floating-Point Arithmetic. ACM Computing Surveys, Vol.: 23, No.: 1`
Packit	67cb25	`(March 1991), pages 5--48.`
Packit	67cb25
Packit	67cb25	`* Corrigendum: ACM Computing Surveys, Vol.: 23, No.: 3 (September`
Packit	67cb25	`1991), page 413. and see also the sections by B. A. Wichmann and Charles`
Packit	67cb25	`B. Dunham in Surveyor's Forum: "What Every Computer Scientist Should`
Packit	67cb25	`Know About Floating-Point Arithmetic". ACM Computing Surveys,`
Packit	67cb25	`Vol.: 24, No.: 3 (September 1992), page 319.`
Packit	67cb25
Packit	67cb25	`A detailed textbook on IEEE arithmetic and its practical use is`
Packit	67cb25	`available from SIAM Press,`
Packit	67cb25
Packit	67cb25	`* Michael L. Overton, Numerical Computing with IEEE Floating Point Arithmetic,`
Packit	67cb25	`SIAM Press, ISBN 0898715717.`

source-git / gsl

Source Code

Blame doc/ieee754.rst