|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: statistics
|
|
Packit |
67cb25 |
single: mean
|
|
Packit |
67cb25 |
single: standard deviation
|
|
Packit |
67cb25 |
single: variance
|
|
Packit |
67cb25 |
single: estimated standard deviation
|
|
Packit |
67cb25 |
single: estimated variance
|
|
Packit |
67cb25 |
single: t-test
|
|
Packit |
67cb25 |
single: range
|
|
Packit |
67cb25 |
single: min
|
|
Packit |
67cb25 |
single: max
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
**********
|
|
Packit |
67cb25 |
Statistics
|
|
Packit |
67cb25 |
**********
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This chapter describes the statistical functions in the library. The
|
|
Packit |
67cb25 |
basic statistical functions include routines to compute the mean,
|
|
Packit |
67cb25 |
variance and standard deviation. More advanced functions allow you to
|
|
Packit |
67cb25 |
calculate absolute deviations, skewness, and kurtosis as well as the
|
|
Packit |
67cb25 |
median and arbitrary percentiles. The algorithms use recurrence
|
|
Packit |
67cb25 |
relations to compute average quantities in a stable way, without large
|
|
Packit |
67cb25 |
intermediate values that might overflow.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The functions are available in versions for datasets in the standard
|
|
Packit |
67cb25 |
floating-point and integer types. The versions for double precision
|
|
Packit |
67cb25 |
floating-point data have the prefix :code:`gsl_stats` and are declared in
|
|
Packit |
67cb25 |
the header file :file:`gsl_statistics_double.h`. The versions for integer
|
|
Packit |
67cb25 |
data have the prefix :code:`gsl_stats_int` and are declared in the header
|
|
Packit |
67cb25 |
file :file:`gsl_statistics_int.h`. All the functions operate on C
|
|
Packit |
67cb25 |
arrays with a :code:`stride` parameter specifying the spacing between
|
|
Packit |
67cb25 |
elements.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Mean, Standard Deviation and Variance
|
|
Packit |
67cb25 |
=====================================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_mean (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the arithmetic mean of :data:`data`, a dataset of
|
|
Packit |
67cb25 |
length :data:`n` with stride :data:`stride`. The arithmetic mean, or
|
|
Packit |
67cb25 |
*sample mean*, is denoted by :math:`\Hat\mu` and defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: \Hat\mu = {1 \over N} \sum x_i
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`x_i` are the elements of the dataset :data:`data`. For
|
|
Packit |
67cb25 |
samples drawn from a gaussian distribution the variance of
|
|
Packit |
67cb25 |
:math:`\Hat\mu` is :math:`\sigma^2 / N`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_variance (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the estimated, or *sample*, variance of
|
|
Packit |
67cb25 |
:data:`data`, a dataset of length :data:`n` with stride :data:`stride`. The
|
|
Packit |
67cb25 |
estimated variance is denoted by :math:`\Hat\sigma^2` and is defined by,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\Hat\sigma}^2 = {1 \over (N-1)} \sum (x_i - {\Hat\mu})^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - \Hat\mu)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`x_i` are the elements of the dataset :data:`data`. Note that
|
|
Packit |
67cb25 |
the normalization factor of :math:`1/(N-1)` results from the derivation
|
|
Packit |
67cb25 |
of :math:`\Hat\sigma^2` as an unbiased estimator of the population
|
|
Packit |
67cb25 |
variance :math:`\sigma^2`. For samples drawn from a Gaussian distribution
|
|
Packit |
67cb25 |
the variance of :math:`\Hat\sigma^2` itself is :math:`2 \sigma^4 / N`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the mean via a call to :func:`gsl_stats_mean`. If
|
|
Packit |
67cb25 |
you have already computed the mean then you can pass it directly to
|
|
Packit |
67cb25 |
:func:`gsl_stats_variance_m`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_variance_m (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the sample variance of :data:`data` relative to the
|
|
Packit |
67cb25 |
given value of :data:`mean`. The function is computed with :math:`\Hat\mu`
|
|
Packit |
67cb25 |
replaced by the value of :data:`mean` that you supply,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\Hat\sigma}^2 = {1 \over (N-1)} \sum (x_i - mean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - mean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_sd (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
double gsl_stats_sd_m (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The standard deviation is defined as the square root of the variance.
|
|
Packit |
67cb25 |
These functions return the square root of the corresponding variance
|
|
Packit |
67cb25 |
functions above.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_tss (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
double gsl_stats_tss_m (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions return the total sum of squares (TSS) of :data:`data` about
|
|
Packit |
67cb25 |
the mean. For :func:`gsl_stats_tss_m` the user-supplied value of
|
|
Packit |
67cb25 |
:data:`mean` is used, and for :func:`gsl_stats_tss` it is computed using
|
|
Packit |
67cb25 |
:func:`gsl_stats_mean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\rm TSS} = \sum (x_i - mean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
TSS = \sum (x_i - mean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_variance_with_fixed_mean (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes an unbiased estimate of the variance of
|
|
Packit |
67cb25 |
:data:`data` when the population mean :data:`mean` of the underlying
|
|
Packit |
67cb25 |
distribution is known *a priori*. In this case the estimator for
|
|
Packit |
67cb25 |
the variance uses the factor :math:`1/N` and the sample mean
|
|
Packit |
67cb25 |
:math:`\Hat\mu` is replaced by the known population mean :math:`\mu`,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\Hat\sigma}^2 = {1 \over N} \sum (x_i - \mu)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = (1/N) \sum (x_i - \mu)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_sd_with_fixed_mean (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function calculates the standard deviation of :data:`data` for a
|
|
Packit |
67cb25 |
fixed population mean :data:`mean`. The result is the square root of the
|
|
Packit |
67cb25 |
corresponding variance function.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Absolute deviation
|
|
Packit |
67cb25 |
==================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_absdev (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the absolute deviation from the mean of
|
|
Packit |
67cb25 |
:data:`data`, a dataset of length :data:`n` with stride :data:`stride`. The
|
|
Packit |
67cb25 |
absolute deviation from the mean is defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: absdev = {1 \over N} \sum |x_i - {\Hat\mu}|
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
absdev = (1/N) \sum |x_i - \Hat\mu|
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`x_i` are the elements of the dataset :data:`data`. The
|
|
Packit |
67cb25 |
absolute deviation from the mean provides a more robust measure of the
|
|
Packit |
67cb25 |
width of a distribution than the variance. This function computes the
|
|
Packit |
67cb25 |
mean of :data:`data` via a call to :func:`gsl_stats_mean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_absdev_m (const double data[], size_t stride, size_t n, double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the absolute deviation of the dataset :data:`data`
|
|
Packit |
67cb25 |
relative to the given value of :data:`mean`,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: absdev = {1 \over N} \sum |x_i - mean|
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
absdev = (1/N) \sum |x_i - mean|
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function is useful if you have already computed the mean of
|
|
Packit |
67cb25 |
:data:`data` (and want to avoid recomputing it), or wish to calculate the
|
|
Packit |
67cb25 |
absolute deviation relative to another value (such as zero, or the
|
|
Packit |
67cb25 |
median).
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index:: skewness, kurtosis
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Higher moments (skewness and kurtosis)
|
|
Packit |
67cb25 |
======================================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_skew (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the skewness of :data:`data`, a dataset of length
|
|
Packit |
67cb25 |
:data:`n` with stride :data:`stride`. The skewness is defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
skew = {1 \over N} \sum
|
|
Packit |
67cb25 |
{\left( x_i - {\Hat\mu} \over {\Hat\sigma} \right)}^3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
skew = (1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`x_i` are the elements of the dataset :data:`data`. The skewness
|
|
Packit |
67cb25 |
measures the asymmetry of the tails of a distribution.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The function computes the mean and estimated standard deviation of
|
|
Packit |
67cb25 |
:data:`data` via calls to :func:`gsl_stats_mean` and :func:`gsl_stats_sd`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_skew_m_sd (const double data[], size_t stride, size_t n, double mean, double sd)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the skewness of the dataset :data:`data` using the
|
|
Packit |
67cb25 |
given values of the mean :data:`mean` and standard deviation :data:`sd`,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: skew = {1 \over N} \sum {\left( x_i - mean \over sd \right)}^3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
skew = (1/N) \sum ((x_i - mean)/sd)^3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions are useful if you have already computed the mean and
|
|
Packit |
67cb25 |
standard deviation of :data:`data` and want to avoid recomputing them.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_kurtosis (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the kurtosis of :data:`data`, a dataset of length
|
|
Packit |
67cb25 |
:data:`n` with stride :data:`stride`. The kurtosis is defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
kurtosis = \left( {1 \over N} \sum
|
|
Packit |
67cb25 |
{\left(x_i - {\Hat\mu} \over {\Hat\sigma} \right)}^4
|
|
Packit |
67cb25 |
\right)
|
|
Packit |
67cb25 |
- 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
kurtosis = ((1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^4) - 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The kurtosis measures how sharply peaked a distribution is, relative to
|
|
Packit |
67cb25 |
its width. The kurtosis is normalized to zero for a Gaussian
|
|
Packit |
67cb25 |
distribution.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_kurtosis_m_sd (const double data[], size_t stride, size_t n, double mean, double sd)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the kurtosis of the dataset :data:`data` using the
|
|
Packit |
67cb25 |
given values of the mean :data:`mean` and standard deviation :data:`sd`,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
kurtosis = {1 \over N}
|
|
Packit |
67cb25 |
\left( \sum {\left(x_i - mean \over sd \right)}^4 \right)
|
|
Packit |
67cb25 |
- 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function is useful if you have already computed the mean and
|
|
Packit |
67cb25 |
standard deviation of :data:`data` and want to avoid recomputing them.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Autocorrelation
|
|
Packit |
67cb25 |
===============
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_lag1_autocorrelation (const double data[], const size_t stride, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the lag-1 autocorrelation of the dataset :data:`data`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
a_1 = {\sum_{i = 2}^{n} (x_{i} - \Hat\mu) (x_{i-1} - \Hat\mu)
|
|
Packit |
67cb25 |
\over
|
|
Packit |
67cb25 |
\sum_{i = 1}^{n} (x_{i} - \Hat\mu) (x_{i} - \Hat\mu)}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
a_1 = {\sum_{i = 2}^{n} (x_{i} - \Hat\mu) (x_{i-1} - \Hat\mu)
|
|
Packit |
67cb25 |
\over
|
|
Packit |
67cb25 |
\sum_{i = 1}^{n} (x_{i} - \Hat\mu) (x_{i} - \Hat\mu)}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_lag1_autocorrelation_m (const double data[], const size_t stride, const size_t n, const double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the lag-1 autocorrelation of the dataset
|
|
Packit |
67cb25 |
:data:`data` using the given value of the mean :data:`mean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: covariance, of two datasets
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Covariance
|
|
Packit |
67cb25 |
==========
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_covariance (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the covariance of the datasets :data:`data1` and
|
|
Packit |
67cb25 |
:data:`data2` which must both be of the same length :data:`n`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: covar = {1 \over (n - 1)} \sum_{i = 1}^{n} (x_{i} - \Hat x) (y_{i} - \Hat y)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
covar = (1/(n - 1)) \sum_{i = 1}^{n} (x_i - \Hat x) (y_i - \Hat y)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_covariance_m (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n, const double mean1, const double mean2)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the covariance of the datasets :data:`data1` and
|
|
Packit |
67cb25 |
:data:`data2` using the given values of the means, :data:`mean1` and
|
|
Packit |
67cb25 |
:data:`mean2`. This is useful if you have already computed the means of
|
|
Packit |
67cb25 |
:data:`data1` and :data:`data2` and want to avoid recomputing them.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: correlation, of two datasets
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Correlation
|
|
Packit |
67cb25 |
===========
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_correlation (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function efficiently computes the Pearson correlation coefficient
|
|
Packit |
67cb25 |
between the datasets :data:`data1` and :data:`data2` which must both be of
|
|
Packit |
67cb25 |
the same length :data:`n`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
r = {cov(x, y) \over \Hat\sigma_x \Hat\sigma_y} =
|
|
Packit |
67cb25 |
{{1 \over n-1} \sum (x_i - \Hat x) (y_i - \Hat y)
|
|
Packit |
67cb25 |
\over
|
|
Packit |
67cb25 |
\sqrt{{1 \over n-1} \sum (x_i - {\Hat x})^2}
|
|
Packit |
67cb25 |
\sqrt{{1 \over n-1} \sum (y_i - {\Hat y})^2}
|
|
Packit |
67cb25 |
}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
r = cov(x, y) / (\Hat\sigma_x \Hat\sigma_y)
|
|
Packit |
67cb25 |
= {1/(n-1) \sum (x_i - \Hat x) (y_i - \Hat y)
|
|
Packit |
67cb25 |
\over
|
|
Packit |
67cb25 |
\sqrt{1/(n-1) \sum (x_i - \Hat x)^2} \sqrt{1/(n-1) \sum (y_i - \Hat y)^2}
|
|
Packit |
67cb25 |
}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_spearman (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n, double work[])
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the Spearman rank correlation coefficient between
|
|
Packit |
67cb25 |
the datasets :data:`data1` and :data:`data2` which must both be of the same
|
|
Packit |
67cb25 |
length :data:`n`. Additional workspace of size 2 * :data:`n` is required in
|
|
Packit |
67cb25 |
:data:`work`. The Spearman rank correlation between vectors :math:`x` and
|
|
Packit |
67cb25 |
:math:`y` is equivalent to the Pearson correlation between the ranked
|
|
Packit |
67cb25 |
vectors :math:`x_R` and :math:`y_R`, where ranks are defined to be the
|
|
Packit |
67cb25 |
average of the positions of an element in the ascending order of the values.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Weighted Samples
|
|
Packit |
67cb25 |
================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The functions described in this section allow the computation of
|
|
Packit |
67cb25 |
statistics for weighted samples. The functions accept an array of
|
|
Packit |
67cb25 |
samples, :math:`x_i`, with associated weights, :math:`w_i`. Each sample
|
|
Packit |
67cb25 |
:math:`x_i` is considered as having been drawn from a Gaussian
|
|
Packit |
67cb25 |
distribution with variance :math:`\sigma_i^2`. The sample weight
|
|
Packit |
67cb25 |
:math:`w_i` is defined as the reciprocal of this variance, :math:`w_i = 1/\sigma_i^2`.
|
|
Packit |
67cb25 |
Setting a weight to zero corresponds to removing a sample from a dataset.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wmean (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the weighted mean of the dataset :data:`data` with
|
|
Packit |
67cb25 |
stride :data:`stride` and length :data:`n`, using the set of weights :data:`w`
|
|
Packit |
67cb25 |
with stride :data:`wstride` and length :data:`n`. The weighted mean is defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\Hat\mu} = {{\sum w_i x_i} \over {\sum w_i}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\mu = (\sum w_i x_i) / (\sum w_i)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wvariance (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the estimated variance of the dataset :data:`data`
|
|
Packit |
67cb25 |
with stride :data:`stride` and length :data:`n`, using the set of weights
|
|
Packit |
67cb25 |
:data:`w` with stride :data:`wstride` and length :data:`n`. The estimated
|
|
Packit |
67cb25 |
variance of a weighted dataset is calculated as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = {{\sum w_i} \over {(\sum w_i)^2 - \sum (w_i^2)}}
|
|
Packit |
67cb25 |
\sum w_i (x_i - \Hat\mu)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2)))
|
|
Packit |
67cb25 |
\sum w_i (x_i - \Hat\mu)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Note that this expression reduces to an unweighted variance with the
|
|
Packit |
67cb25 |
familiar :math:`1/(N-1)` factor when there are :math:`N` equal non-zero
|
|
Packit |
67cb25 |
weights.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wvariance_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the estimated variance of the weighted dataset
|
|
Packit |
67cb25 |
:data:`data` using the given weighted mean :data:`wmean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wsd (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The standard deviation is defined as the square root of the variance.
|
|
Packit |
67cb25 |
This function returns the square root of the corresponding variance
|
|
Packit |
67cb25 |
function :func:`gsl_stats_wvariance` above.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wsd_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the square root of the corresponding variance
|
|
Packit |
67cb25 |
function :func:`gsl_stats_wvariance_m` above.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wvariance_with_fixed_mean (const double w[], size_t wstride, const double data[], size_t stride, size_t n, const double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes an unbiased estimate of the variance of the weighted
|
|
Packit |
67cb25 |
dataset :data:`data` when the population mean :data:`mean` of the underlying
|
|
Packit |
67cb25 |
distribution is known *a priori*. In this case the estimator for
|
|
Packit |
67cb25 |
the variance replaces the sample mean :math:`\Hat\mu` by the known
|
|
Packit |
67cb25 |
population mean :math:`\mu`,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: \Hat\sigma^2 = {{\sum w_i (x_i - \mu)^2} \over {\sum w_i}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
\Hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wsd_with_fixed_mean (const double w[], size_t wstride, const double data[], size_t stride, size_t n, const double mean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The standard deviation is defined as the square root of the variance.
|
|
Packit |
67cb25 |
This function returns the square root of the corresponding variance
|
|
Packit |
67cb25 |
function above.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wtss (const double w[], const size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
double gsl_stats_wtss_m (const double w[], const size_t wstride, const double data[], size_t stride, size_t n, double wmean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions return the weighted total sum of squares (TSS) of
|
|
Packit |
67cb25 |
:data:`data` about the weighted mean. For :func:`gsl_stats_wtss_m` the
|
|
Packit |
67cb25 |
user-supplied value of :data:`wmean` is used, and for :func:`gsl_stats_wtss`
|
|
Packit |
67cb25 |
it is computed using :func:`gsl_stats_wmean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: {\rm TSS} = \sum w_i (x_i - wmean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
TSS = \sum w_i (x_i - wmean)^2
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wabsdev (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the weighted absolute deviation from the weighted
|
|
Packit |
67cb25 |
mean of :data:`data`. The absolute deviation from the mean is defined as,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: absdev = {{\sum w_i |x_i - \Hat\mu|} \over {\sum w_i}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
absdev = (\sum w_i |x_i - \Hat\mu|) / (\sum w_i)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wabsdev_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the absolute deviation of the weighted dataset
|
|
Packit |
67cb25 |
:data:`data` about the given weighted mean :data:`wmean`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wskew (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the weighted skewness of the dataset :data:`data`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: skew = {{\sum w_i ((x_i - {\Hat x})/{\Hat \sigma})^3} \over {\sum w_i}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
skew = (\sum w_i ((x_i - \Hat x)/\Hat \sigma)^3) / (\sum w_i)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wskew_m_sd (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean, double wsd)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the weighted skewness of the dataset :data:`data`
|
|
Packit |
67cb25 |
using the given values of the weighted mean and weighted standard
|
|
Packit |
67cb25 |
deviation, :data:`wmean` and :data:`wsd`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wkurtosis (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the weighted kurtosis of the dataset :data:`data`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: kurtosis = {{\sum w_i ((x_i - {\Hat x})/{\Hat \sigma})^4} \over {\sum w_i}} - 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
kurtosis = ((\sum w_i ((x_i - \Hat x)/\Hat \sigma)^4) / (\sum w_i)) - 3
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_wkurtosis_m_sd (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean, double wsd)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function computes the weighted kurtosis of the dataset :data:`data`
|
|
Packit |
67cb25 |
using the given values of the weighted mean and weighted standard
|
|
Packit |
67cb25 |
deviation, :data:`wmean` and :data:`wsd`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Maximum and Minimum values
|
|
Packit |
67cb25 |
==========================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The following functions find the maximum and minimum values of a
|
|
Packit |
67cb25 |
dataset (or their indices). If the data contains :code:`NaN`-s then a
|
|
Packit |
67cb25 |
:code:`NaN` will be returned, since the maximum or minimum value is
|
|
Packit |
67cb25 |
undefined. For functions which return an index, the location of the
|
|
Packit |
67cb25 |
first :code:`NaN` in the array is returned.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_max (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the maximum value in :data:`data`, a dataset of
|
|
Packit |
67cb25 |
length :data:`n` with stride :data:`stride`. The maximum value is defined
|
|
Packit |
67cb25 |
as the value of the element :math:`x_i` which satisfies :math:`x_i \ge x_j`
|
|
Packit |
67cb25 |
for all :math:`j`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
If you want instead to find the element with the largest absolute
|
|
Packit |
67cb25 |
magnitude you will need to apply :func:`fabs` or :func:`abs` to your data
|
|
Packit |
67cb25 |
before calling this function.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_min (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the minimum value in :data:`data`, a dataset of
|
|
Packit |
67cb25 |
length :data:`n` with stride :data:`stride`. The minimum value is defined
|
|
Packit |
67cb25 |
as the value of the element :math:`x_i` which satisfies :math:`x_i \le x_j`
|
|
Packit |
67cb25 |
for all :math:`j`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
If you want instead to find the element with the smallest absolute
|
|
Packit |
67cb25 |
magnitude you will need to apply :func:`fabs` or :func:`abs` to your data
|
|
Packit |
67cb25 |
before calling this function.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: void gsl_stats_minmax (double * min, double * max, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function finds both the minimum and maximum values :data:`min`,
|
|
Packit |
67cb25 |
:data:`max` in :data:`data` in a single pass.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: size_t gsl_stats_max_index (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the index of the maximum value in :data:`data`, a
|
|
Packit |
67cb25 |
dataset of length :data:`n` with stride :data:`stride`. The maximum value is
|
|
Packit |
67cb25 |
defined as the value of the element :math:`x_i` which satisfies
|
|
Packit |
67cb25 |
:math:`x_i \ge x_j`
|
|
Packit |
67cb25 |
for all :math:`j`. When there are several equal maximum
|
|
Packit |
67cb25 |
elements then the first one is chosen.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: size_t gsl_stats_min_index (const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the index of the minimum value in :data:`data`, a
|
|
Packit |
67cb25 |
dataset of length :data:`n` with stride :data:`stride`. The minimum value
|
|
Packit |
67cb25 |
is defined as the value of the element :math:`x_i` which satisfies
|
|
Packit |
67cb25 |
:math:`x_i \ge x_j`
|
|
Packit |
67cb25 |
for all :math:`j`. When there are several equal
|
|
Packit |
67cb25 |
minimum elements then the first one is chosen.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: void gsl_stats_minmax_index (size_t * min_index, size_t * max_index, const double data[], size_t stride, size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the indexes :data:`min_index`, :data:`max_index` of
|
|
Packit |
67cb25 |
the minimum and maximum values in :data:`data` in a single pass.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Median and Percentiles
|
|
Packit |
67cb25 |
======================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The median and percentile functions described in this section operate on
|
|
Packit |
67cb25 |
sorted data in :math:`O(1)` time. There is also a routine for computing
|
|
Packit |
67cb25 |
the median of an unsorted input array in average :math:`O(n)` time using
|
|
Packit |
67cb25 |
the quickselect algorithm. For convenience we use *quantiles*, measured on a scale
|
|
Packit |
67cb25 |
of 0 to 1, instead of percentiles (which use a scale of 0 to 100).
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_median_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the median value of :data:`sorted_data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The elements of the array
|
|
Packit |
67cb25 |
must be in ascending numerical order. There are no checks to see
|
|
Packit |
67cb25 |
whether the data are sorted, so the function :func:`gsl_sort` should
|
|
Packit |
67cb25 |
always be used first.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
When the dataset has an odd number of elements the median is the value
|
|
Packit |
67cb25 |
of element :math:`(n-1)/2`. When the dataset has an even number of
|
|
Packit |
67cb25 |
elements the median is the mean of the two nearest middle values,
|
|
Packit |
67cb25 |
elements :math:`(n-1)/2` and :math:`n/2`. Since the algorithm for
|
|
Packit |
67cb25 |
computing the median involves interpolation this function always returns
|
|
Packit |
67cb25 |
a floating-point number, even for integer data types.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_median (double data[], const size_t stride, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the median value of :data:`data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The median is found
|
|
Packit |
67cb25 |
using the quickselect algorithm. The input array does not need to be
|
|
Packit |
67cb25 |
sorted, but note that the algorithm rearranges the array and so the input
|
|
Packit |
67cb25 |
is not preserved on output.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_quantile_from_sorted_data (const double sorted_data[], size_t stride, size_t n, double f)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns a quantile value of :data:`sorted_data`, a
|
|
Packit |
67cb25 |
double-precision array of length :data:`n` with stride :data:`stride`. The
|
|
Packit |
67cb25 |
elements of the array must be in ascending numerical order. The
|
|
Packit |
67cb25 |
quantile is determined by the :data:`f`, a fraction between 0 and 1. For
|
|
Packit |
67cb25 |
example, to compute the value of the 75th percentile :data:`f` should have
|
|
Packit |
67cb25 |
the value 0.75.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
There are no checks to see whether the data are sorted, so the function
|
|
Packit |
67cb25 |
:func:`gsl_sort` should always be used first.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The quantile is found by interpolation, using the formula
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: \hbox{quantile} = (1 - \delta) x_i + \delta x_{i+1}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
quantile = (1 - \delta) x_i + \delta x_{i+1}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`i` is :code:`floor((n - 1)f)` and :math:`\delta` is
|
|
Packit |
67cb25 |
:math:`(n-1)f - i`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Thus the minimum value of the array (:code:`data[0*stride]`) is given by
|
|
Packit |
67cb25 |
:data:`f` equal to zero, the maximum value (:code:`data[(n-1)*stride]`) is
|
|
Packit |
67cb25 |
given by :data:`f` equal to one and the median value is given by :data:`f`
|
|
Packit |
67cb25 |
equal to 0.5. Since the algorithm for computing quantiles involves
|
|
Packit |
67cb25 |
interpolation this function always returns a floating-point number, even
|
|
Packit |
67cb25 |
for integer data types.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. @node Statistical tests
|
|
Packit |
67cb25 |
.. @section Statistical tests
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. FIXME, do more work on the statistical tests
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. -@deftypefun double gsl_stats_ttest (const double data1[], double data2[], size_t n1, size_t n2)
|
|
Packit |
67cb25 |
.. -@deftypefunx Statistics double gsl_stats_int_ttest (const double data1[], double data2[], size_t n1, size_t n2)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. The function :func:`gsl_stats_ttest` computes the t-test statistic for
|
|
Packit |
67cb25 |
.. the two arrays :data:`data1`[] and :data:`data2`[], of lengths :data:`n1` and
|
|
Packit |
67cb25 |
.. -:data:`n2` respectively.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. The t-test statistic measures the difference between the means of two
|
|
Packit |
67cb25 |
.. datasets.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Order Statistics
|
|
Packit |
67cb25 |
================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The :math:`k`-th *order statistic* of a sample is equal to its :math:`k`-th smallest value.
|
|
Packit |
67cb25 |
The :math:`k`-th order statistic of a set of :math:`n` values :math:`x = \left\{ x_i \right\}, 1 \le i \le n` is
|
|
Packit |
67cb25 |
denoted :math:`x_{(k)}`. The median of the set :math:`x` is equal to :math:`x_{\left( \frac{n}{2} \right)}` if
|
|
Packit |
67cb25 |
:math:`n` is odd, or the average of :math:`x_{\left( \frac{n}{2} \right)}` and :math:`x_{\left( \frac{n}{2} + 1 \right)}`
|
|
Packit |
67cb25 |
if :math:`n` is even. The :math:`k`-th smallest element of a length :math:`n` vector can be found
|
|
Packit |
67cb25 |
in average :math:`O(n)` time using the quickselect algorithm.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: gsl_stats_select(double data[], const size_t stride, const size_t n, const size_t k)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function finds the :data:`k`-th smallest element of the input array :data:`data`
|
|
Packit |
67cb25 |
of length :data:`n` and stride :data:`stride` using the quickselect method. The
|
|
Packit |
67cb25 |
algorithm rearranges the elements of :data:`data` and so the input array is not preserved
|
|
Packit |
67cb25 |
on output.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: robust location estimators
|
|
Packit |
67cb25 |
single: location estimation
|
|
Packit |
67cb25 |
single: estimation, location
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Robust Location Estimates
|
|
Packit |
67cb25 |
=========================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
A *location estimate* refers to a typical or central value which best describes a given
|
|
Packit |
67cb25 |
dataset. The mean and median are both examples of location estimators. However, the
|
|
Packit |
67cb25 |
mean has a severe sensitivity to data outliers and can give erroneous values when
|
|
Packit |
67cb25 |
even a small number of outliers are present. The median on the other hand, has
|
|
Packit |
67cb25 |
a strong insensitivity to data outliers, but due to its non-smoothness it can
|
|
Packit |
67cb25 |
behave unexpectedly in certain situations. GSL offers the following alternative
|
|
Packit |
67cb25 |
location estimators, which are robust to the presence of outliers.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: trimmed mean
|
|
Packit |
67cb25 |
single: truncated mean
|
|
Packit |
67cb25 |
single: mean, trimmed
|
|
Packit |
67cb25 |
single: mean, truncated
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Trimmed Mean
|
|
Packit |
67cb25 |
------------
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The trimmed mean, or *truncated mean*, discards a certain number of smallest and largest
|
|
Packit |
67cb25 |
samples from the input vector before computing the mean of the remaining samples. The
|
|
Packit |
67cb25 |
amount of trimming is specified by a factor :math:`\alpha \in [0,0.5]`. Then the
|
|
Packit |
67cb25 |
number of samples discarded from both ends of the input vector is
|
|
Packit |
67cb25 |
:math:`\left\lfloor \alpha n \right\rfloor`, where :math:`n` is the length of the input.
|
|
Packit |
67cb25 |
So to discard 25% of the samples from each end, one would set :math:`\alpha = 0.25`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_trmean_from_sorted_data (const double alpha, const double sorted_data[], const size_t stride, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the trimmed mean of :data:`sorted_data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The elements of the array
|
|
Packit |
67cb25 |
must be in ascending numerical order. There are no checks to see
|
|
Packit |
67cb25 |
whether the data are sorted, so the function :func:`gsl_sort` should
|
|
Packit |
67cb25 |
always be used first. The trimming factor :math:`\alpha` is given in :data:`alpha`.
|
|
Packit |
67cb25 |
If :math:`\alpha \ge 0.5`, then the median of the input is returned.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: Gastwirth estimator
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Gastwirth Estimator
|
|
Packit |
67cb25 |
-------------------
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Gastwirth's location estimator is a weighted sum of three order statistics,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: gastwirth = 0.3 \times Q_{\frac{1}{3}} + 0.4 \times Q_{\frac{1}{2}} + 0.3 \times Q_{\frac{2}{3}}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
gastwirth = 0.3 * Q_{1/3} + 0.4 * Q_{1/2} + 0.3 * Q_{2/3}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
where :math:`Q_{\frac{1}{3}}` is the one-third quantile, :math:`Q_{\frac{1}{2}}` is the one-half
|
|
Packit |
67cb25 |
quantile (i.e. median), and :math:`Q_{\frac{2}{3}}` is the two-thirds quantile.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_gastwirth_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This function returns the Gastwirth location estimator of :data:`sorted_data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The elements of the array
|
|
Packit |
67cb25 |
must be in ascending numerical order. There are no checks to see
|
|
Packit |
67cb25 |
whether the data are sorted, so the function :func:`gsl_sort` should
|
|
Packit |
67cb25 |
always be used first.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: robust scale estimators
|
|
Packit |
67cb25 |
single: scale estimation
|
|
Packit |
67cb25 |
single: estimation, scale
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Robust Scale Estimates
|
|
Packit |
67cb25 |
======================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
A *robust scale estimate*, also known as a robust measure of scale, attempts to quantify
|
|
Packit |
67cb25 |
the statistical dispersion (variability, scatter, spread) in a set of data which may contain outliers.
|
|
Packit |
67cb25 |
In such datasets, the usual variance or standard deviation scale estimate can be rendered useless
|
|
Packit |
67cb25 |
by even a single outlier.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: median absolute deviation
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. _sec_mad-statistic:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Median Absolute Deviation (MAD)
|
|
Packit |
67cb25 |
-------------------------------
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The median absolute deviation (MAD) is defined as
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: MAD = 1.4826 \times \textrm{median} \left\{ \left| x_i - \textrm{median} \left( x \right) \right| \right\}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
MAD = 1.4826 median { | x_i - median(x) | }
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
In words, first the median of all samples is computed. Then the median
|
|
Packit |
67cb25 |
is subtracted from all samples in the input to find the deviation of each sample
|
|
Packit |
67cb25 |
from the median. The median of all absolute deviations is then the MAD.
|
|
Packit |
67cb25 |
The factor :math:`1.4826` makes the MAD an unbiased estimator of the standard deviation for Gaussian data.
|
|
Packit |
67cb25 |
The median absolute deviation has an asymptotic efficiency of 37%.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_mad0 (const double data[], const size_t stride, const size_t n, double work[])
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_mad (const double data[], const size_t stride, const size_t n, double work[])
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions return the median absolute deviation of :data:`data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` and stride :data:`stride`.
|
|
Packit |
67cb25 |
The :code:`mad0` function calculates
|
|
Packit |
67cb25 |
:math:`\textrm{median} \left\{ \left| x_i - \textrm{median} \left( x \right) \right| \right\}`
|
|
Packit |
67cb25 |
(i.e. the :math:`MAD` statistic without the bias correction scale factor).
|
|
Packit |
67cb25 |
These functions require additional workspace of size :code:`n` provided in :data:`work`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: Sn statistic
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. _sec_Sn-statistic:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:math:`S_n` Statistic
|
|
Packit |
67cb25 |
---------------------
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The :math:`S_n` statistic developed by Croux and Rousseeuw is defined as
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: S_n = 1.1926 \times c_n \times \textrm{median}_i \left\{ \textrm{median}_j \left( \left| x_i - x_j \right| \right) \right\}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
S_n = 1.1926 * c_n * median_i { median_j ( | x_i - x_j | ) }
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
For each sample :math:`x_i, 1 \le i \le n`, the median of the values :math:`\left| x_i - x_j \right|` is computed for all
|
|
Packit |
67cb25 |
:math:`x_j, 1 \le j \le n`. This yields :math:`n` values, whose median then gives the final :math:`S_n`.
|
|
Packit |
67cb25 |
The factor :math:`1.1926` makes :math:`S_n` an unbiased estimate of the standard deviation for Gaussian data.
|
|
Packit |
67cb25 |
The factor :math:`c_n` is a correction factor to correct bias in small sample sizes. :math:`S_n` has an asymptotic
|
|
Packit |
67cb25 |
efficiency of 58%.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_Sn0_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n, double work[])
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_Sn_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n, double work[])
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions return the :math:`S_n` statistic of :data:`sorted_data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The elements of the array
|
|
Packit |
67cb25 |
must be in ascending numerical order. There are no checks to see
|
|
Packit |
67cb25 |
whether the data are sorted, so the function :func:`gsl_sort` should
|
|
Packit |
67cb25 |
always be used first. The :code:`Sn0` function calculates
|
|
Packit |
67cb25 |
:math:`\textrm{median}_i \left\{ \textrm{median}_j \left( \left| x_i - x_j \right| \right) \right\}`
|
|
Packit |
67cb25 |
(i.e. the :math:`S_n` statistic without the bias correction scale factors).
|
|
Packit |
67cb25 |
These functions require additional workspace of size
|
|
Packit |
67cb25 |
:code:`n` provided in :data:`work`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. index::
|
|
Packit |
67cb25 |
single: Qn statistic
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. _sec_Qn-statistic:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:math:`Q_n` Statistic
|
|
Packit |
67cb25 |
---------------------
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The :math:`Q_n` statistic developed by Croux and Rousseeuw is defined as
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: Q_n = 2.21914 \times d_n \times \left\{ \left| x_i - x_j \right|, i < j \right\}_{(k)}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Q_n = 2.21914 * d_n * { | x_i - x_j |, i < j }_{(k)}
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The factor :math:`2.21914` makes :math:`Q_n` an unbiased estimate of the standard deviation for Gaussian data.
|
|
Packit |
67cb25 |
The factor :math:`d_n` is a correction factor to correct bias in small sample sizes. The order statistic
|
|
Packit |
67cb25 |
is
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: not texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. math:: k = \left(
|
|
Packit |
67cb25 |
\begin{array}{c}
|
|
Packit |
67cb25 |
\left\lfloor \frac{n}{2} \right\rfloor + 1 \\
|
|
Packit |
67cb25 |
2
|
|
Packit |
67cb25 |
\end{array}
|
|
Packit |
67cb25 |
\right)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. only:: texinfo
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
::
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
k = ( floor(n/2) + 1 )
|
|
Packit |
67cb25 |
( 2 )
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
:math:`Q_n` has an asymptotic efficiency of 82%.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. function:: double gsl_stats_Qn0_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n, double work[], int work_int[])
|
|
Packit |
67cb25 |
double gsl_stats_Qn_from_sorted_data (const double sorted_data[], const size_t stride, const size_t n, double work[], int work_int[])
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
These functions return the :math:`Q_n` statistic of :data:`sorted_data`, a dataset
|
|
Packit |
67cb25 |
of length :data:`n` with stride :data:`stride`. The elements of the array
|
|
Packit |
67cb25 |
must be in ascending numerical order. There are no checks to see
|
|
Packit |
67cb25 |
whether the data are sorted, so the function :func:`gsl_sort` should
|
|
Packit |
67cb25 |
always be used first. The :code:`Qn0` function calculates
|
|
Packit |
67cb25 |
:math:`\left\{ \left| x_i - x_j \right|, i < j \right\}_{(k)}`
|
|
Packit |
67cb25 |
(i.e. :math:`Q_n` without the bias correction scale factors).
|
|
Packit |
67cb25 |
These functions require additional workspace of size
|
|
Packit |
67cb25 |
:code:`3n` provided in :data:`work` and integer workspace of size :code:`5n`
|
|
Packit |
67cb25 |
provided in :data:`work_int`.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Examples
|
|
Packit |
67cb25 |
========
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Here is a basic example of how to use the statistical functions:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/stat.c
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The program should produce the following output,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/stat.txt
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Here is an example using sorted data,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/statsort.c
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
This program should produce the following output,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
.. include:: examples/statsort.txt
|
|
Packit |
67cb25 |
:code:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
References and Further Reading
|
|
Packit |
67cb25 |
==============================
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The standard reference for almost any topic in statistics is the
|
|
Packit |
67cb25 |
multi-volume *Advanced Theory of Statistics* by Kendall and Stuart.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Maurice Kendall, Alan Stuart, and J. Keith Ord.
|
|
Packit |
67cb25 |
*The Advanced Theory of Statistics* (multiple volumes)
|
|
Packit |
67cb25 |
reprinted as *Kendall's Advanced Theory of Statistics*.
|
|
Packit |
67cb25 |
Wiley, ISBN 047023380X.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Many statistical concepts can be more easily understood by a Bayesian
|
|
Packit |
67cb25 |
approach. The following book by Gelman, Carlin, Stern and Rubin gives a
|
|
Packit |
67cb25 |
comprehensive coverage of the subject.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin.
|
|
Packit |
67cb25 |
*Bayesian Data Analysis*.
|
|
Packit |
67cb25 |
Chapman & Hall, ISBN 0412039915.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
For physicists the Particle Data Group provides useful reviews of
|
|
Packit |
67cb25 |
Probability and Statistics in the "Mathematical Tools" section of its
|
|
Packit |
67cb25 |
Annual Review of Particle Physics.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* *Review of Particle Properties*,
|
|
Packit |
67cb25 |
R.M. Barnett et al., Physical Review D54, 1 (1996)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The Review of Particle Physics is available online at
|
|
Packit |
67cb25 |
the website http://pdg.lbl.gov/.
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
The following papers describe robust scale estimation,
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* C. Croux and P. J. Rousseeuw, *Time-Efficient algorithms for two highly robust
|
|
Packit |
67cb25 |
estimators of scale*, Comp. Stat., Physica, Heidelberg, 1992.
|
|
Packit |
67cb25 |
* P. J. Rousseeuw and C. Croux, *Explicit scale estimators with high breakdown point*,
|
|
Packit |
67cb25 |
L1-Statistical Analysis and Related Methods, pp. 77-92, 1992.
|