|
Packit |
67cb25 |
# -*- org -*-
|
|
Packit |
67cb25 |
#+CATEGORY: statistics
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* From: James Theiler <jt@lanl.gov>
|
|
Packit |
67cb25 |
To: John Lamb <J.D.Lamb@btinternet.com>
|
|
Packit |
67cb25 |
Cc: gsl-discuss@sources.redhat.com
|
|
Packit |
67cb25 |
Subject: Re: Collecting statistics for time dependent data?
|
|
Packit |
67cb25 |
Date: Thu, 9 Dec 2004 14:18:36 -0700 (MST)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
On Thu, 9 Dec 2004, John Lamb wrote:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
] Raimondo Giammanco wrote:
|
|
Packit |
67cb25 |
] > Hello,
|
|
Packit |
67cb25 |
] >
|
|
Packit |
67cb25 |
] > I was wondering if there is a way to compute "running" statistics with
|
|
Packit |
67cb25 |
] > gsl.
|
|
Packit |
67cb25 |
] >
|
|
Packit |
67cb25 |
] Yes you can do it, but there's nothing in GSL that does it and its eay
|
|
Packit |
67cb25 |
] enough that you don't need GSL. Something like (untested)
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
] double update_mean( double* mean, int* n, double x ){
|
|
Packit |
67cb25 |
] if( *n == 1 )
|
|
Packit |
67cb25 |
] *mean = x;
|
|
Packit |
67cb25 |
] else
|
|
Packit |
67cb25 |
] *mean = (1 - (double)1 / *n ) * *mean + x / n;
|
|
Packit |
67cb25 |
] }
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
] will work and you can derive a similar method for updating the variance
|
|
Packit |
67cb25 |
] using the usual textbook formula.
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
] var[x] = (1/n) sum x^2_i - mean(x)^2
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
] I don't know if there is a method that avoids the rounding errors. I
|
|
Packit |
67cb25 |
] don't know why so many textbooks repeat this formula without the
|
|
Packit |
67cb25 |
] slightest warning that it can go so badly wrong.
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
]
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
Stably updating mean and variance is remarkably nontrivial. There was
|
|
Packit |
67cb25 |
a series of papers in Comm ACM that discussed the issue; the final one
|
|
Packit |
67cb25 |
(that I know of) refers back to the earlier ones, and it can be found
|
|
Packit |
67cb25 |
in D.H.D. West, Updating mean and variance estimates: an improved
|
|
Packit |
67cb25 |
method, Comm ACM 22:9, 532 (1979) [* I see Luke Stras just sent this
|
|
Packit |
67cb25 |
reference! *]. I'll just copy out the pseudocode since the paper is
|
|
Packit |
67cb25 |
old enough that it might not be easy to find. This, by the way, is
|
|
Packit |
67cb25 |
generalized for weighted data, so it assumes that you get a weight and
|
|
Packit |
67cb25 |
a data value (W_i and X_i) that you use to update the estimates XBAR
|
|
Packit |
67cb25 |
and S2:
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
SUMW = W_1
|
|
Packit |
67cb25 |
M = X_1
|
|
Packit |
67cb25 |
T = 0
|
|
Packit |
67cb25 |
For i=2,3,...,n
|
|
Packit |
67cb25 |
{
|
|
Packit |
67cb25 |
Q = X_i - M
|
|
Packit |
67cb25 |
TEMP = SUM + W_i // typo: He meant SUMW
|
|
Packit |
67cb25 |
R = Q*W_i/TEMP
|
|
Packit |
67cb25 |
M = M + R
|
|
Packit |
67cb25 |
T = T + R*SUMW*Q
|
|
Packit |
67cb25 |
SUMW = TEMP
|
|
Packit |
67cb25 |
}
|
|
Packit |
67cb25 |
XBAR = M
|
|
Packit |
67cb25 |
S2 = T*n/((n-1)*SUMW)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
jt
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
--
|
|
Packit |
67cb25 |
James Theiler Space and Remote Sensing Sciences
|
|
Packit |
67cb25 |
MS-B244, ISR-2, LANL Los Alamos National Laboratory
|
|
Packit |
67cb25 |
Los Alamos, NM 87545 http://nis-www.lanl.gov/~jt
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Look at STARPAC ftp://ftp.ucar.edu/starpac/ and Statlib
|
|
Packit |
67cb25 |
http://lib.stat.cmu.edu/ for more ideas
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Try using the Kahan summation formula to improve accuracy for the
|
|
Packit |
67cb25 |
NIST tests (see Brian for details, below is a sketch of the algorithm).
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
sum = x(1)
|
|
Packit |
67cb25 |
c = 0
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
DO i = 2, 1000000, 1
|
|
Packit |
67cb25 |
y = x(i) - c
|
|
Packit |
67cb25 |
t = sum + y
|
|
Packit |
67cb25 |
c = (t - sum) - y
|
|
Packit |
67cb25 |
sum = t
|
|
Packit |
67cb25 |
ENDDO
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Prevent incorrect use of unsorted data for quartile calculations
|
|
Packit |
67cb25 |
using a typedef for sorted data (?)
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Rejection of outliers
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Time series. Auto correlation, cross-correlation, smoothing (moving
|
|
Packit |
67cb25 |
average), detrending, various econometric things. Integrated
|
|
Packit |
67cb25 |
quantities (area under the curve). Interpolation of noisy data/fitting
|
|
Packit |
67cb25 |
-- maybe add that to the existing interpolation stuff.What about
|
|
Packit |
67cb25 |
missing data and gaps?
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
There is a new GNU package called gretl which does econometrics
|
|
Packit |
67cb25 |
|
|
Packit |
67cb25 |
* Statistical tests (equal means, equal variance, etc).
|
|
Packit |
67cb25 |
|