Blame statistics/TODO

Packit 67cb25
# -*- org -*-
Packit 67cb25
#+CATEGORY: statistics
Packit 67cb25
Packit 67cb25
* From: James Theiler <jt@lanl.gov>
Packit 67cb25
To: John Lamb <J.D.Lamb@btinternet.com>
Packit 67cb25
Cc: gsl-discuss@sources.redhat.com
Packit 67cb25
Subject: Re: Collecting statistics for time dependent data?
Packit 67cb25
Date: Thu, 9 Dec 2004 14:18:36 -0700 (MST)
Packit 67cb25
Packit 67cb25
On Thu, 9 Dec 2004, John Lamb wrote:
Packit 67cb25
Packit 67cb25
] Raimondo Giammanco wrote:
Packit 67cb25
] > Hello,
Packit 67cb25
] > 
Packit 67cb25
] >  I was wondering if there is a way to compute "running" statistics with
Packit 67cb25
] > gsl.
Packit 67cb25
] > 
Packit 67cb25
] Yes you can do it, but there's nothing in GSL that does it and its eay 
Packit 67cb25
] enough that you don't need GSL. Something like (untested)
Packit 67cb25
] 
Packit 67cb25
] double update_mean( double* mean, int* n, double x ){
Packit 67cb25
]    if( *n == 1 )
Packit 67cb25
]      *mean = x;
Packit 67cb25
]    else
Packit 67cb25
]      *mean = (1 - (double)1 / *n ) * *mean + x / n;
Packit 67cb25
] }
Packit 67cb25
] 
Packit 67cb25
] will work and you can derive a similar method for updating the variance 
Packit 67cb25
] using the usual textbook formula.
Packit 67cb25
] 
Packit 67cb25
] var[x] = (1/n) sum x^2_i - mean(x)^2
Packit 67cb25
] 
Packit 67cb25
] I don't know if there is a method that avoids the rounding errors. I 
Packit 67cb25
] don't know why so many textbooks repeat this formula without the 
Packit 67cb25
] slightest warning that it can go so badly wrong.
Packit 67cb25
] 
Packit 67cb25
]
Packit 67cb25
Packit 67cb25
Stably updating mean and variance is remarkably nontrivial.  There was
Packit 67cb25
a series of papers in Comm ACM that discussed the issue; the final one
Packit 67cb25
(that I know of) refers back to the earlier ones, and it can be found
Packit 67cb25
in D.H.D. West, Updating mean and variance estimates: an improved
Packit 67cb25
method, Comm ACM 22:9, 532 (1979)  [* I see Luke Stras just sent this
Packit 67cb25
reference! *].  I'll just copy out the pseudocode since the paper is
Packit 67cb25
old enough that it might not be easy to find.  This, by the way, is
Packit 67cb25
generalized for weighted data, so it assumes that you get a weight and
Packit 67cb25
a data value (W_i and X_i) that you use to update the estimates XBAR
Packit 67cb25
and S2:
Packit 67cb25
Packit 67cb25
    SUMW = W_1
Packit 67cb25
    M = X_1
Packit 67cb25
    T = 0
Packit 67cb25
    For i=2,3,...,n 
Packit 67cb25
    {
Packit 67cb25
       Q = X_i - M
Packit 67cb25
       TEMP = SUM + W_i    // typo: He meant SUMW
Packit 67cb25
       R = Q*W_i/TEMP
Packit 67cb25
       M = M + R
Packit 67cb25
       T = T + R*SUMW*Q
Packit 67cb25
       SUMW = TEMP
Packit 67cb25
    }
Packit 67cb25
    XBAR = M
Packit 67cb25
    S2 = T*n/((n-1)*SUMW)
Packit 67cb25
Packit 67cb25
Packit 67cb25
Packit 67cb25
jt 
Packit 67cb25
Packit 67cb25
-- 
Packit 67cb25
James Theiler            Space and Remote Sensing Sciences
Packit 67cb25
MS-B244, ISR-2, LANL     Los Alamos National Laboratory
Packit 67cb25
Los Alamos, NM 87545     http://nis-www.lanl.gov/~jt
Packit 67cb25
Packit 67cb25
Packit 67cb25
* Look at STARPAC ftp://ftp.ucar.edu/starpac/ and Statlib
Packit 67cb25
http://lib.stat.cmu.edu/ for more ideas
Packit 67cb25
Packit 67cb25
* Try using the Kahan summation formula to improve accuracy for the
Packit 67cb25
NIST tests (see Brian for details, below is a sketch of the algorithm).
Packit 67cb25
Packit 67cb25
      sum = x(1)
Packit 67cb25
      c = 0
Packit 67cb25
      
Packit 67cb25
      DO i = 2, 1000000, 1
Packit 67cb25
         y = x(i) - c
Packit 67cb25
         t = sum + y
Packit 67cb25
         c = (t - sum) - y
Packit 67cb25
         sum = t
Packit 67cb25
      ENDDO
Packit 67cb25
Packit 67cb25
* Prevent incorrect use of unsorted data for quartile calculations
Packit 67cb25
using a typedef for sorted data (?)
Packit 67cb25
Packit 67cb25
* Rejection of outliers
Packit 67cb25
Packit 67cb25
* Time series. Auto correlation, cross-correlation, smoothing (moving
Packit 67cb25
average), detrending, various econometric things. Integrated
Packit 67cb25
quantities (area under the curve). Interpolation of noisy data/fitting
Packit 67cb25
-- maybe add that to the existing interpolation stuff.What about
Packit 67cb25
missing data and gaps?
Packit 67cb25
Packit 67cb25
   There is a new GNU package called gretl which does econometrics
Packit 67cb25
Packit 67cb25
* Statistical tests (equal means, equal variance, etc).
Packit 67cb25