[Haskell-cafe] Re: Math.Statistics

Tue Sep 25 20:25:39 EDT 2007

There are a number of interesting issues raised by mbeddoe's  
Math.Statistics.

0.  Coding issues.

     Why use foldr1 (*) instead of product?

     covm xs =  split' (length xs) cs
       where
         cs = [ cov a b | a <- xs, b <- xs]
         split' n = unfoldr (\y -> if null y then Nothing
			          else Just $ splitAt n y)

     seems a rather odd way to write

	covm xs = [[cov a b | b <- xs] | a <- xs]

     which in turn is a bit wasteful because the matrix must be
     symmetric.

     I believe the author may have misunderstood "numerically stable".
     The obvious (sum xs)/(fromIntegral $ length $ xs) is fine for
     the mean, and the obvious two-pass algorithm for the variance is
     fine.  The tricky algorithms are needed for incremental one-pass
     calculation.

1.  How do you decide how general to make things?

     mean, hmean: Floating a
	But only + and / are needed, so Fractional a is enough.
     gmean: Floating a
	You need + and /, so fair enough.
     median: Floating a, Ord a
	High median and low median (not provided) need only Ord a.
	Standard median needs + and /, so fair enough.
     modes: Ord a
	Right.
     range: Num a, Ord a
	But range needs only Ord a.
     avgdev: Floating a
	Right.
     pvar, var, stddev, skew, kurtosis, cov, covm: Floating a
	How should one define the variance of a collection of complex
	numbers?  I'm not at all sure that it makes sense.  If I were
	dealing with complex numbers, I might treat them as 2-vectors,
	in which case the "standard deviation" would be a 2x2 matrix,
	not a complex number.  I suspect that these should require
	Floating a, Ord a.  For skew this is quite certain: one looks
	for "positive" or "negative" skew, and this makes no sense for
	complex numbers.
    iqr: type not specified but appears to be :: [a] -> [a]
	This is certainly the wrong type as the inter-quartile range
	is supposed to be a NUMBER, not a sequence.  You have to use
	comparisons and subtraction, so
	iqr :: Num a, Ord a => [a] -> a

2. Statistics

     This is an odd bunch of things; mostly things that are dangerous to
     use.  A collection of the techniques from "Applications, Basics,  
and
     Computing" by Hoaglin and Velleman (and was there someone else?
     memory fails me) would be nice.  A few basic robust measures like
	- median absolute deviation
	- trimmed mean
	- Winsorized standard deviation
	- percentage bend correlation
     might be nice.

3.  How laziness could help.

Suppose you ask for the mean, standard deviation, and skewness of
a bunch of numbers using this library.  Calculating the standard
deviation will recalculate the mean.  Calculating the skewness
will recalculate the standard deviation, which will recalculate the
mean.

Robust statistics, like the trimmed mean, often want to sort the
data.  And you don't want to keep on sorting the data over and over
again.

This is a case where Haskell's laziness really shines: it translates
into direct practical gains for simple code.

     data (Floating a, Ord a)
       => Simple_Continuous_Variate a
        = SCV [a] Int a a (Array Int a)

     list_to_variate xs = SCV xs n m s o
       where n = length xs
             m = sum xs / fromIntegral n
             s = sum [(x - m)^2 | x <- xs] / fromIntegral (n - 1)
             o = listArray (1,n) (sort xs)

     vLength (SCV _ n _ _ _) = n
     vMean   (SCV _ _ m _ _) = m
     vSd     (SCV _ _ _ s _) = s
     vMin    (SCV _ _ _ _ a) = a ! 1
     vMax    (SCV _ n _ _ a) = a ! n
     vRange   scv            = vMax scv - vMin scv
     vMedian (SCV _ n _ _ a)
       | odd n               = a ! ((n+1)`div`2)
       | even n              = ((a ! l) + (a ! u))/2
                               where l = n `div` 2
                                     u = n - l
     .....

The wonderful thing about this is that there are no bangs in SCV, so
the various pieces of information don't get evaluated until they are
needed, and then don't get evaluated *again*.  For example, vRange
will cause the sorted array to be built at most once.  Trimmed mean
is very little harder to write than vMedian.

Esoteric uses of lazy evaluation abound, but this one makes sense  
even to
an old number-crunching programmer.  I once built something like this in
C, but boy, it was a pain.