[Haskell-cafe] hstats median algorithm

Mon Sep 3 14:57:53 CEST 2012

So I've been playing with the median problem today.  Not sure why, but it 
stuck in my head.

>import Data.List
>import Control.Monad.ST
>import Data.STRef
>import Control.Monad

I've been using the hashing algorithm that I described last night, but it's 
quite slow.  I must be missing something obvious.  It really shouldn't be 
slow at all!

So I have a median function:

>median

First I assume you know the range of your data.

> (min,max)

Then I ask you to figure out how many buckets you want to create, and which 
of those buckets you actually want to fill.  You should only fill the 
buckets near the middle, based on an educated guess of the distribution of 
your data, otherwise, you will end up wasting memory.  I you guessed the 
middle correctly, you will get back (Just median) otherwise, you will get 
back Nothing.

> numBuckets guessedMiddle

Then you are to pass it your list.

> list =

> let

First I seperate out the values into a list of buckets, each one holding 
values which are near to eachother.  I can figure out the length of the list
at the same time, since I have to go through the whole thing anyways.

>  (myBuckets,length) =
>   buckets
>    (min,max)
>    numBuckets
>    guessedMiddle
>    list

The buckets are set up, so that the first bucket has the lowest values, and 
the last bucket has the highest values. Each bucket, is a tuple, of it's 
length and it's contents.  We fold across the list of buckets, accumulating 
the "index so far", until we find the bucket in which the median must 
reside.

>  Right (medianBucket,stubLen) =
>   foldr
>    (\thisBucket@(thisBucketLen,_) eitheriOrMedianBucket ->
>     case eitheriOrMedianBucket of
>      Left i ->
>       if i + thisBucketLen > (length `div` 2)
>        then Right (thisBucket, thisBucketLen-((length `div` 2) - i))
>        else Left (i + thisBucketLen)
>      _ -> eitheriOrMedianBucket)
>    (Left 0)
>    myBuckets
> in

We then sort the bucket in which the median must reside, and we then find 
the median the normal way.  This should be faster, since we only had to sort
one bucket, rather than the entire list.  It is not, so there must be 
something horrible going on.

>  case snd medianBucket of
>   Just medianBucket' ->
>    Just (sort medianBucket' `genericIndex` stubLen)
>   Nothing -> Nothing

Here is my actual function for seperating the values out into buckets.

>buckets
> (min,max)
> numBuckets
> (guessedMiddleStart,guessedMiddleEnd)
> list =
> runST $ do
>  lengthRef <- newSTRef 0

First we create a list of empty buckets.  Since it would be a waste of 
memory to actually fill the buckets near the edges of our distribution(where
we are not likely to find our median), our buckets contain Maybe lists, and 
the buckets which are outside of our guessed bucket range will be filled 
with Nothing.

>  buckets' <- mapM
>               (\n->
>                 newSTRef
>                   (0,
>                    if n >= guessedMiddleStart && n <= guessedMiddleEnd
>                      then Just []
>                      else Nothing))
>               [0..numBuckets]

Then we go through the buckets, figuring out which bucket to put a given 
value into.  We calculate the length at the same time.

>  forM_ list $ \number -> do
>    let

Figure out which bucket to put this into.

>     bucket =
>      whichBucket
>       (min,max)
>       numBuckets
>       number

Increment length.

>    modifySTRef
>     lengthRef
>     (+1)

Put the value into the appropriate bucket.

>    modifySTRef
>     (buckets' `genericIndex` bucket) --Obvious optimization, use an array 
and not a list.
>     (\(oldLen,oldListMaybe)->
>       case oldListMaybe of
>        Just oldList ->
>         (oldLen+1,Just (number:oldList))
>        Nothing -> (oldLen + 1, Nothing))

>  filledBuckets <- mapM readSTRef buckets'
>  length <- readSTRef lengthRef
>  return (filledBuckets,length)

>whichBucket (min,max) numBuckets number = 
> (number - min)
>    `div`
>  ((max - min) `div` numBuckets)

So I created a little test scenario.

>someListNumBuckets = 100

>someListGuessedMiddle = (45,55)

>someListLength = genericLength someList

And found that this:

>realMedian = sort someList `genericIndex` (someListLength `div` 2)

Is actually faster ^_^ :O

*Main> realMedian 
500
(7.18 secs, 2031543472 bytes)

Than this:

>someListMedian =
> median
>  (someListMin,someListMax)
>  someListNumBuckets
>  someListGuessedMiddle
>  someList

*Main> someListMedian 
Just 500
(37.77 secs, 15376209200 bytes)

>someListMin :: Integer
>someListMin = 0

>someListMax :: Integer
>someListMax = 1000

>someList :: [Integer]
>someList =
> concatMap
>  (\n->intersperse n [someListMin..someListMax])
>  [someListMax,someListMax-1..someListMin]

---------- Původní zpráva ----------
Od: timothyhobbs at seznam.cz
Datum: 3. 9. 2012
Předmět: Re: [Haskell-cafe] hstats median algorithm
"

It really depends on how you are reading in the data and what you plan to do
with it besides taking the median.  Obviously, if you read in your data as 
an ordered list things can be done O(n) without any trouble.

In another case, if you already know the range, you can make a hash table 
and start at the middle bucket and move outwards.  That will be O(n) + O
(bucketsize log(bucketsize)) given that the middle bucket is non empty and 
I'm not horribly mistaken.  

Tim

---------- Původní zpráva ----------
Od: David Feuer <david.feuer at gmail.com>
Datum: 3. 9. 2012
Předmět: Re: [Haskell-cafe] hstats median algorithm
"

I was thinking it should offer a randomized version (taking a generator), 
since randomized median algorithms provide the best expected performance. It
could also offer a deterministic version using some variant of median-of-
medians, intended for long lists. I guess it probably should retain the 
naive version for short lists. Some benchmarking would suggest a good 
cutoff. Has anyone come up with a better practical deterministic O(n) 
algorithm since median-of-medians? I saw a paper by Dorit Dor on reducing 
the number of comparisons to a bit under 3n, which also showed a lower bound
of a bit over 2n, but the algorithm she gives strikes me as far too complex 
to be practical.

On Sep 1, 2012 9:17 PM, "Gershom Bazerman" <gershomb at gmail.com
(mailto:gershomb at gmail.com)> wrote:
" In my experience, doing much better than the naive algorithm for median is
surprisingly hard, and involves a choice from a range of trade-offs. Did you
have a particular better algorithm in mind?

If you did, you could write it, and contact the package author with a patch.

You also may be able to find something of use in Edward Kmett's order-
statistics package: http://hackage.haskell.org/package/order-statistics
(http://hackage.haskell.org/package/order-statistics)

Cheers,
Gershom

On 9/1/12 3:26 PM, David Feuer wrote:
" The median function in the hstats package uses a naive O(n log n)
algorithm. Is there another package providing an O(n) option? If not,
what would it take to get the package upgraded?

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe at haskell.org(mailto:Haskell-Cafe at haskell.org)
http://www.haskell.org/mailman/listinfo/haskell-cafe
(http://www.haskell.org/mailman/listinfo/haskell-cafe)
" 
"

"
"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120903/81790712/attachment.htm>