[Haskell-beginners] Performance

Sun Nov 22 13:32:26 EST 2009

Am Sonntag 22 November 2009 18:59:04 schrieb Philip Scott:
> Hi again folks,
>
> I am still at it with my time-series problem, for those who haven't been
> following; I have a list of (time stamp, value) pairs and I need to do
> various bits and bobs with them. I have got arithmetic down pat now, thanks
> to the kind help of various members of the list - now I am looking at
> functions that look at some historical data in the time-series and do some
> work on that to give me an answer for a particular day.
>
> I have chosen to represent my time series in reverse date order, since non
> of the operations will ever want to look into the future, but often they
> would like to look in to the past.
>
> A function I would like to write is 'avg'. For a particular day, it
> computes the average of the values last 'n' points; if there are not n
> points to fetch, thee is no answer. I then combine those to make a new time
> series.
>
> e.g.
>
> If my input time series was
>
> [(5,10),(4,20),(3,30),(2,40), (1,50)]
>
> (Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values)
>
> I would like the answer
>
> [(5,20), (4,30), (3,40)]
>
> (e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1
> because there isn't enough historical data)
>
> So I have written some code to do this, and it works nicely enough; but it
> is _slow_. To do 1000 averages of different lengths on a series with only
> 3000 points takes about 200 seconds on my (not overly shabby) laptop. The
> equivalent C program takes under a second.
>
> I am entirely sure that this is due to some failing on my part. I have been
> mucking around with the profiler all afternoon lazifying and delazifying
> various bits and bobs with no dramatic success so I thought I might put it
> to y'all if you don't mind!
>
> So here's some code. I've kept it quite general because there are a lot of
> functions I would like to implement that do similar things with bits of
> historical data.
>
> General comments on the Haskellyness/goodness of my code are welcomed as
> well, I'm still very much a beginner at this!
>
> --------- SNIP --------------
>
> -- Take n elements from a list if at least n exist
> takeMaybe n l | length l < n = Nothing
>
>               | otherwise    = Just $! (take n l)

Ouch, that makes your algorithm quadratic already.
Checking "length l < n" must trverse the entire list:

3000 nodes + 2999 nodes + 2998 nodes + you get the idea.

takeMaybe n l
    | null $ drop (n-1) l  = Nothing
    | otherwise     = Just (take n l)

Or a variation,

case splitAt (n-1) l of
    (a,h:t) -> Just (a ++ [h])
    _ -> Nothing

(test which is faster, play with various sorts of strictness,...)

>
> -- Little utility function, take a function f and apply it to the whole
> list, -- then the tail etc...
> lMap _ []     = []
> lMap f (x:xs) = (f (x:xs)):(lMap f xs)

lMap f = map f . tails

(Data.List.tails and Data.List.inits are often useful, more idiomatic anyway)

>
> -- Little utility function to take a list containing Maybes and delete them
> -- Returning a list with the values inside the Just
> maybeListToList [] = []
> maybeListToList (x:xs) = maybe (maybeListToList xs)
>                                (\y -> y:(maybeListToList xs))
>                                x

Look at Data.Maybe.catMaybes

>
> -- Return a list of lists, where each sublist is a list of the next n
> values histMaybe x = lMap (takeMaybe x)
> hist n x = maybeListToList $ histMaybe n x

map (take n) $ takeWhile (not . null . drop (n-1)) $ tails xs

>
> -- Take a function which works on a list of things and apply it only to a
> -- list of the second elements in a list of tuples 'l'.
> applyToValues f l = let (ts,vs) = unzip l
>                                 in zip ts $ f vs
>
> -- Create a timeseries with the cumulative sum of the last n values
> cumL n l = map sum (hist n l)
> cum = applyToValues . cumL
>
> -- Creates a timeseries with the average of the last n values
> avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l

map (/fromIntegral n), surely?

> avg = applyToValues . avgL
>
>
> --------- SNIP --------------
>
> According to the profiler (log attached), the vast majority of the time is
> spent in takeMaybe, presumably allocating and deallocating enormous amounts
> of memory for each of my little temporary sublists. I have tried liberally
> sprinkling $! and 'seq' about, thinking that might help but I am clearly
> not doing it right.
>
> Perhaps list is the wrong basic data structure for what I am doing?
>
> I hope I didn't bore you with that rather long email, I will leave it at
> that. If it would be useful, I could give you the complete program with a
> data set if anyone is keen enough to try for themselves.
>
> Thanks,
>
> Philip