[Haskell-beginners] histogram over large data

Ian Knopke ian.knopke at gmail.com
Tue Jun 5 22:09:49 CEST 2012


Hi Stephen,

I see now I didn't explain myself especially well. One way to build a
frequency count in haskell comes from LYAHFGG, chapter 7:

    ghci> map (\l@(x:xs) -> (x,length l)) . group . sort $
[1,1,1,1,2,2,2,2,3,3,2,2,2,5,6,7]
    [(1,4),(2,7),(3,2),(5,1),(6,1),(7,1)]

my guess would be that the problem with this, unless haskell is a lot
more magical than I thought, is that it requires all the entries to be
held in memory to do the sort (and then the group I suppose). The
database has hundreds of millions of entries.

My previous example in perl builds results cumulatively as it
retrieves each entry from the db - get_next_from_db() is basically an
iterator that returns undef once it runs out of entries (using DBI,
which is an SQL wrapper very similar to HDBC. I only wrote it that way
to focus on the real problem.) I could also do this in haskell with
something like accumArray but it would require a pass through to know
what all the possible categorical values are first. Also, if I
understand correctly this would also be quite slow.

What I'm really looking for is suggestions about the best or idiomatic
"haskell" way to cumulatively build these results as it goes along, in
a reasonable amount of time. For instance, my knowledge of updatable
data structures in haskell isn't very good. Is there something obvious
or well-known that I'm overlooking? In general, I'm trying to move my
haskell knowledge from toy problem level to being useful for real
world problems.


Ian




On Tue, Jun 5, 2012 at 7:54 AM, Stephen Tetley <stephen.tetley at gmail.com> wrote:
> It's likely the while loop can be replaced with a fold which has a Map
> (or IntMap if the key is an Int) as the accumulator. For a large Map
> you will need to pay attention to inserting elements strictly...
>
> The more complicated bit is without any extra knowledge about the
> implementation of the "database" `get_next_from_db()` is currently
> magic.



More information about the Beginners mailing list