[Haskell-cafe] Data analysis with Haskell

Don Stewart dons at galois.com
Mon Jan 12 16:18:02 EST 2009

> Hi all,
> I'm going to start a project where I'll have to do some data analysis 
> (statistics about product orders) based on database entries; it will 
> mostly be some very basic stuff like grouping by certain rules and 
> finding averages as well as summing up and such.  It will however be 
> more than what can be done directly in the database using SQL, so there 
> will be some processing in my program.
> I'm thinking about trying to do this in Haskell (because I like this 
> language a lot); however, it is surely not my most proficient language 
> and I tried to do some number crunching (real one that time) before in 
> Haskell where I had to deal with some 4 million integer lists, and this 
> failed; the program took a lot more memory than would have been 
> necessary and ran for several minutes (kept swapping all the time, too). 
>  A rewrite in Fortran did give the result in 6s and didn't run out of 
> space.

**Don't use lists when you mean to use arrays**

E.g. multiply two 4M-element arrays, map over the result, and sum that.

    import Data.Array.Vector

    main = print . sumU . mapU (+7) $ zipWithU (*)
                            (enumFromToU 1 (4000000 :: Int))
                            (enumFromToU 2 (4000001 :: Int))

Compile it:

    $ ghc -O2 -fvia-C -optc-O3 -funbox-strict-fields --make A.hs

    $ time ./A
    ./A  0.03s user 0.00s system 97% cpu 0.034 total

Not the end of the world at all.

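
The grouping and averaging described in the question can be kept lean in the same spirit. What follows is only a minimal sketch, not a definitive implementation: it assumes a containers version that provides Data.Map.Strict, and the names `Acc`, `averageByKey` and `orders` are hypothetical, with the sample rows standing in for real database results. The input is consumed once by a strict left fold, and the per-group sum and count live in strict fields, so no chain of thunks builds up.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Running sum and count for one group (hypothetical helper type).
-- The strict fields force both numbers every time an Acc is built.
data Acc = Acc !Double !Int

-- Group (key, value) pairs by key and return the mean value per key.
averageByKey :: Ord k => [(k, Double)] -> Map.Map k Double
averageByKey = Map.map mean . foldl' step Map.empty
  where
    -- Data.Map.Strict evaluates each stored Acc when it is inserted.
    step m (k, v) = Map.insertWith merge k (Acc v 1) m
    merge (Acc s1 n1) (Acc s2 n2) = Acc (s1 + s2) (n1 + n2)
    mean (Acc s n) = s / fromIntegral n

main :: IO ()
main = print (Map.toList (averageByKey orders))
  where
    -- Hypothetical (product, order amount) rows, not real data.
    orders = [("widget", 10), ("gadget", 4), ("widget", 20)]
```

Memory use here grows with the number of distinct groups rather than with the number of rows, which is usually what you want for this kind of summary.
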
> This was probably my fault at that time, because I surely did something 
> completely wrong for the Haskell style.  However, I fear I could run 
> into problems like that in the new project, too.  So I want to ask for 
> your opinions, do you think Haskell is the right language to do data 

You want to compile Haskell DB queries into SQL?

> analysis of this kind?  And do you think it is hard for still beginner 
> Haskell programmer to get this right so Haskell does not use up a lot of 
> memory for thunks or list-overhead or things like that?  And finally, 
> are there database bindings for Haskell I could use for the queries?

There are lots of database bindings. Very popular ones are HDBC and
Takusen. Check on hackage.haskell.org.

-- Don