[Haskell-cafe] Mining Twitter data in Haskell and Clojure

Don Stewart dons at galois.com
Mon Jun 14 01:44:13 EDT 2010


deliverable:
> I'm computing a communication graph from Twitter data and then scanning
> it daily to allocate social capital to nodes behaving in a good karmic
> manner.  The graph is culled from 100 million tweets and has about 3
> million nodes. First I wrote the simulation of the 35 days of data in
> Clojure and then translated it into Haskell with great help from the
> glorious #haskell folks.  I had to add -A5G -K5G to make it work.  It
> does 10 days OK, hovering at 57 GB of RAM; I include profiling of that
> in sc10days.prof.
> 
> At first the Haskell executable runs faster than Clojure, not by an
> order of magnitude but by a factor of 2-3 per simulated day.  (Clojure also
> fits well in its 32 GB JVM with compressed references.)  However,
> Haskell gets stuck after a while, and for good.  Clearly I'm not doing
> Haskell optimally here, and would appreciate optimization advice.
> Here's the code:
> 
> http://github.com/alexy/husky
> 
> The data and problem description are in
> 
> http://github.com/alexy/husky/blob/master/Haskell-vs-Clojure-Twitter.md
> 
> -- also referred from the main README.md.
> 
> The main is in sc.hs, and the algorithm is in SocRun.hs.  The original
> Clojure is in socrun.clj.  This is a continuation of active Twitter
> research and the results will be published, and I'd really like to
> make Haskell work at this scale and beyond!  The seq's sprinkled in
> so far did no good.  I ran under GHC 6.10 with -O2, with and without
> -fvia-C, with no difference in stalling, and am working to bring 6.12
> to bear now.

Hey. Very cool!

When you run it with +RTS -s, how much of the runtime is being spent
in garbage collection?
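
For example (the binary name and arguments here are placeholders):

    $ ./sc +RTS -s -RTS <your args>

The summary printed at exit breaks the run down into MUT (mutator)
time and GC time; if GC dominates, that points at the data
representation or at thunk buildup.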

What are your main data types?
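
If they are plain lazy records, making the fields strict and unpacked
often shrinks the heap considerably.  A hypothetical sketch (the type
and field names are invented, not taken from SocRun.hs):

    -- Strict, unpacked fields keep each node small and fully
    -- evaluated; no hidden thunks hang off the record.
    data Node = Node
        { nodeId :: {-# UNPACK #-} !Int
        , karma  :: {-# UNPACK #-} !Double
        , degree :: {-# UNPACK #-} !Int
        }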

When you compile with -prof -auto-all and do some heap profiling, what
do you see?

There's an introduction to profiling with GHC's heap and time tools here:

    http://book.realworldhaskell.org/read/profiling-and-optimization.html#id677729
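
For GHC 6.10 the whole cycle looks roughly like this (sc stands in for
however you build and invoke the binary):

    $ ghc -O2 --make -prof -auto-all sc.hs -o sc
    $ ./sc +RTS -p -hc -RTS <your args>   # writes sc.prof and sc.hp
    $ hp2ps -c sc.hp                      # renders the heap profile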

Either way:

    * step one: do time profiling
    * step two: do space/heap profiling
    * look at the main data types being allocated and improve their
      representation (a sketch of this point follows below)
    * look at the main functions using time, and improve their
      complexity
    * iterate until happy
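
On the representation point, one very common cause of a program that
runs fine for a while and then drowns in GC is lazy accumulation
inside a Data.Map: insertWith stores a growing chain of suspended
(+)'s instead of a number.  A generic sketch, not the husky code (on
newer containers the same thing is spelled Data.Map.Strict.insertWith):

    import Data.List (foldl')
    import qualified Data.Map as M

    -- Count each edge as it streams past, forcing the new count at
    -- every insertion so no thunk chain builds up inside the map.
    countEdges :: [(Int, Int)] -> M.Map (Int, Int) Int
    countEdges = foldl' bump M.empty
      where
        bump m e = M.insertWith' (+) e 1 m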


-- Don

