[Haskell-cafe] Mining Twitter data in Haskell and Clojure
Don Stewart
dons at galois.com
Mon Jun 14 01:44:13 EDT 2010
deliverable:
> I'm computing a communication graph from Twitter data and then scan it
> daily to allocate social capital to nodes behaving in a good karmic
> manner. The graph is culled from 100 million tweets and has about 3
> million nodes. First I wrote the simulation of the 35 days of data in
> Clojure and then translated it into Haskell with great help from the
> glorious #haskell folks. I had to add -A5G -K5G to make it work. It
> does 10 days OK hovering at 57 GB of RAM; I include profiling of that
> in sc10days.prof.
>
> At first the Haskell executable goes faster than Clojure, not by an
> order of magnitude, but by 2-3 times per day simulated. (Clojure also
> fits well in its 32 GB JVM with compressed references.) However,
> Haskell gets stuck after a while, and for good. Clearly I'm not doing
> Haskell optimally here, and would appreciate optimization advice.
> Here's the code:
>
> http://github.com/alexy/husky
>
> The data and problem description are in
>
> http://github.com/alexy/husky/blob/master/Haskell-vs-Clojure-Twitter.md
>
> -- also referred from the main README.md.
>
> The main is in sc.hs, and the algorithm is in SocRun.hs. The original
> Clojure is in socrun.clj. This is a continuation of active Twitter
> research and the results will be published, and I'd really like to
> make Haskell work at this scale and beyond! The seq's sprinkled
> already did no good. I ran under ghc 6.10 with -O2 with or without
> -fvia-C, with no difference in stalling, and am working to bring 6.12
> to bear now.
Hey. Very cool!
When you run it with +RTS -s what amount of time is being spent in
garbage collection?
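
As a rough illustration of the number being asked about, the share of run time spent in GC can also be read from inside the program. This is only a sketch and rests on assumptions: it uses GHC.Stats from a recent GHC (8.2 or later, not the 6.10 mentioned above), the program must be started with +RTS -T, and gcShare is a hypothetical helper name. The workload is a placeholder, not the simulation.

```haskell
module Main where

import Data.Int (Int64)
import Data.List (foldl')
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: fraction of (GC + mutator) elapsed time spent in GC.
-- Returns 0 when no time has been recorded.
gcShare :: Int64 -> Int64 -> Double
gcShare gcNs mutNs
  | total <= 0 = 0
  | otherwise  = fromIntegral gcNs / fromIntegral total
  where
    total = gcNs + mutNs

main :: IO ()
main = do
  -- Placeholder workload so that the runtime has something to measure.
  print (foldl' (+) 0 [1 .. 5000000 :: Int])
  enabled <- getRTSStatsEnabled
  if enabled
    then do
      s <- getRTSStats
      putStrLn ("GC share of elapsed time: "
                ++ show (gcShare (gc_elapsed_ns s) (mutator_elapsed_ns s)))
      putStrLn ("Max live bytes: " ++ show (max_live_bytes s))
    else putStrLn "Statistics are off; rerun with +RTS -T"
```

A GC share well above half usually means the heap is full of live data or unevaluated thunks, which is what the heap profile is meant to confirm.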
What are your main data types?
When you compile with -prof -auto-all and do some heap profiling, what
do you see?
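
For reference, a minimal sketch of what a profiled build looks like. The functions countMentions and topUser are hypothetical stand-ins, not code from SocRun.hs; the SCC pragmas add named cost centres by hand, and -auto-all (spelled -fprof-auto on newer GHCs) adds them to every top-level function automatically.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Build, older GHC:  ghc -O2 -prof -auto-all Main.hs
-- Build, newer GHC:  ghc -O2 -prof -fprof-auto -rtsopts Main.hs
-- Run:               ./Main +RTS -p -hc -RTS
-- Render the heap:   hp2ps -c Main.hp

-- Hypothetical stand-in for one simulated day: count mentions per user id.
countMentions :: [Int] -> M.Map Int Int
countMentions xs =
  {-# SCC "countMentions" #-}
  foldl' (\m u -> M.insertWith (+) u 1 m) M.empty xs

-- Hypothetical query: the user with the most mentions, if any.
-- On ties the smallest user id wins, because keys are visited in order.
topUser :: M.Map Int Int -> Maybe (Int, Int)
topUser m =
  {-# SCC "topUser" #-}
  M.foldlWithKey' pick Nothing m
  where
    pick Nothing k v = Just (k, v)
    pick best@(Just (_, bv)) k v
      | v > bv    = Just (k, v)
      | otherwise = best

main :: IO ()
main = print (topUser (countMentions [ i `mod` 977 | i <- [1 .. 2000000 :: Int] ]))
```

The -p run writes Main.prof with time and allocation per cost centre; -hc writes Main.hp, the heap broken down by the cost centre that allocated it.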
There's an introduction to profiling with GHC's heap and time tools here:
http://book.realworldhaskell.org/read/profiling-and-optimization.html#id677729
Either way:
* step one: do time profiling
* step two: do space/heap profiling
* look at the main data types being allocated and improve their
representation.
* look at the main functions using time, and improve their
complexity.
* iterate until happy.
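
One common shape that "improve their representation" takes, as a sketch only: strict, unpacked record fields held in a strict map and updated with a strict fold, so that no chains of unevaluated thunks build up per node. UserStats, Event and applyEvents are hypothetical names for illustration and are not taken from the husky code.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical per-user record. The bang makes each field strict and
-- UNPACK stores the number directly in the constructor (when optimising).
data UserStats = UserStats
  { usCapital  :: {-# UNPACK #-} !Double
  , usMentions :: {-# UNPACK #-} !Int
  } deriving (Show, Eq)

-- Hypothetical event: (user id, change in social capital).
type Event = (Int, Double)

-- Combine two records; strict fields force both sums when the result is built.
addStats :: UserStats -> UserStats -> UserStats
addStats (UserStats c1 m1) (UserStats c2 m2) = UserStats (c1 + c2) (m1 + m2)

-- foldl' forces the map at every step, and Data.Map.Strict forces each
-- inserted value, so the map never holds suspended additions.
applyEvents :: M.Map Int UserStats -> [Event] -> M.Map Int UserStats
applyEvents = foldl' step
  where
    step m (uid, delta) = M.insertWith addStats uid (UserStats delta 1) m

main :: IO ()
main = do
  let events = [ (i `mod` 1000, 1.0) | i <- [1 .. 1000000 :: Int] ]
      stats  = applyEvents M.empty events
  print (M.size stats)
  print (M.lookup 0 stats)
```

With lazy fields and a lazy map the same loop would keep every pending (+) alive until the value is finally demanded, which tends to show up in the heap profile as steadily growing thunk allocation.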
-- Don