[Haskell-cafe] Re: Mining Twitter data in Haskell and Clojure

Daniel Fischer daniel.is.fischer at web.de
Tue Jun 15 18:08:06 EDT 2010

On Tuesday 15 June 2010 23:26:10, Don Stewart wrote:
> deliverable:
> > Wren -- thanks for the clarification!  Someone said that Foldable on
> > Trie may not be very efficient -- is that true?
> >
> > I use ByteString as a node type for the graph; these are Twitter user
> > names.  Surely it's useful to replace them with Int, which I'll try,
> > but Clojure works with Java String fine and it simplifies all kinds of
> > exploratory data mining and debugging to keep it as a String, so I'll
> > try to get the most mileage from other things before interning.
> bytestring seems appropriate.
> > What's the exact relationship between Trie and Map and their
> > respective performance?
> Tries specialized to bytestring keys should outperform the generic Map.

That would be desirable.
I've done some profiling with the sample data and found that, once the
time for loading and saving the graphs is subtracted, some 35-40% of the
remaining time is spent looking up ByteStrings in Maps. That's far too
much for my liking. I'm not sure whether lookups with, e.g., an Int key
would be much faster, but I suspect they would be.
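
To illustrate the interning idea discussed above, here is a minimal sketch,
not taken from SocRun.hs. It assumes user names are mapped to dense Int ids
once, on first sight, so that the graph itself can be keyed by Int in an
IntMap. The names Intern, intern and buildGraph are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.IntMap.Strict as IM
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Hypothetical intern table: maps each user name to a dense Int id.
data Intern = Intern
  { nextId :: !Int
  , ids    :: !(M.Map B.ByteString Int)
  }

emptyIntern :: Intern
emptyIntern = Intern 0 M.empty

-- Return the id for a name, allocating a fresh one on first sight.
intern :: Intern -> B.ByteString -> (Intern, Int)
intern tbl@(Intern n m) name =
  case M.lookup name m of
    Just i  -> (tbl, i)
    Nothing -> (Intern (n + 1) (M.insert name n m), n)

-- Build an Int-keyed adjacency map from (from, to) edges.
-- ByteString comparisons happen only while interning; the graph
-- itself is keyed by Int.
buildGraph :: [(B.ByteString, B.ByteString)] -> (Intern, IM.IntMap [Int])
buildGraph = foldl' step (emptyIntern, IM.empty)
  where
    step (t0, g) (a, b) =
      let (t1, ia) = intern t0 a
          (t2, ib) = intern t1 b
      in (t2, IM.insertWith (++) ia [ib] g)

main :: IO ()
main = do
  let edges = [ (B.pack "alice", B.pack "bob")
              , (B.pack "alice", B.pack "carol")
              , (B.pack "bob",   B.pack "alice") ]
      (tbl, g) = buildGraph edges
  print (nextId tbl)   -- number of distinct names seen
  print (IM.toList g)  -- adjacency lists keyed by interned id
```

Whether this pays off depends on how often each name is looked up after
interning; the ByteString lookup still happens once per occurrence in the
input, so the gain comes from the later graph traversals.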

I've also fiddled a bit with the strictness and removed some unnecessary
work, which reduced heap usage by ~20%, MUT time by ~15% and GC time by
~50% (all measured on my box with a measly 1GB RAM).
It's still a far cry from a racehorse, but at least I can now run the 
sample data for the entire 35 days without having my box thrashing madly :)
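
As a generic illustration of the kind of strictness change that tends to cut
heap usage and GC time, here is a small sketch. It is an assumption about
the general technique, not the actual change made in the attached file, and
countNames is a hypothetical helper. It counts occurrences per name with a
strict left fold over a value-strict Map, so no chain of unevaluated (+)
thunks accumulates.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Count occurrences of each name.
-- foldl' forces the map at every step, and Data.Map.Strict forces each
-- updated count, so the counters are stored as evaluated Ints.
countNames :: [B.ByteString] -> M.Map B.ByteString Int
countNames = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty

main :: IO ()
main = print (M.toList (countNames (map B.pack ["bob", "alice", "bob"])))
```

With a lazy foldl and the lazy Data.Map API, the same code would keep a
thunk per update alive until the counts are finally demanded.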

The result of my endeavours is attached.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: SocRun.hs
Type: text/x-haskell
Size: 9200 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20100615/8359546c/SocRun.bin
