[Haskell-cafe] Re: Mining Twitter data in Haskell and Clojure
wren ng thornton
wren at freegeek.org
Fri Jun 18 10:03:21 EDT 2010
> Wren -- thanks for the clarification! Someone said that Foldable on
> Trie may not be very efficient -- is that true?
That was probably me saying that I had worked on some more efficient
implementations than those currently in use. Alas, the more efficient
ones seem to alter the order of enumerating the values, and I haven't
had a chance to fix that (so they're not being used currently).
As for general efficiency, I haven't run any benchmarks for folds vs
other datastructures so I can only speculate. If you're using foldl or
foldr instead of foldrWithKey then the performance should be
comparable to Data.Map. For both data structures, all they do is
traverse the spine and enumerate the leaves, so the indexing data on the
spine nodes is irrelevant.
> What's the exact relationship between Trie and Map and their
> respective performance?
Data.Trie is a big-endian patricia tree with node fusion, which is the
same datastructure as Data.IntMap. The only difference is that IntMap
can only take fixed-length keys (namely Ints) whereas Trie can take
variable length keys (i.e., any string of bits). On the other hand,
Data.Map is a form of balanced tree, which is rather different. The API
documentation gives links to the relevant papers.
The API documentation also gives their asymptotic performance. I don't
have all the numbers for Trie yet because my sources didn't give them
and I haven't had the chance to prove them myself, but they will be
similar to those for IntMap. In general, because it is specialized for
the key type, Trie will be more efficient than (Map Bytestring) just as
IntMap is more efficient than (Map Int).
The exact performance characteristics can vary depending on the specific
program, however. Because patricia trees and balanced trees are such
different structures, they excel for different usage patterns. For
example, if you wanted to get (or enumerate) the submap containing only
the keys beginning with "foo", tries will be much more efficient than
the alternatives. However, if you need to use functions like
foldrWithKey often, then tries will be at a disadvantage since they must
reconstruct the keys. As I recall, tries are also at an advantage if
you often merge/meld maps; though I don't recall the algorithm Data.Map
uses to be sure of that.
Similarly, since Map, IntMap, and Trie are persistent data structures,
if you're not using that persistence then you may get better performance
out of an ephemeral data structure like hash tables. Though you should
first check to make sure that IntMap + interning is insufficient, since
a big part of hash table's performance characteristics comes from the
interning implicit in the hash function.
Choosing the right tool depends on how you want to use it.
 If you do need to use foldrWithKey often, an easy way to improve
performance is to use a Trie (ByteString,Foo) and store the complete key
alongside the value. This way you can use the faster foldl/foldr and
still have access to the key. This will increase memory consumption, of
course, though Trie does try to maximize sharing of key segments, so it
won't cost as much as initially expected.
More information about the Haskell-Cafe