[Haskell-cafe] Re: Mining Twitter data in Haskell and Clojure

Fri Jun 18 10:03:21 EDT 2010

braver wrote:
> Wren -- thanks for the clarification!  Someone said that Foldable on
> Trie may not be very efficient -- is that true?

That was probably me saying that I had worked on some more efficient 
implementations than those currently in use. Alas, the more efficient 
ones seem to alter the order of enumerating the values, and I haven't 
had a chance to fix that (so they're not being used currently).

As for general efficiency, I haven't run any benchmarks for folds vs 
other datastructures so I can only speculate. If you're using foldl or 
foldr instead of foldrWithKey[1] then the performance should be 
comparable to Data.Map. For both data structures, all they do is 
traverse the spine and enumerate the leaves, so the indexing data on the 
spine nodes is irrelevant.

> What's the exact relationship between Trie and Map and their
> respective performance?

Data.Trie is a big-endian patricia tree with node fusion, which is the 
same datastructure as Data.IntMap. The only difference is that IntMap 
can only take fixed-length keys (namely Ints) whereas Trie can take 
variable length keys (i.e., any string of bits). On the other hand, 
Data.Map is a form of balanced tree, which is rather different. The API 
documentation gives links to the relevant papers.

The API documentation also gives their asymptotic performance. I don't 
have all the numbers for Trie yet because my sources didn't give them 
and I haven't had the chance to prove them myself, but they will be 
similar to those for IntMap. In general, because it is specialized for 
the key type, Trie will be more efficient than (Map Bytestring) just as 
IntMap is more efficient than (Map Int).

The exact performance characteristics can vary depending on the specific 
program, however. Because patricia trees and balanced trees are such 
different structures, they excel for different usage patterns. For 
example, if you wanted to get (or enumerate) the submap containing only 
the keys beginning with "foo", tries will be much more efficient than 
the alternatives. However, if you need to use functions like 
foldrWithKey often, then tries will be at a disadvantage since they must 
reconstruct the keys.[1] As I recall, tries are also at an advantage if 
you often merge/meld maps; though I don't recall the algorithm Data.Map 
uses to be sure of that.

Similarly, since Map, IntMap, and Trie are persistent data structures, 
if you're not using that persistence then you may get better performance 
out of an ephemeral data structure like hash tables. Though you should 
first check to make sure that IntMap + interning is insufficient, since 
a big part of hash table's performance characteristics comes from the 
interning implicit in the hash function.

Choosing the right tool depends on how you want to use it.

[1] If you do need to use foldrWithKey often, an easy way to improve 
performance is to use a Trie (ByteString,Foo) and store the complete key 
alongside the value. This way you can use the faster foldl/foldr and 
still have access to the key. This will increase memory consumption, of 
course, though Trie does try to maximize sharing of key segments, so it 
won't cost as much as initially expected.

-- 
Live well,
~wren