[Haskell-cafe] Data.Binary poor read performance

Mon Feb 23 19:44:55 EST 2009

Neil Mitchell wrote:
> 2) The storage for String seems to be raw strings, which is nice.
> Would I get a substantial speedup by moving to bytestrings instead of
> strings? If I hashed the strings and stored common ones in a hash
> table is it likely to be a big win?

Bytestrings should help. The big wins in this application are likely to 
be cache issues, though the improved memory/GC overhead is nice too.

If you have many identical strings then you will save lots by memoizing 
your strings into Integers, and then serializing that memo table and the 
integerized version of your data structure. The amount of savings 
decreases as the number of duplications decrease, though since you don't 
need the memo table itself you should be able to serialize it in a way 
that doesn't have much overhead.

A refinement on memoization for when you have many partial matches but 
few total matches, is to chunk the strings into a series of partial 
matches, which are then integerized. The trick is deciding on how big to 
make your chunks (which is to say, optimizing the reuse ratio). If you 
want an industrial solution like this, it may be worth looking into 
Haskell bindings for SRILM or IRSTLM.

Depending on how rough your type signature was, you may also consider 
using bytestring-trie to replace the Map String portion. For the keys of 
your map this will give the bytestring optimization as well as prefix 
sharing. You could also replace the [String] with Trie Pos where Pos 
gives the position in the list, or Trie() if order is irrelevant. And 
you could replace [(String,Int)] with Trie (Map Pos Int) or similar 
depending on whether position is important (thus Trie (Set Int)) or 
whether you can avoid conflicting Ints for the same String (thus Trie 
Int). Without knowing more I'm not sure how well

     Trie [Either (String, Trie Pos) (Trie (Map Pos Int))]

will match your use case, but it's worth considering. The one thing of 
note is that Trie uses patricia trees like IntMap does rather than using 
balanced trees like Map does. Again, I'm not sure whether the 
differences will matter for your use case.

-- 
Live well,
~wren