[Haskell-cafe] Data.Binary poor read performance
wren ng thornton
wren at freegeek.org
Mon Feb 23 19:44:55 EST 2009
Neil Mitchell wrote:
> 2) The storage for String seems to be raw strings, which is nice.
> Would I get a substantial speedup by moving to bytestrings instead of
> strings? If I hashed the strings and stored common ones in a hash
> table is it likely to be a big win?
Bytestrings should help. The big wins in this application are likely to
be cache issues, though the improved memory/GC overhead is nice too.
If you have many identical strings then you will save lots by memoizing
your strings into Integers, and then serializing that memo table and the
integerized version of your data structure. The amount of savings
decreases as the number of duplications decrease, though since you don't
need the memo table itself you should be able to serialize it in a way
that doesn't have much overhead.
A refinement on memoization for when you have many partial matches but
few total matches, is to chunk the strings into a series of partial
matches, which are then integerized. The trick is deciding on how big to
make your chunks (which is to say, optimizing the reuse ratio). If you
want an industrial solution like this, it may be worth looking into
Haskell bindings for SRILM or IRSTLM.
Depending on how rough your type signature was, you may also consider
using bytestring-trie to replace the Map String portion. For the keys of
your map this will give the bytestring optimization as well as prefix
sharing. You could also replace the [String] with Trie Pos where Pos
gives the position in the list, or Trie() if order is irrelevant. And
you could replace [(String,Int)] with Trie (Map Pos Int) or similar
depending on whether position is important (thus Trie (Set Int)) or
whether you can avoid conflicting Ints for the same String (thus Trie
Int). Without knowing more I'm not sure how well
Trie [Either (String, Trie Pos) (Trie (Map Pos Int))]
will match your use case, but it's worth considering. The one thing of
note is that Trie uses patricia trees like IntMap does rather than using
balanced trees like Map does. Again, I'm not sure whether the
differences will matter for your use case.
More information about the Haskell-Cafe