seeking array advice
ketil+haskell at ii.uib.no
Wed Mar 2 05:50:24 EST 2005
I'm about to rework an old program, and I am pondering what data
structure to use. Basically, I want to store an index of fixed-length
words (q-grams) with positions, i.e. given a word, I can quickly find
its position in the data.
The data set may be large (gigabytes would be nice), and I also want
the option of storing a sparse set of the words in the data
(i.e. every other word, every third, etc).
I *think* I want to use a Map from words to positions, words packed to
fit in Int(32|64|eger) (depending on word size), and positions as
Int(32|64). I've used a similar approach with FiniteMaps previously,
and it worked nicely, but for smaller data sets.
Using a 'Map key [pos]' will eat a lot of space for the lists.
'Map key (UArray Int pos)' will be more efficient, but every insertion
and deletion will destroy and create a new array -- and I would like
the Map to live outside a monad (does there even exist an (IO|ST)Map?)
I've considered lists of small(ish -- cache line compatible, probably)
arrays. Another possibility is a suffix array, and a Map from words
to regions in it. How will a hashtable work in this setting?
PS: it seems from a recent thread in comp.lang.functional (see
<slrnd2autq.3hb.tomasz.zielonka at localhost.localdomain>) that array
performance has improved in GHC6.4. Impressively so.
If I haven't seen further, it is by standing in the footprints of giants
More information about the Glasgow-haskell-users