seeking array advice

Wed Mar 2 05:50:24 EST 2005

Hi,

I'm about to rework an old program, and I am pondering what data
structure to use.  Basically, I want to store an index of fixed-length
words (q-grams) with positions, i.e. given a word, I can quickly find
its position in the data.

The data set may be large (gigabytes would be nice), and I also want
the option of storing a sparse set of the words in the data
(i.e. every other word, every third, etc).

I *think* I want to use a Map from words to positions, words packed to
fit in Int(32|64|eger) (depending on word size), and positions as
Int(32|64).  I've used a similar approach with FiniteMaps previously,
and it worked nicely, but for smaller data sets.

However:

Using a 'Map key [pos]' will eat a lot of space for the lists.  
'Map key (UArray Int pos)' will be more efficient, but every insertion
and deletion will destroy and create a new array -- and I would like
the Map to live outside a monad (does there even exist an (IO|ST)Map?)

I've considered lists of small(ish -- cache line compatible, probably)
arrays.  Another possibility is a suffix array, and a Map from words
to regions in it. How will a hashtable work in this setting?

Any thoughts?

-kzm

PS: it seems from a recent thread in comp.lang.functional (see
<slrnd2autq.3hb.tomasz.zielonka at localhost.localdomain>) that array
performance has improved in GHC6.4.  Impressively so.
-- 
If I haven't seen further, it is by standing in the footprints of giants