[Haskell-cafe] [Biohaskell] hash map/associative structure

Ketil Malde ketil at malde.org
Thu May 4 10:05:08 UTC 2017


> I know it may be a long shot, but did you consider using columnar data store like Apache Arrow?

Arrow might be an option, but is there a Haskell interface?  (Googling
gives the obvious hits regarding arrows, and Google doesn't seem to care
about me adding +apache to the search; it gives me results where
"+apache" is overstruck.)

> Without knowing more about your application it is a bit difficult to produce more hints.
> What is your application?

The short story is that I extract a number of 64-bit values from my
data, and want to maintain frequency counts for each unique value.  So
there'll be on the order of 10^9 (plus/minus an order of magnitude)
unique values, with counts ranging from one to a few million (large
counts being rare).
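
To pin down the semantics of "frequency counts for each unique value", here is a minimal sketch. It assumes an ordinary Data.Map.Strict from the containers package, not the Judy arrays discussed in this thread, and the name countValues is hypothetical. A balanced tree is unlikely to be practical at around 10^9 keys, so this only illustrates what the structure has to compute, not how to make it fit in memory.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')
import Data.Word (Word64)

-- | Count how often each 64-bit value occurs in the input.
-- (Hypothetical helper; stands in for whatever associative
-- structure ends up being used.)
countValues :: [Word64] -> M.Map Word64 Int
countValues = foldl' bump M.empty
  where
    -- Insert the value with count 1, or add 1 to its existing count.
    bump m v = M.insertWith (+) v 1 m
```

Any replacement structure (Judy array, hash table, columnar store) would need to support the same "insert or increment" operation on Word64 keys.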

The long explanation is that I'm doing k-mer counts for molecular sequences,
breaking DNA sequence data into overlapping words of fixed size (the
parameter k), and counting their occurrences.  I encode them as Word64,
using two bits per nucleotide (the alphabet is A, C, G, and T).  This is
of course a fairly staple thing to do, and there is no lack of
alternative programs that do it - but I'd like mine to work anyway, and
it annoys me to have run into this particular bug, whether the cause is
my own code, the Judy FFI, the GHC runtime or libraries, the libjudy
code, a GHC compilation issue, or a hardware error.
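
The two-bits-per-nucleotide encoding described above can be sketched roughly as follows. The specific code assignment (A=0, C=1, G=2, T=3), the names encodeNuc and kmers, the restriction to upper-case input, and the choice to reject a whole sequence containing any other character are all assumptions made for illustration; the actual program may differ on each point.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word64)

-- | Two-bit code for a nucleotide (assumed assignment);
-- Nothing for any other character, e.g. N.
encodeNuc :: Char -> Maybe Word64
encodeNuc 'A' = Just 0
encodeNuc 'C' = Just 1
encodeNuc 'G' = Just 2
encodeNuc 'T' = Just 3
encodeNuc _   = Nothing

-- | All overlapping k-mers of a sequence, each packed into a Word64.
-- k must be between 1 and 32, since 32 nucleotides fill 64 bits.
-- Returns Nothing for an invalid k or a non-ACGT character.
kmers :: Int -> String -> Maybe [Word64]
kmers k s
  | k < 1 || k > 32 = Nothing
  | otherwise       = go <$> traverse encodeNuc s
  where
    -- Keep only the lowest 2*k bits after each shift.
    mask :: Word64
    mask = if k == 32 then maxBound else (1 `shiftL` (2 * k)) - 1
    -- Rolling update: shift in one nucleotide at a time, then
    -- discard the first k partial words (fewer than k nucleotides).
    step acc c = ((acc `shiftL` 2) .|. c) .&. mask
    go codes   = drop k (scanl step 0 codes)
```

With this assignment, kmers 2 "ACGT" yields the words for AC, CG and GT; a sequence shorter than k yields an empty list.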

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

