[Haskell-cafe] Big Arrays

Hemanth Kapila saihemanth at gmail.com
Wed Oct 6 05:31:38 EDT 2010


Thanks for the response.
 That sounds sequence comparison seems very impressive

On Wed, Oct 6, 2010 at 2:23 PM, Ketil Malde <ketil at malde.org> wrote:

> Hemanth Kapila <saihemanth at gmail.com> writes:
>
> > Let us say, we are using a bit-array of size 2^43 (that is, a byte array
> of
> > size 2^40) to store a bloom filter. And let us further assume that we are
> > interested in a false-positive probability of 0.01
>
> Since we are just making up numbers, let us instead say we are using a
> bit array of size 2^32 - still too large for Int indexing (which was the
> issue here) but "only" 500MB in size.
>
> > That means, I will be able to use this array to represent  a set of
> > cardinality 9.18e11 ~ 10^12
>
> ...bringing it down to less than 10^9, easily reached for building an
> indexing of k-words (tuples) in e.g. the human genome (3GB).
>
> But: I'm about to start analyzing Illumina sequencing data, where we have
> sequences from two individuals of the same species.  I'm interesting in
> the differences between these species, so I might want to index both
> data sets and screen the sequences of each against the other to identify
> bits that don't appear in the other.  Since I'll have easily tens, maybe
> a hundred gigabytes of sequence from each, it's not impossible to get
> into the territory you describe.
>
> (In practice, things will be limited by available RAM, sadly still some
> orders of magnitude less than 2^40 - although apparently SGI can deliver
> it. I guess I just need a bigger budget.)
>
> > I was curious to know what sort of programs would be dealing with sets of
> > 10^12 elements.
>
> Most likely a program using mutation, and probably not copying,
> multi-generation GC.
>
> Other data sets that have considerable size are acoustics data (I
> understand a modern sonar deliver about a Gbit/sec continously), and
> seismic data.
>
> Also, for more moderate bodies of data and more complex analysis, you
> might want to use less frugal data structure, like suffix arrays.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20101006/796f8147/attachment.html


More information about the Haskell-Cafe mailing list