[Haskell-cafe] Big Arrays
saihemanth at gmail.com
Wed Oct 6 05:31:38 EDT 2010
Thanks for the response.
That sounds sequence comparison seems very impressive
On Wed, Oct 6, 2010 at 2:23 PM, Ketil Malde <ketil at malde.org> wrote:
> Hemanth Kapila <saihemanth at gmail.com> writes:
> > Let us say, we are using a bit-array of size 2^43 (that is, a byte array
> > size 2^40) to store a bloom filter. And let us further assume that we are
> > interested in a false-positive probability of 0.01
> Since we are just making up numbers, let us instead say we are using a
> bit array of size 2^32 - still too large for Int indexing (which was the
> issue here) but "only" 500MB in size.
> > That means, I will be able to use this array to represent a set of
> > cardinality 9.18e11 ~ 10^12
> ...bringing it down to less than 10^9, easily reached for building an
> indexing of k-words (tuples) in e.g. the human genome (3GB).
> But: I'm about to start analyzing Illumina sequencing data, where we have
> sequences from two individuals of the same species. I'm interesting in
> the differences between these species, so I might want to index both
> data sets and screen the sequences of each against the other to identify
> bits that don't appear in the other. Since I'll have easily tens, maybe
> a hundred gigabytes of sequence from each, it's not impossible to get
> into the territory you describe.
> (In practice, things will be limited by available RAM, sadly still some
> orders of magnitude less than 2^40 - although apparently SGI can deliver
> it. I guess I just need a bigger budget.)
> > I was curious to know what sort of programs would be dealing with sets of
> > 10^12 elements.
> Most likely a program using mutation, and probably not copying,
> multi-generation GC.
> Other data sets that have considerable size are acoustics data (I
> understand a modern sonar deliver about a Gbit/sec continously), and
> seismic data.
> Also, for more moderate bodies of data and more complex analysis, you
> might want to use less frugal data structure, like suffix arrays.
> If I haven't seen further, it is by standing in the footprints of giants
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Haskell-Cafe