[Haskell-cafe] [Biohaskell] Error using Judy arrays (was: Re: hash map/associative structure)
Olaf Klinke
olf at aatal-apotheke.de
Tue May 9 19:33:18 UTC 2017
Ketil,
Given that the dump looks fine, my gut feeling is that indeed the histogram function operates on partially evaluated data and somehow the memory is corrupted. Can you write a Deepseq instance for your Judy array? After all, you could dump to a [(String,Int)] instead of an IO handle and re-parse the data.
I guess that forcing the k-mer itself to be fully evaluated does no good (it's a Word64, anyways?). You'd need to make sure the insertWith (+) operation is fully evaluated.
Finally, can you rule out a normal overflow? I had sequencing runs where fastx_nucleotide_distribution_graph.sh simply overflowed the Int counter. Using Haskell's Integer produced correct results.
Yet the thing with the bit number influencing the outcome sounds very much like data corruption.
Olaf
> Am 09.05.2017 um 09:29 schrieb Ketil Malde <ketil at malde.org>:
>
>
> On 2017-05-04 12:14 (+0200), MichaĆ J Gajda <mjgajda at gmail.com> wrote:
>
>> Interesting. Do You have an error report filed anywhere to peruse?
>
> Well, I wrote the below. In the mean time, I've tried replacing
> "insertWith" with a manual lookup and update. This changed the cases
> where the segfaults happen, but didn't eliminate them.
>
> And I tried using the safe freeze function, to no avail.
>
> And I implemented some of the functionality using a Vector, which gives
> correct results (but without reimplementing a real hashtable, it can
> only be used for the histogram collection, not k-mer counting, since
> with large k that requires full 64-bit indexes into the vector)
>
> I've also run on several different computers, so I don't think it's a
> hardware error.
>
> -k
>
> ----------------------------------------
> So: I have a program that indexes k-mers (fixed-size words) as Word64's,
> storing the different ones along with their counts in a Judy array, and
> at the end, writing to a file. So an associative map mapping
> k-word -> count.
>
> Then, the second stage re-reads this file, and builds a histogram,
> tallying how often each count occurs. let's call this count -> frequency
>
> For some time, this just seemed to work. But then I implemented an
> expectation-maximization algorithm to model the frequency distributions,
> and that failed to terminate in some cases. Going back, I found the
> following histogram to cause the trouble:
>
> ==> SRR901891_1.32.hist <==
> 865 809
> 866 775
> 867 757
> 868 781
> 869 725
> 870 778
> 871 795
> 872 774
> 873 771
> 48036128 2017612633061982208
>
> Notice that last value? Apparently, there are 2.018e18 different
> k-mers with a frequency of 48036128. This is clearly not right.
>
> Notice also that the latter number is 0x01c000...000. Does this carry
> any special meaning?
>
> One possibility is of course some garbage in the k-mer count index, but
> I also have a dump command that prints the contents as text. I can find
> nothing wrong with that, and the highest count k-mers look like this:
>
> % sort -nr -k2 *dump* | head
> tgggggtccagccatggagaagagtttagaca 1670602
> gggggtccagccatggagaagagtttagacac 1653146
> gtctaaactcttctccatggctggacccccaa 1638158
> gaatattatttgggggtccagccatggagaag 1567591
> ggaatattatttgggggtccagccatggagaa 1539378
> actagtgcttaggaaatctattggaggcagaa 1532596
> gcttctgcctccaatagatttcctaagcacta 1531091
> aggaatattatttgggggtccagccatggaga 1528806
> ctagtgcttaggaaatctattggaggcagaag 1527682
> aaaggaatattatttgggggtccagccatgga 1518392
>
> One thing to look into is how the histogram is built. I have wrapped
> this in a function (mk_judy, code below) that returns a struct of
> operations, called FreqCount. mk_judy takes as an argument the number
> of bits to use in the index - but this isn't really used for anything
> (JudyL is JudyL). Yet, when I changed this parameter from 16 to 32, the
> last line changed to:
>
> 17180896 2017612633061982208
>
> And with 64:
>
> 37271776 2017612633061982208
>
> The latter number is always the same, the former ends with 0x8e0, but
> appears to vary in the most significant bits.
>
> Everything else is the same in the histograms, I did a diff on the
> outputs.
>
> -- Code ----------------------------------------
> mk_judy :: Int -> IO FreqCount
> mk_judy l = do
> j <- J.new :: IO (J.JudyL Int)
> let ac k = k `seq` J.insertWith (+) k 1 j
> gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j
> sc k v = J.insert k v j
> es = J.unsafeFreeze j >>= J.elems
> ks = J.unsafeFreeze j >>= J.keys
> as = J.unsafeFreeze j >>= J.toList
> return $ FreqCount { key_bits = l
> , add_count = ac, get_count = gc, set_count = sc
> , keys = ks, counts = es, assocs = as
> }
> -- Code ----------------------------------------
>
> If we're sure there is nothing wrong with the Judy library, the only
> other thing I can think of is the unsafeFreeze. I'm pretty sure I don't
> modify the Judy array after accessing it through the "frozen" functions
> - but could it e.g. be the case that j is frozen with some operations
> still unevaluated? I tried adding "j `seq`" before unsafeFreeze, and
> got:
>
> 33769696 2017612633061982208
>
> Again, 0x...8e0 and the usual 0x01c0...
>
> Anyway, that's where it stands at the moment. I'm not sure what to do,
> it does worry me that this counting seems to fail on occasion. Note
> also that the histogram gets truncated (always at the same place, after
> 873), and it should really go up to 1.5 million.
> ----------------------------------------
>
> --
> If I haven't seen further, it is by standing in the footprints of giants
> _______________________________________________
> Biohaskell mailing list
> Biohaskell at biohaskell.org
> http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
More information about the Haskell-Cafe
mailing list