[Haskell-cafe] [Biohaskell] Error using Judy arrays (was: Re: hash map/associative structure)

Olaf Klinke olf at aatal-apotheke.de
Tue May 9 19:33:18 UTC 2017


Ketil, 

Given that the dump looks fine, my gut feeling is that indeed the histogram function operates on partially evaluated data and somehow the memory is corrupted. Can you write a Deepseq instance for your Judy array? After all, you could dump to a [(String,Int)] instead of an IO handle and re-parse the data. 
I guess that forcing the k-mer itself to be fully evaluated does no good (it's a Word64, anyways?). You'd need to make sure the insertWith (+) operation is fully evaluated. 

Finally, can you rule out a normal overflow? I had sequencing runs where fastx_nucleotide_distribution_graph.sh simply overflowed the Int counter. Using Haskell's Integer produced correct results. 
Yet the thing with the bit number influencing the outcome sounds very much like data corruption. 

Olaf

> Am 09.05.2017 um 09:29 schrieb Ketil Malde <ketil at malde.org>:
> 
> 
> On 2017-05-04 12:14 (+0200), MichaƂ J Gajda <mjgajda at gmail.com> wrote:
> 
>> Interesting. Do You have an error report filed anywhere to peruse?
> 
> Well, I wrote the below.  In the mean time, I've tried replacing
> "insertWith" with a manual lookup and update.  This changed the cases
> where the segfaults happen, but didn't eliminate them.
> 
> And I tried using the safe freeze function, to no avail.
> 
> And I implemented some of the functionality using a Vector, which gives
> correct results (but without reimplementing a real hashtable, it can
> only be used for the histogram collection, not k-mer counting, since
> with large k that requires full 64-bit indexes into the vector)
> 
> I've also run on several different computers, so I don't think it's a
> hardware error.
> 
> -k
> 
> ----------------------------------------
>  So: I have a program that indexes k-mers (fixed-size words) as Word64's,
>  storing the different ones along with their counts in a Judy array, and
>  at the end, writing to a file.  So an associative map mapping  
>  k-word -> count. 
> 
>  Then, the second stage re-reads this file, and builds a histogram,
>  tallying how often each count occurs.  let's call this count -> frequency
> 
>  For some time, this just seemed to work.  But then I implemented an
>  expectation-maximization algorithm to model the frequency distributions,
>  and that failed to terminate in some cases.  Going back, I found the
>  following histogram to cause the trouble:
> 
>  ==> SRR901891_1.32.hist <==
>  865     809
>  866     775
>  867     757
>  868     781
>  869     725
>  870     778
>  871     795
>  872     774
>  873     771
>  48036128        2017612633061982208
> 
>  Notice that last value?  Apparently, there are 2.018e18 different
>  k-mers with a frequency of 48036128.  This is clearly not right.
> 
>  Notice also that the latter number is 0x01c000...000.  Does this carry
>  any special meaning?
> 
>  One possibility is of course some garbage in the k-mer count index, but
>  I also have a dump command that prints the contents as text.  I can find
>  nothing wrong with that, and the highest count k-mers look like this:
> 
>  % sort -nr -k2 *dump* | head         
>  tgggggtccagccatggagaagagtttagaca        1670602
>  gggggtccagccatggagaagagtttagacac        1653146
>  gtctaaactcttctccatggctggacccccaa        1638158
>  gaatattatttgggggtccagccatggagaag        1567591
>  ggaatattatttgggggtccagccatggagaa        1539378
>  actagtgcttaggaaatctattggaggcagaa        1532596
>  gcttctgcctccaatagatttcctaagcacta        1531091
>  aggaatattatttgggggtccagccatggaga        1528806
>  ctagtgcttaggaaatctattggaggcagaag        1527682
>  aaaggaatattatttgggggtccagccatgga        1518392
> 
>  One thing to look into is how the histogram is built.  I have wrapped
>  this in a function (mk_judy, code below) that returns a struct of
>  operations, called FreqCount.  mk_judy takes as an argument the number
>  of bits to use in the index - but this isn't really used for anything
>  (JudyL is JudyL).  Yet, when I changed this parameter from 16 to 32, the
>  last line changed to:
> 
>  17180896        2017612633061982208
> 
>  And with 64:
> 
>  37271776        2017612633061982208
> 
>  The latter number is always the same, the former ends with 0x8e0, but
>  appears to vary in the most significant bits.
> 
>  Everything else is the same in the histograms, I did a diff on the
>  outputs.
> 
>  -- Code ----------------------------------------
>  mk_judy :: Int -> IO FreqCount
>  mk_judy l = do
>    j <- J.new :: IO (J.JudyL Int)
>    let ac k = k `seq` J.insertWith (+) k 1 j
> 	gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j
> 	sc k v = J.insert k v j
> 	es = J.unsafeFreeze j >>= J.elems
> 	ks = J.unsafeFreeze j >>= J.keys
> 	as = J.unsafeFreeze j >>= J.toList
>    return $ FreqCount { key_bits = l
> 		       , add_count = ac, get_count = gc, set_count = sc
> 		       , keys = ks, counts = es, assocs = as
> 		       }
>  -- Code ----------------------------------------
> 
>  If we're sure there is nothing wrong with the Judy library, the only
>  other thing I can think of is the unsafeFreeze.  I'm pretty sure I don't
>  modify the Judy array after accessing it through the "frozen" functions
>  - but could it e.g. be the case that j is frozen with some operations
>  still unevaluated?  I tried adding "j `seq`" before unsafeFreeze, and
>  got:
> 
>  33769696        2017612633061982208
> 
>  Again, 0x...8e0 and the usual 0x01c0...
> 
>  Anyway, that's where it stands at the moment.  I'm not sure what to do,
>  it does worry me that this counting seems to fail on occasion.  Note
>  also that the histogram gets truncated (always at the same place, after
>  873), and it should really go up to 1.5 million.
> ----------------------------------------
> 
> -- 
> If I haven't seen further, it is by standing in the footprints of giants
> _______________________________________________
> Biohaskell mailing list
> Biohaskell at biohaskell.org
> http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell



More information about the Haskell-Cafe mailing list