[Haskell-cafe] Error using Judy arrays (was: Re: [Biohaskell] hash map/associative structure)

Ketil Malde ketil at malde.org
Tue May 9 07:29:37 UTC 2017


On 2017-05-04 12:14 (+0200), MichaƂ J Gajda <mjgajda at gmail.com> wrote:

> Interesting. Do You have an error report filed anywhere to peruse?

Well, I wrote the below.  In the mean time, I've tried replacing
"insertWith" with a manual lookup and update.  This changed the cases
where the segfaults happen, but didn't eliminate them.

And I tried using the safe freeze function, to no avail.

And I implemented some of the functionality using a Vector, which gives
correct results (but without reimplementing a real hashtable, it can
only be used for the histogram collection, not k-mer counting, since
with large k that requires full 64-bit indexes into the vector)

I've also run on several different computers, so I don't think it's a
hardware error.

-k

----------------------------------------
  So: I have a program that indexes k-mers (fixed-size words) as Word64's,
  storing the different ones along with their counts in a Judy array, and
  at the end, writing to a file.  So an associative map mapping  
  k-word -> count. 

  Then, the second stage re-reads this file, and builds a histogram,
  tallying how often each count occurs.  let's call this count -> frequency

  For some time, this just seemed to work.  But then I implemented an
  expectation-maximization algorithm to model the frequency distributions,
  and that failed to terminate in some cases.  Going back, I found the
  following histogram to cause the trouble:

  ==> SRR901891_1.32.hist <==
  865     809
  866     775
  867     757
  868     781
  869     725
  870     778
  871     795
  872     774
  873     771
  48036128        2017612633061982208

  Notice that last value?  Apparently, there are 2.018e18 different
  k-mers with a frequency of 48036128.  This is clearly not right.

  Notice also that the latter number is 0x01c000...000.  Does this carry
  any special meaning?

  One possibility is of course some garbage in the k-mer count index, but
  I also have a dump command that prints the contents as text.  I can find
  nothing wrong with that, and the highest count k-mers look like this:

  % sort -nr -k2 *dump* | head         
  tgggggtccagccatggagaagagtttagaca        1670602
  gggggtccagccatggagaagagtttagacac        1653146
  gtctaaactcttctccatggctggacccccaa        1638158
  gaatattatttgggggtccagccatggagaag        1567591
  ggaatattatttgggggtccagccatggagaa        1539378
  actagtgcttaggaaatctattggaggcagaa        1532596
  gcttctgcctccaatagatttcctaagcacta        1531091
  aggaatattatttgggggtccagccatggaga        1528806
  ctagtgcttaggaaatctattggaggcagaag        1527682
  aaaggaatattatttgggggtccagccatgga        1518392

  One thing to look into is how the histogram is built.  I have wrapped
  this in a function (mk_judy, code below) that returns a struct of
  operations, called FreqCount.  mk_judy takes as an argument the number
  of bits to use in the index - but this isn't really used for anything
  (JudyL is JudyL).  Yet, when I changed this parameter from 16 to 32, the
  last line changed to:

  17180896        2017612633061982208

  And with 64:

  37271776        2017612633061982208

  The latter number is always the same, the former ends with 0x8e0, but
  appears to vary in the most significant bits.

  Everything else is the same in the histograms, I did a diff on the
  outputs.

  -- Code ----------------------------------------
  mk_judy :: Int -> IO FreqCount
  mk_judy l = do
    j <- J.new :: IO (J.JudyL Int)
    let ac k = k `seq` J.insertWith (+) k 1 j
	gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j
	sc k v = J.insert k v j
	es = J.unsafeFreeze j >>= J.elems
	ks = J.unsafeFreeze j >>= J.keys
	as = J.unsafeFreeze j >>= J.toList
    return $ FreqCount { key_bits = l
		       , add_count = ac, get_count = gc, set_count = sc
		       , keys = ks, counts = es, assocs = as
		       }
  -- Code ----------------------------------------

  If we're sure there is nothing wrong with the Judy library, the only
  other thing I can think of is the unsafeFreeze.  I'm pretty sure I don't
  modify the Judy array after accessing it through the "frozen" functions
  - but could it e.g. be the case that j is frozen with some operations
  still unevaluated?  I tried adding "j `seq`" before unsafeFreeze, and
  got:

  33769696        2017612633061982208

  Again, 0x...8e0 and the usual 0x01c0...

  Anyway, that's where it stands at the moment.  I'm not sure what to do,
  it does worry me that this counting seems to fail on occasion.  Note
  also that the histogram gets truncated (always at the same place, after
  873), and it should really go up to 1.5 million.
----------------------------------------

-- 
If I haven't seen further, it is by standing in the footprints of giants


More information about the Haskell-Cafe mailing list