[Haskell-cafe] Error using Judy arrays (was: Re: [Biohaskell] hash map/associative structure)
Ketil Malde
ketil at malde.org
Tue May 9 07:29:37 UTC 2017
On 2017-05-04 12:14 (+0200), MichaĆ J Gajda <mjgajda at gmail.com> wrote:
> Interesting. Do You have an error report filed anywhere to peruse?
Well, I wrote the below. In the mean time, I've tried replacing
"insertWith" with a manual lookup and update. This changed the cases
where the segfaults happen, but didn't eliminate them.
And I tried using the safe freeze function, to no avail.
And I implemented some of the functionality using a Vector, which gives
correct results (but without reimplementing a real hashtable, it can
only be used for the histogram collection, not k-mer counting, since
with large k that requires full 64-bit indexes into the vector)
I've also run on several different computers, so I don't think it's a
hardware error.
-k
----------------------------------------
So: I have a program that indexes k-mers (fixed-size words) as Word64's,
storing the different ones along with their counts in a Judy array, and
at the end, writing to a file. So an associative map mapping
k-word -> count.
Then, the second stage re-reads this file, and builds a histogram,
tallying how often each count occurs. let's call this count -> frequency
For some time, this just seemed to work. But then I implemented an
expectation-maximization algorithm to model the frequency distributions,
and that failed to terminate in some cases. Going back, I found the
following histogram to cause the trouble:
==> SRR901891_1.32.hist <==
865 809
866 775
867 757
868 781
869 725
870 778
871 795
872 774
873 771
48036128 2017612633061982208
Notice that last value? Apparently, there are 2.018e18 different
k-mers with a frequency of 48036128. This is clearly not right.
Notice also that the latter number is 0x01c000...000. Does this carry
any special meaning?
One possibility is of course some garbage in the k-mer count index, but
I also have a dump command that prints the contents as text. I can find
nothing wrong with that, and the highest count k-mers look like this:
% sort -nr -k2 *dump* | head
tgggggtccagccatggagaagagtttagaca 1670602
gggggtccagccatggagaagagtttagacac 1653146
gtctaaactcttctccatggctggacccccaa 1638158
gaatattatttgggggtccagccatggagaag 1567591
ggaatattatttgggggtccagccatggagaa 1539378
actagtgcttaggaaatctattggaggcagaa 1532596
gcttctgcctccaatagatttcctaagcacta 1531091
aggaatattatttgggggtccagccatggaga 1528806
ctagtgcttaggaaatctattggaggcagaag 1527682
aaaggaatattatttgggggtccagccatgga 1518392
One thing to look into is how the histogram is built. I have wrapped
this in a function (mk_judy, code below) that returns a struct of
operations, called FreqCount. mk_judy takes as an argument the number
of bits to use in the index - but this isn't really used for anything
(JudyL is JudyL). Yet, when I changed this parameter from 16 to 32, the
last line changed to:
17180896 2017612633061982208
And with 64:
37271776 2017612633061982208
The latter number is always the same, the former ends with 0x8e0, but
appears to vary in the most significant bits.
Everything else is the same in the histograms, I did a diff on the
outputs.
-- Code ----------------------------------------
mk_judy :: Int -> IO FreqCount
mk_judy l = do
j <- J.new :: IO (J.JudyL Int)
let ac k = k `seq` J.insertWith (+) k 1 j
gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j
sc k v = J.insert k v j
es = J.unsafeFreeze j >>= J.elems
ks = J.unsafeFreeze j >>= J.keys
as = J.unsafeFreeze j >>= J.toList
return $ FreqCount { key_bits = l
, add_count = ac, get_count = gc, set_count = sc
, keys = ks, counts = es, assocs = as
}
-- Code ----------------------------------------
If we're sure there is nothing wrong with the Judy library, the only
other thing I can think of is the unsafeFreeze. I'm pretty sure I don't
modify the Judy array after accessing it through the "frozen" functions
- but could it e.g. be the case that j is frozen with some operations
still unevaluated? I tried adding "j `seq`" before unsafeFreeze, and
got:
33769696 2017612633061982208
Again, 0x...8e0 and the usual 0x01c0...
Anyway, that's where it stands at the moment. I'm not sure what to do,
it does worry me that this counting seems to fail on occasion. Note
also that the histogram gets truncated (always at the same place, after
873), and it should really go up to 1.5 million.
----------------------------------------
--
If I haven't seen further, it is by standing in the footprints of giants
More information about the Haskell-Cafe
mailing list