[Haskell-cafe] uvector package appendU: memory leak?
manlio_perillo at libero.it
Tue Mar 31 08:04:11 EDT 2009
Claus Reinke wrote:
>>> appendU is strict, insertWith just doesn't force it (follow the
>>> source link
>>> in the haddocks to see why).
>> Ok, I see.
>> But, IMHO, this should be clearly documented.
> There seems to be some agreement that strict variant operations should
> also be provided, but it needs some more thinking, and whenever I look
> at the code, I have the feeling that there are just too many operations
> defined in it, so I look away again :-(
Well, it does not matter much whether strict variant operations are
provided or not. But at the very least the documentation should state
whether each operation is strict or lazy.
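For instance, the behaviour under discussion is easy to reproduce with
Data.IntMap alone (a minimal sketch of my own, not from the thread): the
lazy insertWith never forces the combined value, so repeated inserts at
one key accumulate a chain of (+) thunks instead of a number.

```haskell
import qualified Data.IntMap as IM
import Data.List (foldl')

-- Lazy insertWith: foldl' forces the map's spine, but the value stored
-- at key 0 remains a chain of unevaluated (+) thunks until something
-- finally demands it -- exactly the kind of leak discussed above.
pileUp :: IM.IntMap Int
pileUp = foldl' (\m x -> IM.insertWith (+) 0 x m) IM.empty [1 .. 10000]
```

Only when the value is demanded (e.g. by `IM.lookup 0 pileUp`) is the
whole chain evaluated at once.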
>> However, reading the code now, I prefer my version using alter.
> Yes, it is a more direct way to enforce strict evaluation. Btw, if your
> update function 'f' is strict in 'old' and 'new', you might just want to
> use 'Just $! f old new' and save all those 'seq's.
And this seems to be a bit more memory friendly, too.
>> By the way, about insertWith/alter; from IntMap documentation:
>> insertWithKey: O(min(n,W))
>> alter: O(log n)
>> So, alter is more efficient than insertWithKey?
>> And what is that `W` ?
> 'W' is a maximum bound (see comment at top of that haddock page).
At the top of that page there is only a link to the paper where the
algorithm is defined.
> 'alter' has some shortcuts if the update function doesn't actually do
> anything, but for reimplementing 'insertWithKey' the complexity should
> be the same, as the code would be the same.
>>> But piling up appendUs is still not a good idea. For a moment,
>>> I thought that the stream representation's (+++) was handling
>>> runtime fusion gracefully, but a simple test case suggests otherwise
>>> at least for the simpler case of consU (the attached appendU.hs
>>> doesn't do any appendUs, as the consU case already demonstrates
>>> the issue; be careful with large numbers here, it'll quickly eat your
>> I'm not sure I fully understand the code.
> [ Ok ]
>> But, again, IMHO it does not apply to my original problem.
> It is a question of resource constraints, probably. Instead of maintaining
> an 'IntMap (UArr Int)' all the way through buildup, you could use an
> 'IntMap [Int]' while parsing, and then compress into 'IntMap (UArr Int)'
> when parsing is finished.
No, as I have written, this is not possible for my problem,
because it would eat all available memory within a few minutes.
The data I'm parsing, from the Netflix Prize data set, contains:
480189 customers (with customer ids ranging from 1 to 2649429)
100480507 total ratings
The ratings are grouped by movie, but I want to group them by customer.
This means that each customer has only a few ratings (about 200 on
average), so there are a lot of small arrays.
Since the ratings for each customer are parsed "at the same time", using
plain lists would consume a lot of memory: stream fusion can only take
effect at the end of the parsing.
On the other hand, when I want to group ratings by movie, stream fusion
seems to work fine.
This is IMHO, of course.
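For concreteness, the two-phase approach Claus describes would look
roughly like this (a sketch: UArray from the array package stands in for
uvector's UArr, where the compression step would instead be IM.map toU):

```haskell
import qualified Data.IntMap as IM
import Data.Array.Unboxed (UArray, listArray, elems)

-- Phase 1: while parsing, just cons each rating onto its customer's list.
addRating :: Int -> Int -> IM.IntMap [Int] -> IM.IntMap [Int]
addRating customer rating = IM.insertWith (++) customer [rating]

-- Phase 2: once parsing is done, compress each list into an unboxed array
-- in a single pass over the map.
compress :: IM.IntMap [Int] -> IM.IntMap (UArray Int Int)
compress = IM.map (\xs -> listArray (0, length xs - 1) xs)
```

The point of contention in this thread is the peak memory use of phase 1,
not the correctness of the scheme.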
> If you can afford the larger memory profile
> during parsing, this should give better overall performance, with the
> same memory need for the final state, but larger memory needs up
> to that state. I'd be interested in the differences, if you try it.
Unfortunately, I cannot afford this memory profile: my PC has only
2 GB of RAM, and I doubt many others could afford the required memory
either.
Using UArr, the total required memory is about 900 MB.
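A back-of-envelope count supports these numbers (my arithmetic, not from
the thread, assuming a 64-bit GHC where a cons cell is three words and a
boxed Int two words):

```haskell
totalRatings :: Int
totalRatings = 100480507

-- Per-element cost of [Int]: a (:) cell (header + two pointers) plus an
-- I# box (header + payload) = 5 words = 40 bytes on a 64-bit machine.
bytesPerListElem :: Int
bytesPerListElem = 5 * 8

-- Per-element cost of an unboxed array: just the 8-byte payload.
bytesPerUArrElem :: Int
bytesPerUArrElem = 8

listBytes, uarrBytes :: Int
listBytes = totalRatings * bytesPerListElem   -- ~4.0 GB: far beyond 2 GB of RAM
uarrBytes = totalRatings * bytesPerUArrElem   -- ~0.8 GB: close to the 900 MB observed
```

The gap between ~800 MB and the observed 900 MB is plausibly the IntMap
spine and per-array headers.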