[Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

Manlio Perillo manlio_perillo at libero.it
Sun Mar 1 09:03:14 EST 2009


Kenneth Hoste ha scritto:
> Hello,
> 
> I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
> 
> I kind of have an algorithm in mind that I want to implement using Haskell,
> but up until now, the main issue has been to find a way to efficiently 
> represent
> the data...
> 
> For people who are not familiar with the Netflix data, in short, it 
> consist of
> roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different 
> movies, coming from
> 480,109 different users.
> 

Hi Kenneth.

I have written a simple program that parses the Netflix training data 
set, using this data structure:

type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movies.

The parsing is done in:
real	8m32.476s
user	3m5.276s
sys	0m8.681s

On a DELL Inspiron 6400 notebook,
Intel Core2 T7200 @ 2.00GHz, and 2 GB memory.


However the memory used is about 1.4 GB.
How did you manage to get 700 MB memory usage?


Note that the minimum space required is about 480 MB (assuming 4 byte 
integer for the ID, and 1 byte integer for rating).
Using a 4 byte integer for both ID and rating, the space required is 
about 765 MB.

1.5 GB is the space required if one uses a total of 16 bytes to store 
both the ID and the rating.



Maybe it is the garbage collector that does not release memory to the 
operating system?



Thanks  Manlio Perillo


More information about the Haskell-Cafe mailing list