[Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

Sun Mar 1 09:03:14 EST 2009

Kenneth Hoste ha scritto:
> Hello,
> 
> I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
> 
> I kind of have an algorithm in mind that I want to implement using Haskell,
> but up until now, the main issue has been to find a way to efficiently 
> represent
> the data...
> 
> For people who are not familiar with the Netflix data, in short, it 
> consist of
> roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different 
> movies, coming from
> 480,109 different users.
> 

Hi Kenneth.

I have written a simple program that parses the Netflix training data 
set, using this data structure:

type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movies.

The parsing is done in:
real	8m32.476s
user	3m5.276s
sys	0m8.681s

On a DELL Inspiron 6400 notebook,
Intel Core2 T7200 @ 2.00GHz, and 2 GB memory.

However the memory used is about 1.4 GB.
How did you manage to get 700 MB memory usage?

Note that the minimum space required is about 480 MB (assuming 4 byte 
integer for the ID, and 1 byte integer for rating).
Using a 4 byte integer for both ID and rating, the space required is 
about 765 MB.

1.5 GB is the space required if one uses a total of 16 bytes to store 
both the ID and the rating.

Maybe it is the garbage collector that does not release memory to the 
operating system?

Thanks  Manlio Perillo