[Haskell-cafe] memory-efficient data type for Netflix data -
UArray Int Int vs UArray Int Word8
Manlio Perillo
manlio_perillo at libero.it
Sun Mar 1 09:03:14 EST 2009
Kenneth Hoste ha scritto:
> Hello,
>
> I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
>
> I kind of have an algorithm in mind that I want to implement using Haskell,
> but up until now, the main issue has been to find a way to efficiently
> represent
> the data...
>
> For people who are not familiar with the Netflix data, in short, it
> consist of
> roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different
> movies, coming from
> 480,109 different users.
>
Hi Kenneth.
I have written a simple program that parses the Netflix training data
set, using this data structure:
type MovieRatings = IntMap (UArr Word32, UArr Word8)
The ratings are grouped by movies.
The parsing is done in:
real 8m32.476s
user 3m5.276s
sys 0m8.681s
On a DELL Inspiron 6400 notebook,
Intel Core2 T7200 @ 2.00GHz, and 2 GB memory.
However the memory used is about 1.4 GB.
How did you manage to get 700 MB memory usage?
Note that the minimum space required is about 480 MB (assuming 4 byte
integer for the ID, and 1 byte integer for rating).
Using a 4 byte integer for both ID and rating, the space required is
about 765 MB.
1.5 GB is the space required if one uses a total of 16 bytes to store
both the ID and the rating.
Maybe it is the garbage collector that does not release memory to the
operating system?
Thanks Manlio Perillo
More information about the Haskell-Cafe
mailing list