[Haskell-cafe] memory-efficient data type for Netflix data - UArray
Int Int vs UArray Int Word8
Kenneth Hoste
kenneth.hoste at ugent.be
Mon Feb 23 13:39:34 EST 2009
Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
I kind of have an algorithm in mind that I want to implement using
Haskell,
but up until now, the main issue has been to find a way to efficiently
represent
the data...
For people who are not familiar with the Netflix data, in short, it
consist of
roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different
movies, coming from
480,109 different users.
The idea I have in mind requires fast access to all the ratings for a
particular user,
so I was thinking about an IntMap in terms of user ids, which maps to
movie id/rating pairs somehow.
To see if I could efficiently represent the data set in this way, I
wrote a small
Haskell program (attached) which uses the following data type:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testMemSizeUArray_UArray_Word8.hs
Type: application/octet-stream
Size: 1389 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20090223/0cd19100/testMemSizeUArray_UArray_Word8.obj
-------------- next part --------------
type data = IntMap (UArray Int Word8) -- map of user IDs to ratings
(lacks movie IDs)
For convenience, and because I've been discussing this with various
people in #haskell @ IRC,
the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671
.
I'm aware that the way in which the UArray's are constructed (i.e. by
adding a single element each
time), is far from efficient performance-wise, but that's not the
point here...
By reformatting the raw data, I can easily read in the data more
efficiently.
The issue I ran into is that changing the data type to the following,
doesn't yield any significant
different in memory usage.
type data = IntMap (UArray Int Int) -- use Int instead of Word8 for a
user rating
Someone (dolio) @ #haskell suggested that maybe UArray is not byte
packed for Word8,
which would cause little difference with a UArray containing Int's,
but someone else (dons @ #ghc)
was able to tell me it _is_ byte packed.
Does anyone know why the Word8 version is not significantly better in
terms of memory usage?
greetings,
Kenneth
PS: My adventures on trying to tackle the Netflix Prize problem with
Haskell can be followed at http://boegel.kejo.be.
--
Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste at elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
More information about the Haskell-Cafe
mailing list