[Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

Kenneth Hoste kenneth.hoste at ugent.be
Mon Feb 23 13:39:34 EST 2009


Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.

I kind of have an algorithm in mind that I want to implement using  
Haskell,
but up until now, the main issue has been to find a way to efficiently  
represent
the data...

For people who are not familiar with the Netflix data, in short, it  
consist of
roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different  
movies, coming from
480,109 different users.

The idea I have in mind requires fast access to all the ratings for a  
particular user,
so I was thinking about an IntMap in terms of user ids, which maps to
movie id/rating pairs somehow.

To see if I could efficiently represent the data set in this way, I  
wrote a small
Haskell program (attached) which uses the following data type:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testMemSizeUArray_UArray_Word8.hs
Type: application/octet-stream
Size: 1389 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20090223/0cd19100/testMemSizeUArray_UArray_Word8.obj
-------------- next part --------------



type data = IntMap (UArray Int Word8) -- map of user IDs to ratings  
(lacks movie IDs)

For convenience, and because I've been discussing this with various  
people in #haskell @ IRC,
the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671 
  .

I'm aware that the way in which the UArray's are constructed (i.e. by  
adding a single element each
time), is far from efficient performance-wise, but that's not the  
point here...
By reformatting the raw data, I can easily read in the data more  
efficiently.

The issue I ran into is that changing the data type to the following,  
doesn't yield any significant
different in memory usage.

type data = IntMap (UArray Int Int) -- use Int instead of Word8 for a  
user rating

Someone (dolio) @ #haskell suggested that maybe UArray is not byte  
packed for Word8,
which would cause little difference with a UArray containing Int's,  
but someone else (dons @ #ghc)
was able to tell me it _is_ byte packed.

Does anyone know why the Word8 version is not significantly better in  
terms of memory usage?

greetings,

Kenneth

PS: My adventures on trying to tackle the Netflix Prize problem with  
Haskell can be followed at http://boegel.kejo.be.

-- 

Kenneth Hoste
ELIS - Ghent University
email: kenneth.hoste at elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste



More information about the Haskell-Cafe mailing list