[Haskell-cafe] memory-efficient data type for Netflix data -
UArray Int Int vs UArray Int Word8
kenneth.hoste at ugent.be
Thu Feb 26 09:14:49 EST 2009
On Feb 26, 2009, at 13:00 , Manlio Perillo wrote:
> Kenneth Hoste ha scritto:
>> I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
>> To see if I could efficiently represent the data set in this way, I
>> wrote a small
>> Haskell program (attached) which uses the following data type:
> From what I see, to append a new integer to the Array, you convert
> the array to a list, append the new element to the list, and then
> convert to array again.
> Isn't this a bit inefficient?
Yes, performance-wise this is terribly inefficient, I agree. But, it
was just an artefact of how the raw data is organized.
My main concern was the memory usage of the huge IntMap with UArray
Once I solved that, I would be able to get around the performance
issue by reorganizing the raw data.
However, as I posted yesterday, I've been able to circumvent the issue
by rethinking my data type, i.e. using
the ~18K movie IDs as key instead of the 480K user IDs, which
radically limits the overhead...
That way, I'm able to fit the data set in <700M of memory, without
having to reorganize the raw data.
> The uvector package implements a vector of unboxed types, and has an
> snocU operation, to append an element to the array.
> I don't know how efficient it is, however.
> By the way, about uvector: it has a Stream data type, and you can
> build a vector from a stream.
Thanks for letting me know, I'll keep this in mind.
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.hoste at elis.ugent.be
More information about the Haskell-Cafe