[Haskell-cafe] uvector package appendU: memory leak?
wren ng thornton
wren at freegeek.org
Wed Apr 1 20:46:58 EDT 2009
Manlio Perillo wrote:
> wren ng thornton ha scritto:
> > Manlio Perillo wrote:
> > > Since ratings for each customers are parsed "at the same time", using
> > > a plain list would consume a lot of memory, since stream fusion can
> > > only be executed at the end of the parsing.
> > >
> > > On the other hand, when I want to group ratings by movies, stream
> > > fusion seems to work fine.
> > >
> > [...]
> >
> > For the problem as you've discussed it, I'd suggest a different
> > approach: You can't fit all the data into memory at once, so you
> > shouldn't try to. You should write one program that takes in the
> > per-movie grouping of data and produces a per-user file as output.
>
> Well, creating 480189 files in a directory is not a very nice thing to
> do to a normal file system.
I said *a* per-user file; that is, a file of (Movie,User,Rating) records
sorted by User, as opposed to your current file which is sorted by
Movie. As for the interim binning process, one file per user is only in
the extreme case.
Honestly, you should be able to get away with just reading through the
file and splitting it into a handful of bins (8 or 16 at most) and
assigning each user, u, to bin u = u `mod` qtyBins. The bins will be of
uneven sizes, but they should be even enough that each is small enough
to fit into memory. For this size data you can probably even leave the
bins as a bag of (User,Movie,Rating) records and forgo the follow up
step of sorting each bin by User.
--
Live well,
~wren
More information about the Haskell-Cafe
mailing list