[Haskell-cafe] help optimizing memory usage for a program

Mon Mar 2 12:57:05 EST 2009

Bulat Ziganshin wrote:
> Hello Manlio,
> 
> Monday, March 2, 2009, 8:16:10 PM, you wrote:
> 
>> By the way: I have written the first version of the program to parse
>> Netflix training data set in D.
>> I also used ncpu * 1.5 threads, to parse files concurrently.
> 
>> However execution was *really* slow, due to garbage collection.
>> I have also tried to disable garbage collection, and to manually run a
>> garbage cycle from time to time (every 200 files parsed), but the
>> performance was the same.
> 
> maybe it would be better to use something like MapReduce and split
> your job into 100-file parts which are processed by ncpu concurrently
> executed scripts?
> 

For the process-data-1 program there is no real need, since it is already
fast enough (8 minutes on my laptop), and memory usage is not a problem
either, unless it requires more than 2 GB.

For process-data-2 there is some code that is left unevaluated (the 
array concatenation).
Simple file parsing is quite fast.
Most of the time is spent (IMHO) concatenating arrays in
foldl' (unionWith concatU), in the main function.
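
To make the shape of that fold concrete, here is a minimal sketch. It is not the actual process-data-2 code: the Rating and UserRatings types and the name mergeAll are made up for illustration, and plain lists with (++) stand in for the unboxed arrays and concatU, so that it compiles with only base and containers.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Hypothetical types: a rating is (movie id, score); maps are keyed by user id.
type Rating = (Int, Int)
type UserRatings = IM.IntMap [Rating]

-- Merge the per-movie maps one at a time. When a user appears on both
-- sides, the two rating lists are concatenated. Lists and (++) are an
-- assumption standing in for unboxed arrays and concatU.
mergeAll :: [UserRatings] -> UserRatings
mergeAll = foldl' (IM.unionWith (++)) IM.empty
```

foldl' only forces the accumulated map to weak head normal form at each step. Data.IntMap.Strict additionally forces each combined value to weak head normal form, which for a list means only its first constructor, so part of the concatenation work can still be deferred.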

This should be possible to run in parallel, with MapReduce; I have to check.

But if parsing is so slow, there are only two solutions:

1) write the parser in C++, and then serialize the data in a compact
    binary format, to be read by Haskell [1]

    But, in this case, there is no reason to write one part of the code
    in C++ and the other in Haskell ;-)

2) reimplement process-data-2 so that, instead of grouping ratings by
    user in an IntMap, it accumulates ratings in separate files
    (one per user, for a total of 480189 files [2]).

    This will avoid the need for array concatenation.

    Then parse the data again, and this should be more memory/GC
    friendly.
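
A rough sketch of what option 2 could look like, under stated assumptions: the file layout (one text file per user, named after the user id), the "movie,score" line format, and the names userFile and appendRatings are all hypothetical, and the target directory is assumed to exist already.

```haskell
import Control.Monad (forM_)

-- Hypothetical layout: one text file per user, named after the user id.
userFile :: FilePath -> Int -> FilePath
userFile dir user = dir ++ "/" ++ show user ++ ".txt"

-- Append the ratings parsed from one movie file to the per-user files.
-- Each (user, score) pair becomes one "movie,score" line in that
-- user's file, so nothing is kept in memory between movies.
appendRatings :: FilePath -> Int -> [(Int, Int)] -> IO ()
appendRatings dir movie ratings =
  forM_ ratings $ \(user, score) ->
    appendFile (userFile dir user) (show movie ++ "," ++ show score ++ "\n")
```

appendFile opens and closes the file on every call, so this version trades memory for a lot of system calls; whether that is acceptable for the full data set would have to be measured.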

[1] hoping that array creation from a stream is memory efficient.
[2] versus 17770 files with ratings grouped by movies

Regards  Manlio Perillo