[Haskell-cafe] advice on a parsing function
Manlio Perillo
manlio_perillo at libero.it
Wed Mar 11 11:59:21 EDT 2009
Hi.
I'm still working on the Netflix Prize; now I have managed to implement
a parsing function for the qualifying data set (or quiz set).
The quiz set is in the format:
1:
10
20
30
2:
100
200
3:
1000
2000
3000
4000
5000
Where 1, 2, 3 are movie ids, and the others are user ids.
The parsing program is at:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2300
The program reads the file using lazy IO.
One of the feature I want is, for the quiz function, to be a *good
producer*.
I'm quite satisfied with the result (this is the first "complex" parsing
function I have written in Haskell), and I have also managed to avoid
the use of an accumulator.
However I'm interested to know it there is a more efficient solution.
The qualifying.txt file is 51MB, 2834601 lines.
On my laptop, the function performance are:
real 1m14.117s
user 0m2.496s
sys 0m0.136s
CPU usage is about 3%,
system load is about 0.20,
memory usage is 4956 KB.
What I'm worried about is:
quiz' ((id, ":") : l) = (id, quiz'' l) : quiz' l
quiz' ((id, _) : l) = quiz' l
the problem here is that the same elements in the list are processed
multiple times.
I have written alternate versions.
The first using foldl
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2303
(that, however, builds the entire data structure in memory, and in
reverse order)
The latter using foldr
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2304
(that, however, is incorrect and I'm unable to fix).
The performances of the foldr version are very similar to the
performances of the first implementation (it make use, however, of 3704
KB, and it is about 3 seconds faster).
P.S:
the expected result for the sample quiz set I have posted is:
[(1,[10,20,30]),(2,[100,200]),(3,[1000,2000,3000,4000,5000])]
The foldl version produces:
[(3,[5000,4000,3000,2000,1000]),(2,[200,100]),(1,[30,20,10])]
The foldr version produces:
[(1,[]),(2,[10,20,30]),(3,[100,200]),(5000,[1000,2000,3000,4000])]
Thanks Manlio Perillo
More information about the Haskell-Cafe
mailing list