[Haskell-cafe] advice on a parsing function

Wed Mar 11 12:21:53 EDT 2009

2009/3/11 Manlio Perillo <manlio_perillo at libero.it>:
> Hi.
>
> I'm still working on the Netflix Prize; now I have managed to implement a
> parsing function for the qualifying data set (or quiz set).
>
> The quiz set is in the format:
>
> 1:
> 10
> 20
> 30
> 2:
> 100
> 200
> 3:
> 1000
> 2000
> 3000
> 4000
> 5000
>
>
> Where 1, 2, 3 are movie ids, and the others are user ids.
>
> The parsing program is at:
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2300
>
>
> The program reads the file using lazy IO.
> One of the feature I want is, for the quiz function, to be a *good
> producer*.
>
> I'm quite satisfied with the result (this is the first "complex" parsing
> function I have written in Haskell), and I have also managed to avoid the
> use of an accumulator.
>
> However I'm interested to know it there is a more efficient solution.
>
>
> The qualifying.txt file is 51MB, 2834601 lines.
>
> On my laptop, the function performance are:
>
> real    1m14.117s
> user    0m2.496s
> sys     0m0.136s
>
> CPU usage is about 3%,
> system load is about 0.20,
> memory usage is 4956 KB.
>
>
> What I'm worried about is:
>
> quiz' ((id, ":") : l) = (id, quiz'' l) : quiz' l
> quiz' ((id, _) : l) = quiz' l
>
>
> the problem here is that the same elements in the list are processed
> multiple times.
>
>
> I have written alternate versions.
> The first using foldl
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2303
> (that, however, builds the entire data structure in memory, and in reverse
> order)
>
> The latter using foldr
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2304
> (that, however, is incorrect and I'm unable to fix).

Hi,

I suggest you try an alternative strategy.
That altenrative strategy is twofold, just like you have
quiz' and quiz'.
This alternate strategy avoid pattern matching on strings
(which would be cumbersome for a bit more complex syntax).

So you have to write two functions :
one try to parse an 'id:' from the string.
It either succeed and returns the Int value of
the id, or it fails. You can use Either to encode
success and failure. Furthermore that function has
to return the remaining string (which is the same
as received upon failure).

The second function do a similar thing, this time
failing on 'id:'. (And it still returns alos the remainstring).

If you're familiar with the State monad, you can
write the above two functions by using the string
as the state.

Now, given those two functions, try to apply them
on your input string, feeding the next function application
with the resulting string of the current application.

What I proposed here is the very basics of a natural and
very used parsing technique (google for parser combinators).
That technic will scale well for more complex inputs, and,
I believe, should provide you with sufficient performance.

Cheers,
Thu