[Haskell-cafe] Alternative instance for non-backtracking parsers
olf at aatal-apotheke.de
Thu Aug 30 18:21:07 UTC 2018
> Hello, Olaf. I have some distrust of elegant solutions (one of them are
> C.P. libs).
I have a program that parses several CSV files, one of them 50MB in size, and writes its result as HTML. When I started optimizing, the execution time was 144 seconds. Profiling (thanks to Jasper Van der Jeugt for writing profiteur!) revealed that most of the time was spent parsing and postprocessing the 50MB CSV file. Changing the data structure of the postprocessing stage cut down the execution time to 32 seconds, but still the majority is spent on parsing.
Then I realized that (StateT String Maybe) is a parser which conveniently has all the class instances one needs, most notably its Alternative instance make it a backtracking parser. After defining a few combinators I was able to swap out my megaparsec parser against the new parser, which slashed execution time in half. Now most of the parsing time is dedicated to transforming text to numbers and dates. I doubt that parsing time can be reduced much further [*]. The new parser was identical to the old parser, only the combinators now come from another module. That is the elegant thing about monadic parser libraries.
I will now use the fast parser by default, and if it returns a Nothing, the program will suggest a command line flag that switches to the original megaparsec parser, exactly telling the user where the parse failed and why.
I am not sure whether there is another family of parsers that have interfaces so similar that switching from one package to another is as effortless as monadic parsers.
[*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
More information about the Haskell-Cafe