[Haskell-cafe] Alternative instance for non-backtracking parsers
will.yager at gmail.com
Mon Sep 3 15:29:35 UTC 2018
On Aug 30, 2018, at 11:21, Olaf Klinke <olf at aatal-apotheke.de> wrote:
> [*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
The combination of attoparsec + a streaming adapter for pipes/conduit/streaming should easily be able to handle tens of megabytes per second and hundreds of thousands of lines per second.
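To make that combination concrete, here is a minimal sketch using conduit as the adapter. It is not taken from the Callsigns code: the `Row` type, the `name|count|value` layout and the sample chunks are all made up for illustration, and it assumes the conduit and conduit-extra packages (the latter provides `Data.Conduit.Attoparsec.conduitParser`). The input is deliberately split mid-field to show that the adapter feeds attoparsec incrementally across chunk boundaries.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Conduit                          (mapC, runConduit, sinkList,
                                                   yieldMany, (.|))
import qualified Data.Attoparsec.ByteString.Char8 as A
import           Data.ByteString                  (ByteString)
import           Data.Conduit.Attoparsec          (conduitParser)

-- | Hypothetical record layout, one per line: name|count|value
data Row = Row ByteString Int Double
  deriving (Eq, Show)

-- | Parse one pipe-separated line, including its line terminator.
row :: A.Parser Row
row = Row <$> A.takeTill (== '|') <* A.char '|'
          <*> A.decimal           <* A.char '|'
          <*> A.double            <* A.endOfLine

main :: IO ()
main = do
  -- Stand-in for chunks read from a file; the boundaries fall mid-field.
  let chunks = ["ab|1|2", ".5\ncd|", "30|0.25\n"] :: [ByteString]
  -- conduitParser re-runs the parser until the input is exhausted and
  -- pairs each result with its position range, which mapC snd drops.
  rows <- runConduit (yieldMany chunks .| conduitParser row .| mapC snd .| sinkList)
  print rows
```

For a real file you would replace `yieldMany chunks` with something like `sourceFile` under `runConduitRes`, and replace `sinkList` with a fold so the rows are not all held in memory. Note that a final line without a trailing newline makes `row` fail, which `conduitParser` reports by throwing.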
For an example, check out https://github.com/wyager/Callsigns/blob/master/Callsigns.hs,
which parses a pipe-separated-value file from the FCC pretty quickly. As I recall, it goes through a >100MB file in under three seconds, and it has to do a bunch of other work besides.
I also ported the above code to use Streaming instead of Pipes. I recall that using Streaming master, the parser I use to read the dictionary:
takeTill isEndOfLine <* endOfLine
handles about 3 million lines per second. I can’t remember what the number is for Pipes, but it’s probably similar. That’s really good for something so simple to write!
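For anyone who wants to try that one-liner in isolation, here is a small self-contained sketch. The choice of the ByteString flavour of attoparsec is an assumption (the Text modules would work the same way), and `allLines` is a hypothetical helper added only to drive the parser over a fully loaded input; it is not how the streaming version consumes data.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Control.Applicative              (many)
import           Data.Attoparsec.ByteString       (Parser, endOfInput, parseOnly,
                                                   takeTill)
import           Data.Attoparsec.ByteString.Char8 (endOfLine, isEndOfLine)
import           Data.ByteString                  (ByteString)

-- | One line without its terminator: the parser quoted above.
line :: Parser ByteString
line = takeTill isEndOfLine <* endOfLine

-- | Hypothetical helper: every line of an input that is already in memory.
allLines :: Parser [ByteString]
allLines = many line <* endOfInput

main :: IO ()
main = print (parseOnly allLines "alpha\nbeta\r\ngamma\n")
-- prints: Right ["alpha","beta","gamma"]
```

`endOfLine` accepts both `\n` and `\r\n`, so the terminator never ends up in the result. The parser does require every line, including the last, to be terminated.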
Unfortunately there is a performance bug in Streaming that’s fixed in master but hasn’t been released for a number of months :-/