[Haskell-cafe] Alternative instance for non-backtracking parsers

Mon Sep 3 18:33:06 UTC 2018

> Am 03.09.2018 um 17:29 schrieb Will Yager <will.yager at gmail.com>:
> 
> 
> 
> On Aug 30, 2018, at 11:21, Olaf Klinke <olf at aatal-apotheke.de> wrote:
> 
>> 
>> [*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
>> _______________________________________________
>> 
> 
> The combination of attoparsec + a streaming adapter for pipes/conduit/streaming should easily be able to handle tens of megabytes per second and hundreds of thousands of lines per second. 
That's good to know, so there is plenty of room for improvement left. 
> 
> For an example, check out   https://github.com/wyager/Callsigns/blob/master/Callsigns.hs
> 
> Which parses a pipe-separated-value file from the FCC pretty quickly. As I recall it goes through a  >100MB file in under three seconds, and it has to do a bunch of other work besides. 
The parser does nothing except chunk up the line's text and replace parts of it by constants. I'm surprised and pleased though that HashMaps have such good performance. 

Profiling shows that my parser now spends most time converting to numbers and dates. I wrote a primitive 

skipSep :: Int -> Char -> Parser ()

which skips over input until it has read a given number of the given character. This cut down overall execution time from 12s to 6s, meaning the parsing time is down by more than 50%. Seems the combinators like *> and >>= and 'manyTill' do have a non-neglegible cost, so combining from fewer parts makes the parser faster. Would other parser users agree? 

Olaf