[Haskell-cafe] Alternative instance for non-backtracking parsers
Olaf Klinke
olf at aatal-apotheke.de
Sun Sep 2 20:05:56 UTC 2018
Thanks, Bardur, for the pointer to bytestring-lexing and cassava. The problem is that my csv files are a little bit idiosyncratic: they have a BOM, semicolons as separators, fractional numbers in the format 1.234,567, and dates to parse, too. I tried parseTimeM from the time package, but that is slower than my own parser. That said, my megaparsec parser seems to spend quite some time skipping over text matching the regex ';"[^"]*"', that is, fields whose content does not concern the application at hand. Hence using a CSV library for tokenizing might be a good idea.
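Roughly, that skipping looks like the following (a minimal sketch, not my actual parser; 'Parser' and 'skipField' are names made up here):

import Control.Monad (void)
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void Text

-- skip one uninteresting field of the shape ;"..."
skipField :: Parser ()
skipField =
  void (char ';' >> char '"' >> takeWhileP Nothing (/= '"') >> char '"')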
Thanks to Peter Simons for pointing out that cassava indeed has attoparsec as a dependency.
Maybe CSV is a red herring, after all. It's just some text-based syntax where the count of semicolons indicates where in the line I can find the data I'm interested in (see the sketch below). I'm more curious about how to make the number conversion fast.
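To illustrate: splitting on semicolons and indexing would already do for the tokenizing (a sketch; 'fieldAt' is a name made up here):

import qualified Data.Text as T

-- pick the i-th semicolon-separated field, if there is one
fieldAt :: Int -> T.Text -> Maybe T.Text
fieldAt i line = case drop i (T.splitOn (T.pack ";") line) of
  (t:_) -> Just t
  []    -> Nothing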
I've looked at the source of megaparsec-6.5.0, attoparsec-0.13.2.2, bytestring-lexing-0.5.0.2 and base-4.11.1.0. They all do the same thing: Convert digits to numbers individually, then fold the list of digits as follows:
f x digit = x * 10 + value digit
number = foldl' f 0 digits
For 'value' above, Megaparsec uses Data.Char.digitToInt while Attoparsec uses Data.Char.ord.
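Spelled out as a self-contained function, the technique amounts to this (a sketch using ord, in the style of attoparsec):

import Data.Char (ord)
import Data.List (foldl')

-- fold a string of ASCII digits into an Integer,
-- consuming one decimal digit per step
decimal :: String -> Integer
decimal = foldl' step 0
  where step acc c = acc * 10 + toInteger (ord c - ord '0')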
I also rolled my own Double parser for locale reasons. Are there any libraries that handle all of the following formats
1,234.567
1234.567
1.234,567
1234,567
maybe by specifying a locale up front? It can't be done without one, since 123.456 on its own is ambiguous. Maybe the locale can be guessed from the context, maybe not, but that is certainly an expensive operation. MS Excel guesses eagerly, with occasional amusing consequences.
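For what it's worth, my hand-rolled version boils down to something like this ('Locale' and 'parseLocalizedDouble' are names invented for this sketch, not a library API):

import Data.Char (digitToInt, isDigit)
import Data.List (foldl')

-- which characters group thousands and mark the decimal point
data Locale = Locale { groupSep :: Char, decimalSep :: Char }

-- e.g. parseLocalizedDouble (Locale '.' ',') "1.234,567" == Just 1234.567
parseLocalizedDouble :: Locale -> String -> Maybe Double
parseLocalizedDouble loc s =
  case break (== decimalSep loc) (filter (/= groupSep loc) s) of
    (intPart, "")         -> fromIntegral <$> whole intPart
    (intPart, _:fracPart) -> do
      i <- whole intPart
      f <- whole fracPart
      pure (fromIntegral i + fromIntegral f / 10 ^^ length fracPart)
  where
    -- digit fold as above; rejects empty or non-digit input
    whole :: String -> Maybe Integer
    whole t
      | not (null t), all isDigit t =
          Just (foldl' (\acc c -> acc * 10 + toInteger (digitToInt c)) 0 t)
      | otherwise = Nothing

Note this does not validate the position of the group separators; input like "12.34,5" would be accepted as 1234.5.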
Thanks to all who contributed to this thread so far!
Olaf