[Haskell-cafe] Parsing a bunch of strings without backtracking

Thu Mar 16 21:50:51 UTC 2017

Am 09.03.2017 um 01:21 schrieb Javran Cheng:
> I come across an idea about parsing a bunch of strings,
> just want to share this idea and looking for some comments and maybe
> pointing me to some related articles.
>
> The idea is when you have:
>
> 1 <$ string "aabb" <|> 2 <$ string "aacc"
>
> you can rewrite it as:
>
> string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
>
> so that you don't have to deal with the common prefix "aa" twice.

In the parsing literature this is called "left factoring (a grammar)". 
Left factoring is known to improve efficiency and there are published 
algorithms to do it automatically; but -- as you observed -- to perform 
it you need an explicit representation of the grammar, so you can 
transform it before turning it into a parser; some would call this a 
"deep embedding".

> Then I further extend this idea: if we have a set of strings (s1, s2, ...),
> whose element is not a prefix of any other set member,
> why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
> into a "tree-structure" so we don't need to deal with
> any common prefix more than once?

The problem is how to represent the grammar in a statically typed way 
and then to transform it preserving the well-typing. This can be done, 
but the price is high. Look for packages (and papers) named 
'christmastree' and 'tttas' to get an idea. Even the authors admit that 
their approach is extremely involved and that offline methods (such as 
TH which you mentioned) may be more practical.

Cheers
Ben