[Haskell-cafe] uu-parsinglib - Greedy Parser

S. Doaitse Swierstra doaitse at swierstra.net
Tue Jan 13 11:22:27 UTC 2015


> On 13 Jan 2015, at 11:57 , Marco Vassena <marco.vax91 at gmail.com> wrote:
> 
> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
> 
> Consider for instance this html parser.
> 
> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
> 
> pContent = Content <$> some (satisfy (/= '<'))
> pHtml = some pTag
> 
> This parser will fail on "<a>123</a>", because this may be interpreted as:
> [ [Open a , Content "123", Close a], 
> [Open a, Content "12", Content "3", Close a],
> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
> 
> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
> any progress, so that some is greedy and Content "123" would be parsed.
> The operator <<|> in uu-parsinglib has the same greedy behaviour.
> 
> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether 
> there is another way to do this, without using such operator <<|>.

The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):

pContent = Content <$> pList (satisfy (/= '<'))

An even better approach would be to reflect the structure of your HTML in your parser:

pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag

In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):

pTagged = do t <- pOpentag
             c <- pContents
             pClosetag t
             return (Tagged t c)

(of course with appropriate definitions for pOpentag and pCloseTag.

Finally the module BasicInstances contains a combinator pMunch which you could have used:

pContent = Content <$>  pMunch (/= '<')

This is far more efficient since it operates directly at the state.

   Doaitse




> 
> Any help is appreciated.
> 
> All the best,
> Marco
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe



More information about the Haskell-Cafe mailing list