[Haskell-cafe] uu-parsinglib - Greedy Parser

Tue Jan 13 13:25:48 UTC 2015

> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
> 
> pContent = Content <$> pList (satisfy (/= '<'))

Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms  of <<|>.

> An even better approach would be to reflect the structure of your HTML in your parser:
> 
> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag

> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
> 
> pTagged = do t <- pOpentag
>             c <- pContents
>             pClosetag t
>             return (Tagged t c)
> 
> (of course with appropriate definitions for pOpentag and pCloseTag.

Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>:
<h1> Line break tags are <br> not closed </h1>

The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to
have a rather abstract interface. In my set of core constructs there are:
	<$> : (a -> b) -> f a -> f b
 	<*> : f (a -> b) -> f a -> f b
	<|> : f a -> f a -> f a  -- (symmetric choice)
	pure : a -> f a
	fail : f a 
	pToken : f Char 

Is it possible to define a parser that applies the longest matching rule using these constructs only?
Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ?
(Note that f is abstract and it is not necessarily uu-parsinglib parsers).

Marco

On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:

> 
>> On 13 Jan 2015, at 11:57 , Marco Vassena <marco.vax91 at gmail.com> wrote:
>> 
>> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
>> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>> 
>> Consider for instance this html parser.
>> 
>> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>> 
>> pContent = Content <$> some (satisfy (/= '<'))
>> pHtml = some pTag
>> 
>> This parser will fail on "<a>123</a>", because this may be interpreted as:
>> [ [Open a , Content "123", Close a], 
>> [Open a, Content "12", Content "3", Close a],
>> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>> 
>> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
>> any progress, so that some is greedy and Content "123" would be parsed.
>> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>> 
>> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
>> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether 
>> there is another way to do this, without using such operator <<|>.
> 
> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
> 
> pContent = Content <$> pList (satisfy (/= '<'))
> 
> An even better approach would be to reflect the structure of your HTML in your parser:
> 
> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
> 
> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
> 
> pTagged = do t <- pOpentag
>             c <- pContents
>             pClosetag t
>             return (Tagged t c)
> 
> (of course with appropriate definitions for pOpentag and pCloseTag.
> 
> Finally the module BasicInstances contains a combinator pMunch which you could have used:
> 
> pContent = Content <$>  pMunch (/= '<')
> 
> This is far more efficient since it operates directly at the state.
> 
>   Doaitse
> 
> 
> 
> 
>> 
>> Any help is appreciated.
>> 
>> All the best,
>> Marco
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>