[Haskell-cafe] uu-parsinglib - Greedy Parser

Tue Jan 13 13:56:25 UTC 2015

> On 13 Jan 2015, at 14:25 , Marco Vassena <marco.vax91 at gmail.com> wrote:
> 
>> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>> 
>> pContent = Content <$> pList (satisfy (/= '<'))
> 
> Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms  of <<|>.
> 
>> An even better approach would be to reflect the structure of your HTML in your parser:
>> 
>> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
> 
>> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>> 
>> pTagged = do t <- pOpentag
>>            c <- pContents
>>            pClosetag t
>>            return (Tagged t c)
>> 
>> (of course with appropriate definitions for pOpentag and pCloseTag.
> 
> Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>:
> <h1> Line break tags are <br> not closed </h1>
> 
> The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to
> have a rather abstract interface. In my set of core constructs there are:
> 	<$> : (a -> b) -> f a -> f b
> 	<*> : f (a -> b) -> f a -> f b
> 	<|> : f a -> f a -> f a  -- (symmetric choice)
> 	pure : a -> f a
> 	fail : f a 
> 	pToken : f Char 
> 
> Is it possible to define a parser that applies the longest matching rule using these constructs only?
> Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ?
> (Note that f is abstract and it is not necessarily uu-parsinglib parsers).
> 
> Marco

I do not think so. As long as your grammar is essentially ambiguous you will get all parses. Using the amb combinator you can capture this, and avoid the runtime error. Then you can lateron decide which tree too take. I am afraid however that this will may lead to an exponential blowup in your case.

If the set og <br>like tags is statically known they might be taken care of in a special way? by the parser?

 Doaitse

> 
> On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:
> 
>> 
>>> On 13 Jan 2015, at 11:57 , Marco Vassena <marco.vax91 at gmail.com> wrote:
>>> 
>>> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
>>> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>>> 
>>> Consider for instance this html parser.
>>> 
>>> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>>> 
>>> pContent = Content <$> some (satisfy (/= '<'))
>>> pHtml = some pTag
>>> 
>>> This parser will fail on "<a>123</a>", because this may be interpreted as:
>>> [ [Open a , Content "123", Close a], 
>>> [Open a, Content "12", Content "3", Close a],
>>> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>>> 
>>> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
>>> any progress, so that some is greedy and Content "123" would be parsed.
>>> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>>> 
>>> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
>>> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether 
>>> there is another way to do this, without using such operator <<|>.
>> 
>> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>> 
>> pContent = Content <$> pList (satisfy (/= '<'))
>> 
>> An even better approach would be to reflect the structure of your HTML in your parser:
>> 
>> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
>> 
>> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>> 
>> pTagged = do t <- pOpentag
>>            c <- pContents
>>            pClosetag t
>>            return (Tagged t c)
>> 
>> (of course with appropriate definitions for pOpentag and pCloseTag.
>> 
>> Finally the module BasicInstances contains a combinator pMunch which you could have used:
>> 
>> pContent = Content <$>  pMunch (/= '<')
>> 
>> This is far more efficient since it operates directly at the state.
>> 
>>  Doaitse
>> 
>> 
>> 
>> 
>>> 
>>> Any help is appreciated.
>>> 
>>> All the best,
>>> Marco
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe at haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>> 
> 
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe