[Haskell-cafe] Parsing a bunch of strings without backtracking

Thu Mar 9 09:13:35 UTC 2017

After thinking about it more (I still don't completely understand uu-parsinglib myself). I think achieves non-backtracking by running the parsers in parallel and then returning the first one that succeeds. So I guess it doesn't really optimize away prefixes. The 'best' function is the function that merges two "sequences" of the intermediate type 'Steps a' and that causes the parsers to be run in parallel. The source is here: https://hackage.haskell.org/package/uu-parsinglib-2.9.1.1/docs/src/Text-ParserCombinators-UU-Core.html#best.

On 9 March 2017 03:41:30 GMT+01:00, Javran Cheng <javran.c at gmail.com> wrote:
>Hi Jaro, thanks for the link, having worked with few parsing libs but
>uu-parsinglib sounds new to me and I'm definitely thinking about having
>a
>try.
>I'm taking a look but haven't yet found any part of code that attempts
>to
>factor common prefixes out.
>
>On Wed, Mar 8, 2017 at 7:48 PM, Jaro <noughtmare at openmailbox.org>
>wrote:
>
>> Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/
>> foswiki/HUT/ParserCombinators)? It implements a non-backtracking
>parser
>> that works similarly to what you describe.
>>
>>
>> On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng <javran.c at gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I come across an idea about parsing a bunch of strings,
>>> just want to share this idea and looking for some comments and maybe
>>> pointing me to some related articles.
>>>
>>> The idea is when you have:
>>>
>>> 1 <$ string "aabb" <|> 2 <$ string "aacc"
>>>
>>> you can rewrite it as:
>>>
>>> string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
>>>
>>> so that you don't have to deal with the common prefix "aa" twice.
>>>
>>> Then I further extend this idea: if we have a set of strings (s1,
>s2,
>>> ...), whose element is not a prefix of any other set member,
>>> why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2
><|>
>>> ...) into a "tree-structure" so we don't need to deal with
>>> any common prefix more than once?
>>>
>>> So this motivates my little toy implementation at
>>> https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813
>>>
>>> basically given a list [(s1,v1),(s2,v2)...], we want to generate a
>parser
>>> equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
>>> but perhaps with less backtracking. My code does this by turning the
>set
>>> of strings into a tree structure (see function "compact") and
>>> using this "Compact" representation as a driver to generate the
>parser.
>>> I used ReadP to have a concrete parser for testing, but this should
>work
>>> with any parsing library supporting MonadPlus.
>>>
>>> I think they are bunch of things that can be improved:
>>>
>>> - for now every intermediate node of "Compact" contains just a char,
>so
>>> this will end up generating a parser that has branch `char 'f' >>
>char 'o'
>>> >> 'o'` in it
>>>     instead of the slightly more efficient version `string "foo"`
>>> - we could further attach a weight to each string, this allows
>>> rearranging (<|>) chains so that a more frequent string parser
>appears
>>> earlier.
>>> - if the input [(s1,v1),(s2,v2)...] are all known constants at
>compile
>>> time, probably TemplateHaskell can be used for generating the parser
>at
>>> compile time (as a Haskell expression).
>>>     I'm not sure how difficult this would be though, as I have very
>>> limited understanding of TH.
>>>
>>> but before I explore further, I want to collect some comments as I
>said
>>> in the beginning.
>>>
>>> Best,
>>> --
>>> Javran (Fang) Cheng
>>>
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>
>
>
>-- 
>Javran (Fang) Cheng

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170309/7a87a76f/attachment.html>