[Haskell-cafe] Parsing a bunch of strings without backtracking

Thu Mar 9 02:41:30 UTC 2017

Hi Jaro, thanks for the link, having worked with few parsing libs but
uu-parsinglib sounds new to me and I'm definitely thinking about having a
try.
I'm taking a look but haven't yet found any part of code that attempts to
factor common prefixes out.

On Wed, Mar 8, 2017 at 7:48 PM, Jaro <noughtmare at openmailbox.org> wrote:

> Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/
> foswiki/HUT/ParserCombinators)? It implements a non-backtracking parser
> that works similarly to what you describe.
>
>
> On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng <javran.c at gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I come across an idea about parsing a bunch of strings,
>> just want to share this idea and looking for some comments and maybe
>> pointing me to some related articles.
>>
>> The idea is when you have:
>>
>> 1 <$ string "aabb" <|> 2 <$ string "aacc"
>>
>> you can rewrite it as:
>>
>> string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
>>
>> so that you don't have to deal with the common prefix "aa" twice.
>>
>> Then I further extend this idea: if we have a set of strings (s1, s2,
>> ...), whose element is not a prefix of any other set member,
>> why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|>
>> ...) into a "tree-structure" so we don't need to deal with
>> any common prefix more than once?
>>
>> So this motivates my little toy implementation at
>> https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813
>>
>> basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser
>> equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
>> but perhaps with less backtracking. My code does this by turning the set
>> of strings into a tree structure (see function "compact") and
>> using this "Compact" representation as a driver to generate the parser.
>> I used ReadP to have a concrete parser for testing, but this should work
>> with any parsing library supporting MonadPlus.
>>
>> I think they are bunch of things that can be improved:
>>
>> - for now every intermediate node of "Compact" contains just a char, so
>> this will end up generating a parser that has branch `char 'f' >> char 'o'
>> >> 'o'` in it
>>     instead of the slightly more efficient version `string "foo"`
>> - we could further attach a weight to each string, this allows
>> rearranging (<|>) chains so that a more frequent string parser appears
>> earlier.
>> - if the input [(s1,v1),(s2,v2)...] are all known constants at compile
>> time, probably TemplateHaskell can be used for generating the parser at
>> compile time (as a Haskell expression).
>>     I'm not sure how difficult this would be though, as I have very
>> limited understanding of TH.
>>
>> but before I explore further, I want to collect some comments as I said
>> in the beginning.
>>
>> Best,
>> --
>> Javran (Fang) Cheng
>>
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>

-- 
Javran (Fang) Cheng
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170308/ae7cf1ff/attachment.html>