[Haskell-cafe] Parsing a bunch of strings without backtracking
Javran Cheng
javran.c at gmail.com
Thu Mar 9 00:21:09 UTC 2017
Hi all,
I come across an idea about parsing a bunch of strings,
just want to share this idea and looking for some comments and maybe
pointing me to some related articles.
The idea is when you have:
1 <$ string "aabb" <|> 2 <$ string "aacc"
you can rewrite it as:
string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
so that you don't have to deal with the common prefix "aa" twice.
Then I further extend this idea: if we have a set of strings (s1, s2, ...),
whose element is not a prefix of any other set member,
why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
into a "tree-structure" so we don't need to deal with
any common prefix more than once?
So this motivates my little toy implementation at
https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813
basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser
equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
but perhaps with less backtracking. My code does this by turning the set of
strings into a tree structure (see function "compact") and
using this "Compact" representation as a driver to generate the parser.
I used ReadP to have a concrete parser for testing, but this should work
with any parsing library supporting MonadPlus.
I think they are bunch of things that can be improved:
- for now every intermediate node of "Compact" contains just a char, so
this will end up generating a parser that has branch `char 'f' >> char 'o'
>> 'o'` in it
instead of the slightly more efficient version `string "foo"`
- we could further attach a weight to each string, this allows rearranging
(<|>) chains so that a more frequent string parser appears earlier.
- if the input [(s1,v1),(s2,v2)...] are all known constants at compile
time, probably TemplateHaskell can be used for generating the parser at
compile time (as a Haskell expression).
I'm not sure how difficult this would be though, as I have very limited
understanding of TH.
but before I explore further, I want to collect some comments as I said in
the beginning.
Best,
--
Javran (Fang) Cheng
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170308/c8606a84/attachment.html>
More information about the Haskell-Cafe
mailing list