[Haskell-cafe] Troubles understanding Parsec Error Handling

Wed May 30 21:36:13 CEST 2012

I recently started writing my first application at work in Haskell and it
deals with a lot of parsing.
Among other things I often have to check for a lot of alternatives for
fixed strings (parsing natural
language text and people have a lot of ways to abbreviate the same thing in
labels). So far I have been
doing this basically via

choice $ map (try . string) [ "foo", "bar", ... ]

This works fine but has two disadvantages, it isn't very fast, in
particular when many of the strings
start with the same prefix and it also is a bit error prone since it breaks
when you place a prefix of
another string earlier in the list.

My attempt at a solution was to use the bytestring-trie package for a
little utility function that basically
parses one character at a time, checks if the string parsed so far is in
the trie and then calls itself recursively
with the trie starting with that string. My attempt at that so far looks
like this:

(dependencies bytestring-trie, utf8-string and parsec 3)

import qualified Data.ByteString.UTF8 as UTF8
import qualified Data.Trie as Trie
import Text.Parsec
import Text.Parsec.Text (GenParser)

anyOf :: [String] -> GenParser u String
anyOf l =
  try $ anyOf' t ""
    where t = Trie.fromList $ zip (map UTF8.fromString l) (repeat ())
          anyOf' :: Trie.Trie () -> String -> GenParser u String
          anyOf' t s = try $ do
              c <- lookAhead $ anyChar
              let newS = s ++ [ c ] in
                case Trie.submap (UTF8.fromString newS) t of
                  emptyT | Trie.null emptyT ->
                    case Trie.member (UTF8.fromString s) t of
                      True ->
                        return s
                      False ->
                        unexpected $ show newS++", expecting one of "++show
l
                  restT -> do
                    _ <- anyChar
                    try $ anyOf' restT newS

A successful example usage would be:

parseTest (do; r1 <- anyOf ["Hello", "Hallo", "Foo", "HallofFame"]; r2 <-
string "bla"; return (r1, r2)) "Hallobla"

which results in

("Hallo","bla")

(the extra string parser is there so errors in parsing too much are not
hidden). An error would result .e.g. from

parseTest (do; r1 <- anyOf ["Hello", "Hallo", "Foo", "HallofFame"]; r2 <-
string "bla"; return (r1, r2)) "Hallofbla"

which prints this:

parse error at (line 1, column 8):unknown parse error

And my question about this is made up of two parts

1. Why doesn't it print my "unexpected" message but instead says unknown
parse error
2. Why is the location in the text off (I would expect it to fail at column
6 (first character beyond the result it could return) or 7 (first character
that makes the string no prefix of any acceptable string)

I am afraid my knowledge of Parsec internals is a bit too limited, some
Google queries showed no similar problems and no obvious places in the
Parsec source code to check for the answer to the first question in
particular and I suspect the second is closely related to the first.

Thanks for reading through my question and I hope someone knows the answer
or at least some clues on where i might find it.

Matthias Hoermann

P.S.: I am hoping this time this works, last time it was rejected because
google sends with @googlemail.com instead of @gmail.com for some reason.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120530/e9217b73/attachment.htm>