[Haskell-cafe] Problems with attoparsec (or maybe URI.ByteString)

Wed Feb 24 19:11:14 UTC 2016

What I'm trying to do is write a function with the signature:

    data Urls =
        Txt Text.Text
        | Url Text.Text URI.URI
        deriving (Show)

    parseUrls :: Text.Text -> Either String [ Urls ]
    parseUrls text = ...

Given a text block, it finds all the URLs, and breaks things into either
URLs, or blocks of text which are not URLs.  The full text is attached, for
those who are interested.  But the problem I'm hitting is using the
Attoparsec parser URI.ByteString exports.  When I do:

*Base.DetectURL AP> AP.parseOnly (URI.uriParser URI.laxURIParserOptions) "
http://foo/bar"
Right (...)

So, that works.  But when I add a single space on the end of the string:

*Base.DetectURL AP> AP.parseOnly (URI.uriParser URI.laxURIParserOptions) "
http://foo/bar "
Left "Failed reading: MalformedPath"

It fails.  Note that this isn't a problem with parseOnly- the real code
looks like:

    parseAllUris :: AP.Parser (Bldr.Builder, [ Urls ])
    parseAllUris = msum [ aUri, noUri, finished ]
        where
            finished = return (mempty, [])
            aUri = do
                (txt, url) <- AP.match $
                                    URI.uriParser URI.laxURIParserOptions
                (bldr, us) <- msum [ noUri, finished ]
                return $ (mempty, (Url (E.decodeUtf8 txt) url
                                        : prependText bldr us))
            noUri = do
                c <- AP.anyChar
                (bldr, us) <- parseAllUris
                return $ ((Bldr.charUtf8 c) `mappend` bldr, us)

And this has the problem as well- parsing a URL with anything following it
fails, and it doesn't detect any URLs.  The parseOnly is just the easy way
to demonstrate it.

So, my question is, is there some way in attoparsec to tell it to just
parse as much as makes sense, and leave the rest?  Alternatively, is this a
problem with the way URI.ByteString module constructed it's parser, and a
different parser could work?  Or, worst of all, is this a problem with the
way that URIs are defined and no conforming parser will work?

Thanks.

Brian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20160224/e201418a/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DetectURL.hs
Type: text/x-haskell
Size: 1901 bytes
Desc: not available
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20160224/e201418a/attachment.hs>