[Haskell-cafe] Wikipedia archiving bot - code review

Neil Mitchell ndmitchell at gmail.com
Mon Jun 25 20:48:11 EDT 2007


Hi

You may find that the slowdown is coming from your use of the TagSoup
library - I'm currently reworking the parser to make sure it's fully
lazy and doesn't space leak. I hope that the version in darcs tomorrow
will have all those issues fixed.
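
To give a feel for what full laziness should buy you: once the parser is
properly lazy, pulling just the first few links out of a big page shouldn't
force the whole tag list into memory. A rough, untested sketch, using the
same imports your program already has:

  import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
  import Data.List (isPrefixOf)

  -- take at most n external links; with a lazy parser the rest of the
  -- document should (roughly) never need to be parsed
  firstLinks :: Int -> String -> [String]
  firstLinks n html = take n [ url | TagOpen "a" atts <- parseTags html
                                   , (_, url) <- atts
                                   , "http://" `isPrefixOf` url ]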

Thanks

Neil
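
PS: on the space side of your question, one option is to archive each
article's links as soon as they're fetched, instead of collecting every URL
from every article before doing anything. A rough, untested sketch that
reuses your fetchArticleText and archiveURL, deduplicating with a running
Data.Set rather than one big sortNub at the end:

  import qualified Data.Set as Set
  import qualified Data.ByteString.Lazy.Char8 as B
  import Control.Monad (foldM_, liftM)

  main :: IO ()
  main = do articles <- liftM B.words B.getContents
            foldM_ archiveArticle Set.empty articles
    where
      -- carry the set of already-archived URLs from article to article
      archiveArticle seen article = do
          urls <- fetchArticleText article
          let new = filter (\u -> not (Set.member u seen)) urls
          mapM_ archiveURL new
          return (foldr Set.insert seen new)

That way you never hold every article's link list before archiving starts
(the Set of already-seen URLs still grows over the run, of course).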


On 6/26/07, Gwern Branwen <gwern0 at gmail.com> wrote:
> Hey everyone. So I've been learning Haskell for a while now, and I've found the best way to move from theory to practice is to just write something useful for yourself. Now, I'm keen on editing Wikipedia and I've long wanted some way to stop links to external websites from breaking on me. So I wrote this little program using the TagSoup library which will download Wikipedia articles, parse out external links, and then ask WebCite to archive them.
>
> But there's a problem: no matter how I look at it, it's just way too slow. Running on a measly 100 articles at a time, it'll eat up to half my processor time and RAM (according to top). I converted it over to ByteStrings since that's supposed to be a lot better than regular Strings, but that didn't seem to help much.
> So I'm curious: in what way could this code be better? How could it be more idiomatic or shorter? In particular, how could it be more efficient in either space or time? Any comments are appreciated.
>
> {- Module      :  Main.hs
>    License     :  public domain
>    Maintainer  :  Gwern Branwen <gwern0 at gmail.com>
>    Stability   :  unstable
>    Portability :  portable
>    Functionality: retrieve specified articles from Wikipedia and request WebCite to archive all URLs found.
>    TODO: send an equivalent request to the Internet Archive.
>          Not in any way rate-limited.
>    BUGS: Issues redundant archive requests.
>          Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the documentation right, this barfs
>          on the full UTF-8 character set, but Wikipedia definitely exercises the full UTF-8 set.
>    USE: Supply on stdin a succession of Wikipedia article names (whitespace in names should be escaped as '_').
>         A valid invocation might be, say: '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
>         All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would then be backed up.
>         If you wanted to run this on all of Wikipedia, you could take the current 'all-titles-in-ns0'
>         gzipped file from [[WP:DUMP]], gunzip it, and then pipe it into archive-bot. -}
>
> module Main where
> import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
> import Text.HTML.Download (openURL)
> import Data.List (isPrefixOf)
> import Control.Monad (liftM)
> import Data.Set (toList, fromList)
> import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, lines, unlines, pack, unpack, words)
>
> main :: IO ()
> main = do articles <- liftM B.words B.getContents
>           urls <- liftM sortNub (mapM fetchArticleText articles)
>           mapM_ archiveURL urls
>        where sortNub :: [[B.ByteString]] -> [B.ByteString]
>              sortNub = toList . fromList . concat
>
> fetchArticleText :: B.ByteString -> IO [B.ByteString]
> fetchArticleText article = liftM (B.lines . extractURLs) (openURL(wikipedia ++ B.unpack article))
>                            where wikipedia = "http://en.wikipedia.org/wiki/"
>
> extractURLs :: String -> B.ByteString
> extractURLs arg = B.unlines $ map B.pack ([x | TagOpen "a" atts <- (parseTags arg), (_,x) <- atts, "http://" `isPrefixOf` x])
>
> archiveURL :: B.ByteString -> IO String
> archiveURL url = openURL ("http://www.webcitation.org/archive?url=" ++ B.unpack url ++ emailAddress)
>                  where emailAddress = "&email=gwern0 at gmail.com"
>
> --
> gwern
> MAC10 M3 L34A1 Walther MPL AKS-74 HK-GR6 subsonic rounds ballistic media special
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
>

