[Haskell-cafe] Wikipedia archiving bot - code review
Donald Bruce Stewart
dons at cse.unsw.edu.au
Mon Jun 25 23:18:04 EDT 2007
gwern0:
> Hey everyone. So I've been learning Haskell for a while now, and I've
> found the best way to move from theory to practice is to just write
> something useful for yourself. Now, I'm keen on editing Wikipedia and
> I've long wanted some way to stop links to external websites from
> breaking on me. So I wrote this little program using the TagSoup
> library which will download Wikipedia articles, parse out external
> links, and then ask WebCite to archive them.
>
> But there's a problem: no matter how I look at it, it's just way too
> slow. Running on a measly 100 articles at a time, it'll eat up to half
> my processor time and RAM (according to top). I converted it over to
> ByteStrings since that's supposed to be a lot better than regular
> Strings, but that didn't seem to help much. So I'm curious: in what
> way could this code be better? How could it be more idiomatic or
> shorter? Particularly, how could it be more efficient either in space
> or time? Any comments are appreciated.
>
> {- Module : Main.hs
> License : public domain
> Maintainer : Gwern Branwen <gwern0 at gmail.com>
> Stability : unstable
> Portability : portable
> Functionality: retrieve specified articles from Wikipedia and request WebCite to archive all URLs found.
> TODO: send an equivalent request to the Internet Archive.
> Not in any way rate-limited.
> BUGS: Issues redundant archive requests.
> Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the documentation right, this barfs
> on the full UTF-8 character set, but Wikipedia definitely exercises the full UTF-8 set.
> USE: Supply on stdin a succession of Wikipedia article names (whitespace in names should be escaped as '_').
> A valid invocation might be, say: '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
> All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would then be backed up.
> If you wanted to run this on all of Wikipedia, you could take the current 'all-titles-in-ns0'
> gzipped file from [[WP:DUMP]], gunzip it, and then pipe it into archive-bot. -}
>
> module Main where
> import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
> import Text.HTML.Download (openURL)
> import Data.List (isPrefixOf)
> import Control.Monad (liftM)
> import Data.Set (toList, fromList)
> import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, lines, unlines, pack, unpack, words)
>
> main :: IO ()
> main = do mapM_ archiveURL =<< (liftM sortNub $ mapM fetchArticleText =<< (liftM B.words B.getContents))
>   where sortNub :: [[B.ByteString]] -> [B.ByteString]
>         sortNub = toList . fromList . concat
>
> fetchArticleText :: B.ByteString -> IO [B.ByteString]
> fetchArticleText article = liftM (B.lines . extractURLs) (openURL (wikipedia ++ B.unpack article))
>   where wikipedia = "http://en.wikipedia.org/wiki/"
>
> extractURLs :: String -> B.ByteString
> extractURLs arg = B.unlines $ map B.pack ([x | TagOpen "a" atts <- (parseTags arg), (_,x) <- atts, "http://" `isPrefixOf` x])
>
> archiveURL :: B.ByteString -> IO String
> archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ B.unpack url ++ emailAddress)
>   where emailAddress = "&email=gwern0@gmail.com"
>
You don't seem to be using bytestrings for anything important here --
you just pass them in and immediately unpack them back to String anyway,
since tagsoup only downloads String, and parses String.
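For comparison, here's a rough, untested sketch of the same program kept
in String end to end (same TagSoup functions, URLs and names as in your
post) -- not necessarily faster, but it drops all the pack/unpack round
trips:

    module Main where

    import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
    import Text.HTML.Download (openURL)
    import Data.List (isPrefixOf)
    import Data.Set (toList, fromList)

    main :: IO ()
    main = do articles <- fmap words getContents
              -- dedupe the collected links via Data.Set, as before
              urls <- fmap (toList . fromList . concat) (mapM fetchArticleURLs articles)
              mapM_ archiveURL urls

    -- download one article and pull out its external links
    fetchArticleURLs :: String -> IO [String]
    fetchArticleURLs article =
        fmap extractURLs (openURL ("http://en.wikipedia.org/wiki/" ++ article))

    extractURLs :: String -> [String]
    extractURLs page =
        [x | TagOpen "a" atts <- parseTags page, (_,x) <- atts, "http://" `isPrefixOf` x]

    -- ask WebCite to archive a single URL
    archiveURL :: String -> IO String
    archiveURL url =
        openURL ("www.webcitation.org/archive?url=" ++ url ++ "&email=gwern0@gmail.com")
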
Probably, as Neil says, TagSoup just isn't optimised much yet. Perhaps
try the bytestring-based urlcheck?
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/urlcheck-0.1
Neil, perhaps tagsoup should provide a bytestring layer at the bottom --
so there's some hope of efficient downloading -- with the String layer on
top, not the other way around?
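Roughly this kind of layering, say (hypothetical names, not the real
tagsoup interface -- just a sketch of the idea):

    import qualified Data.ByteString.Lazy.Char8 as B

    -- hypothetical low-level primitive: fetch the raw bytes of a page
    openURLBytes :: String -> IO B.ByteString
    openURLBytes = error "network I/O would go here"

    -- the existing String-based interface, recovered as a thin wrapper
    openURLString :: String -> IO String
    openURLString url = fmap B.unpack (openURLBytes url)
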
-- Don