[Haskell-cafe] Wikipedia archiving bot - code review
Donald Bruce Stewart
dons at cse.unsw.edu.au
Mon Jun 25 23:18:04 EDT 2007
gwern0:
> Hey everyone. So I've been learning Haskell for a while now, and I've
> found the best way to move from theory to practice is to just write
> something useful for yourself. Now, I'm keen on editing Wikipedia and
> I've long wanted some way to stop links to external websites from
> breaking on me. So I wrote this little program using the TagSoup
> library which will download Wikipedia articles, parse out external
> links, and then ask WebCite to archive them.
>
> But there's a problem: no matter how I look at it, it's just way too
> slow. Running on a measly 100 articles at a time, it'll eat up to half
> my processor time and RAM (according to top). I converted it over to
> ByteStrings since that's supposed to be a lot better than regular
> Strings, but that didn't seem to help much. So I'm curious: in what
> way could this code be better? How could it be more idiomatic or
> shorter? Particularly, how could it be more efficient either in space
> or time? Any comments are appreciated.
>
> {- Module : Main.hs
> License : public domain
> Maintainer : Gwern Branwen <gwern0 at gmail.com>
> Stability : unstable
> Portability : portable
> Functionality: retrieve specified articles from Wikipedia and request WebCite to archive all URLs found.
> TODO: send an equivalent request to the Internet Archive.
> Not in any way rate-limited.
> BUGS: Issues redundant archive requests.
> Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the documentation right, this barfs
> on the full UTF-8 character set, but Wikipedia definitely exercises the full UTF-8 set.
> USE: Supply on stdin a succession of Wikipedia article names (whitespace in names should be escaped as '_').
> A valid invocation might be, say: '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
> All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would then be backed up.
> If you wanted to run this on all of Wikipedia, you could take the current 'all-titles-in-ns0'
> gzipped file from [[WP:DUMP]], gunzip it, and then pipe it into archive-bot. -}
>
> module Main where
> import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
> import Text.HTML.Download (openURL)
> import Data.List (isPrefixOf)
> import Control.Monad (liftM)
> import Data.Set (toList, fromList)
> import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, lines, unlines, pack, unpack, words)
>
> main :: IO ()
> main = do mapM_ archiveURL =<< (liftM sortNub $ mapM fetchArticleText =<< (liftM B.words B.getContents))
>   where sortNub :: [[B.ByteString]] -> [B.ByteString]
>         sortNub = toList . fromList . concat
>
> fetchArticleText :: B.ByteString -> IO [B.ByteString]
> fetchArticleText article = liftM (B.lines . extractURLs) (openURL (wikipedia ++ B.unpack article))
>   where wikipedia = "http://en.wikipedia.org/wiki/"
>
> extractURLs :: String -> B.ByteString
> extractURLs arg = B.unlines $ map B.pack ([x | TagOpen "a" atts <- (parseTags arg), (_,x) <- atts, "http://" `isPrefixOf` x])
>
> archiveURL :: B.ByteString -> IO String
> archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ B.unpack url ++ emailAddress)
>   where emailAddress = "&email=gwern0@gmail.com"
>
You don't seem to be using bytestrings for anything important here --
you just pass them in and immediately unpack them back to String anyway,
since tagsoup only downloads String, and parses String.
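For comparison, here's a rough, untested sketch of the same program kept
in String end to end (same TagSoup functions, URLs and names as in your
post) -- not necessarily faster, but it drops all the pack/unpack round
trips:

    module Main where

    import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
    import Text.HTML.Download (openURL)
    import Data.List (isPrefixOf)
    import Data.Set (toList, fromList)

    main :: IO ()
    main = do articles <- fmap words getContents
              -- dedupe the collected links via Data.Set, as before
              urls <- fmap (toList . fromList . concat) (mapM fetchArticleURLs articles)
              mapM_ archiveURL urls

    -- download one article and pull out its external links
    fetchArticleURLs :: String -> IO [String]
    fetchArticleURLs article =
        fmap extractURLs (openURL ("http://en.wikipedia.org/wiki/" ++ article))

    extractURLs :: String -> [String]
    extractURLs page =
        [x | TagOpen "a" atts <- parseTags page, (_,x) <- atts, "http://" `isPrefixOf` x]

    -- ask WebCite to archive a single URL
    archiveURL :: String -> IO String
    archiveURL url =
        openURL ("www.webcitation.org/archive?url=" ++ url ++ "&email=gwern0@gmail.com")
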
Probably, as Neil says, TagSoup just isn't optimised much yet. Perhaps
try the bytestring-based urlcheck?
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/urlcheck-0.1
Neil, perhaps tagsoup should provide a bytestring layer at the bottom --
so there's some hope of efficient downloading -- with the String layer on
top, not the other way around?
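Roughly this kind of layering, say (hypothetical names, not the real
tagsoup interface -- just a sketch of the idea):

    import qualified Data.ByteString.Lazy.Char8 as B

    -- hypothetical low-level primitive: fetch the raw bytes of a page
    openURLBytes :: String -> IO B.ByteString
    openURLBytes = error "network I/O would go here"

    -- the existing String-based interface, recovered as a thin wrapper
    openURLString :: String -> IO String
    openURLString url = fmap B.unpack (openURLBytes url)
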
-- Don