[Haskell-cafe] ANN: wp-archivebot 0.1 - archive Wikipedia's external links in WebCite

Gwern Branwen gwern0 at gmail.com
Thu Jun 4 12:52:58 EDT 2009


I'd like to announce wp-archivebot.

# What

wp-archivebot is a small script which follows every link in an RSS
feed, combs each destination page for http:// links, and submits
those links to WebCite.
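
The "combs the destination for http:// links" step can be sketched in a
few lines. This is only an illustration using base; `extractLinks` and
its choice of delimiter characters are assumptions made for this
example, not the code in the package, and a real bot would more likely
use an HTML parser.

```haskell
import Data.List (isPrefixOf, tails)

-- | Collect every substring that starts with "http://" and runs up to
-- the first character that commonly ends a URL in HTML source
-- (a quote, an angle bracket, or whitespace).
-- Hypothetical helper for illustration only.
extractLinks :: String -> [String]
extractLinks page =
  [ takeWhile (not . isDelimiter) rest
  | rest <- tails page
  , "http://" `isPrefixOf` rest
  ]
  where
    isDelimiter c = c `elem` "\"'<> \t\r\n"
```

Loaded in GHCi, `extractLinks` can be applied to the body of any
fetched page; each result is a candidate for submission.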

WebCite <https://secure.wikimedia.org/wikipedia/en/wiki/WebCite> is an
organization much like the more famous Internet Archive. Unlike the
Wayback Machine, however, WebCite will archive pages on demand.*

# Why

On-demand archiving is good, since link-rot and 404 errors are a fact
of life on Wikipedia. Links go stale, fall dead, get banned, edited,
censored, etc. If such a link is the reference for some important fact
or detail, then that fact can no longer be verified. Even the
hit-or-miss Internet Archive has proven to be very useful for
editors**, so a more reliable way of archiving links would be even
better.

# Limitations

The WebCite FAQ <http://webcitation.org/faq> mentions that a good
project would be to

> develop a wikipedia bot which scans new wikipedia articles for cited URLs, submits an archiving request to WebCite®, and then adds a link to the archived URL behind the cited URL

Adding a link would be quite difficult and would also require
community approval; further, although I have thought about this for
years, there's no obvious good way to add a link. Any method is
visually awkward, possibly otiose (if [[Google]] links to google.com
as the homepage in its infobox, there's no purpose in having an
archived version of google.com!), and certain to bloat the markup -
and that is assuming there is any way to insert links without
bolloxing templates and other such constructs.

So I'm satisfied to just archive the link. WebCite is searchable,
after all. If enough people run bots like this and achieve enough
coverage, then perhaps editors can be educated to always check in
WebCite as well.

# Download & Install

As ever, wp-archivebot is Free and is available from Hackage at:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/wp-archivebot

You can install with ease by a simple 'cabal install wp-archivebot',
or download the tarball and compile it yourself with the usual
'runhaskell Setup configure && runhaskell Setup build && runhaskell
Setup install' dance.

# Usage

wp-archivebot takes one mandatory argument, an email address; WebCite
needs to have somewhere to send notices of archival success/failure.

wp-archivebot takes a second, optional, argument: the RSS feed to
follow. It defaults to Special:NewPages on the English Wikipedia, but
one could just as well follow, say, RecentChanges. Here's an example:

>  wp-archivebot gwern0@gmail.com 'http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss'

(This sets my email address as the recipient, and follows
RecentChanges. This may not be a good idea as RecentChanges is *much*
busier than NewPages.)
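
The argument handling described above (a mandatory email address, then
an optional feed) might look roughly like the sketch below. `Config`,
`parseArgs`, and the usage message are hypothetical names for this
example, and the exact default feed URL is an assumption modelled on
the RecentChanges URL shown earlier; the package itself may differ.

```haskell
-- | Settings taken from the command line. A hypothetical record for
-- illustration, not a type from the package.
data Config = Config
  { notifyEmail :: String  -- where WebCite sends success/failure notices
  , feedUrl     :: String  -- RSS feed whose items are followed
  } deriving (Eq, Show)

-- | Assumed default: the RSS view of Special:NewPages on the English
-- Wikipedia. The exact URL is a guess, not taken from the source.
defaultFeed :: String
defaultFeed =
  "http://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"

-- | One argument is the email; a second, if present, overrides the feed.
parseArgs :: [String] -> Either String Config
parseArgs [email]       = Right (Config email defaultFeed)
parseArgs [email, feed] = Right (Config email feed)
parseArgs _             = Left "usage: wp-archivebot EMAIL [RSS-FEED-URL]"
```

A `main` would pass the result of `System.Environment.getArgs` to
`parseArgs` and exit with the usage message on `Left`.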

## Example

Here's an example session's output:

[12:35 PM] 829Mb$ wp-archivebot gwern0@gmail.com
"http://www.webcitation.org/archive?url=http://en.wikisource.org/wiki/Berkeley,_George,_first_earl_of_Berkeley_(DNB00)&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/players/u/uhaltfr01.shtml&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/players/u/uhaltfr01.shtml&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/minors/player.cgi?id=uhalt-001ber&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/minors/player.cgi?id=uhalt-001ber&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.erniestires.net/&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.erniestires.net/&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.leighrayment.com/commons/Acommons3.htm&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.leighrayment.com/commons/Acommons3.htm&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.esec.edu&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.esec.edu&email=gwern0@gmail.com"
...
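
Every line of the output above has the same shape: the WebCite archive
endpoint, a `url` parameter, and an `email` parameter. A minimal sketch
of building such a request URL follows. `archiveUrl` and
`archiveRequests` are hypothetical names; the target is spliced in
unescaped only because that is how the output shows it, and the
de-duplication step is an optional extra that the bot is not claimed
to perform.

```haskell
import Data.List (nub)

-- | Build a request URL in the shape shown in the session output.
-- The target URL is inserted as-is, without percent-encoding.
archiveUrl :: String -> String -> String
archiveUrl email target =
  "http://www.webcitation.org/archive?url=" ++ target ++ "&email=" ++ email

-- | Hypothetical helper: drop repeated targets, then build one request
-- URL per remaining target.
archiveRequests :: String -> [String] -> [String]
archiveRequests email = map (archiveUrl email) . nub
```

Whether to percent-encode the target (it may itself contain `?`, `&`,
or `#`) is a design choice this sketch leaves open.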

# Related

The development version (HEAD) of the Gitit wiki has plugin support;
one of those plugins, WebArchiver.hs, will on every page-save comb
through for off-wiki links and submit them to WebCite in the same way
as this bot. It's nice to know that if those links ever disappear, you
can retrieve them from WebCite and 'see' the revision with the same
set of external links as when the revision was created.

* Technically, the Internet Archive will archive on demand as well -
but you need to pay them.
** In many more ways than one might expect. For example, not
infrequently someone will visit an article and claim it is
plagiarizing some other webpage. With the IA, it's easy to go back to
the first version of that webpage and crosscheck against the article's
history - quite often it is the other website that plagiarized us!

-- 
gwern

