[Haskell-cafe] Re: haskell wiki indexing

Simon Marlow simonmarhaskell at gmail.com
Fri Jun 8 05:35:34 EDT 2007


Jason Dagit wrote:
> On 5/22/07, Robin Green <greenrd at greenrd.org> wrote:
>> On Tue, 22 May 2007 15:05:48 +0100
>> Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
>>
>> > On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
>> >
>> > > So the situation for mailing lists and online docs seems to have
>> > > improved, but there is still the wiki indexing/rogue bot issue,
>> > > and lots of fine tuning to do (together with watching the logs to
>> > > spot any issues arising out of relaxing those restrictions).
>> > > Perhaps someone on this list would be willing to volunteer to look
>> > > into those robots/indexing issues on haskell.org?-)
>> >
>> > The main problem, and the reason for the original (temporary!)
>> > measure, was bots indexing all possible diffs between old versions
>> > of wiki pages, via URLs like:
>> >
>> > http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607
>> >
>> > For pages with long histories this O(n^2) number of requests gets
>> > quite large, and the wiki engine does not seem well optimised for
>> > generating arbitrary diffs. So we ended up with bots holding open
>> > many HTTP server connections. They were not actually causing much
>> > server CPU load or generating much traffic, but once the number of
>> > nearly-hung connections reached the HTTP child-process limit we were
>> > effectively in a DoS situation.
>> >
>> > So if we can ban bots from the page histories, or turn those pages
>> > off for the bot user agents, or something along those lines, then we
>> > might have a cure. Perhaps we just need to upgrade our MediaWiki
>> > software, or find out how other sites running it deal with the same
>> > issue of bots reading page histories.
>>
>> http://en.wikipedia.org/robots.txt
>>
>> Wikipedia uses URLs starting with /w/ for "dynamic" pages (well, all
>> pages are dynamic in a sense, but you know what I mean, I hope), and
>> then lists /w/ as disallowed in its robots.txt.
> 
> Does anyone know the status of applying a workaround such as this?  I
> really miss being able to find things on the Haskell wiki via a Google
> search.  I don't like the MediaWiki search at all.

The status is that nobody has stepped up and volunteered to look after
haskell.org's robots.txt file.  It needs someone with the time and
experience to look into what needs doing, make the changes, fix problems
as they arise, and update the file as necessary in the future.  Anyone?
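
For whoever does take it on: a rough first cut, along the lines Robin
describes, might look something like the sketch below.  This is untested
and rests on two assumptions: that our MediaWiki serves plain article
pages as /haskellwiki/PageName while histories, diffs, old revisions and
edit forms all go through index.php or /haskellwiki/?title=... query
URLs, and that the major crawlers match Disallow rules as a literal
prefix against the path plus query string.

  # Sketch for http://haskell.org/robots.txt -- not deployed, untested
  User-agent: *
  # Keep crawlers away from the script entry points that serve diffs,
  # old revisions, page histories and edit forms (query-string URLs).
  Disallow: /haskellwiki/index.php
  Disallow: /haskellwiki/?
  # Plain article URLs such as /haskellwiki/Quicksort do not match
  # either prefix above, so they remain indexable.

That ought to keep crawlers out of the O(n^2) diff/history space while
leaving the current article pages visible to Google, but whoever picks
this up would still need to check it against the URL rewriting actually
in place on haskell.org, and then watch the logs afterwards, as Claus
suggests.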

Cheers,
	Simon

