[Haskell-cafe] haskell wiki indexing

Claus Reinke claus.reinke at talk21.com
Tue May 22 09:40:33 EDT 2007


>> as was pointed out on the programming reddit [1], crawling of the
>> haskell wiki is forbidden, since http://www.haskell.org/robots.txt contains
>>
>> User-agent: *
>> Disallow: /haskellwiki/

i agree that having the wiki searchable would be preferable,
but i was told that there were performance issues. even giving
Googlebot a wider range than other spiders won't help if, as
the irc page suggests, some of those faulty bots pretend to be
Googlebot..
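
to illustrate why a wider range for Googlebot would not keep out bots that borrow its name, here is a minimal sketch using python's standard urllib.robotparser. the "User-agent: Googlebot" group is an assumed example of such a wider range, not what haskell.org actually serves, and "SomeOtherBot" is a made-up name. robots.txt groups are selected purely by the name a crawler declares for itself, so a bot that calls itself Googlebot is matched against the Googlebot group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "*" group is the one quoted above; the
# Googlebot group is an assumed example of giving Googlebot a wider range.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /haskellwiki/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if robots.txt lets `user_agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    wiki_page = "/haskellwiki/Haskell"  # hypothetical wiki path
    # The name is self-declared, so a faulty bot can simply claim it.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, allowed(agent, wiki_page))
```

robots.txt is also only advisory: a faulty bot may ignore it altogether, so any real protection would have to happen on the server side.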

> This also applies to Haskell mailing lists as I mentioned recently:
> http://www.haskell.org/pipermail/haskell-cafe/2007-April/025006.html

ah, yes, sorry. there was an ongoing offlist discussion at the
time, following an earlier thread on ghc-users. Simon M has
since changed robots.txt to the above, which *does* permit
indexing of the pipermail archives, as long as google can find
them. that still doesn't mean that they'll show up first in google's
ranking system. for instance, if you google for

    'ghc manuals online'

(that's the subject for that earlier thread i mentioned), you'll
get mail-archive and nabble first, but the haskell.org archives
are there as well now, as you can see by googling for

    'ghc manuals online inurl:pipermail'

also, the standard test of googling for 'site:haskell.org' looks
a lot healthier these days. and googling for

    'inurl:ghc/docs/latest LANGUAGE pragma'

gives me two relevant answers (not the most specific sub-page).
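
the claim that the quoted rules block the wiki but leave the pipermail archives and online docs crawlable can be checked mechanically. a small sketch, again with python's standard urllib.robotparser; it assumes the two quoted lines are the whole file, the wiki and docs urls are illustrative guesses, and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted at the top of this message, assumed to be the
# complete robots.txt for the purpose of this sketch.
QUOTED_RULES = """\
User-agent: *
Disallow: /haskellwiki/
"""

# The archive URL is the one cited in this thread; the other two are
# illustrative guesses at a wiki page and the online docs.
archive_url = "http://www.haskell.org/pipermail/haskell-cafe/2007-April/025006.html"
wiki_url = "http://www.haskell.org/haskellwiki/Haskell"
docs_url = "http://www.haskell.org/ghc/docs/latest/"


def may_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the quoted rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(QUOTED_RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (archive_url, wiki_url, docs_url):
        print(may_crawl(url), url)
```

being permitted by robots.txt only means a crawler may fetch the page; whether and where it shows up in the ranking is a separate matter.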

so the situation for mailing lists and online docs seems to have
improved, but there is still the wiki indexing/rogue bot issue,
and lots of fine tuning left to do (together with watching the logs
to spot any issues arising out of relaxing those restrictions). perhaps
someone on this list would be willing to volunteer to look into
those robots/indexing issues on haskell.org?-)

claus



