[Haskell-cafe] Downloading Haskell repos from GitHub

Gwern Branwen gwern0 at gmail.com
Sun Mar 20 18:21:09 CET 2011

On Fri, Apr 30, 2010 at 12:02 PM, Gwern Branwen <gwern0 at gmail.com> wrote:
> On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen
> <jesper.louis.andersen at gmail.com> wrote:
>> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen <gwern0 at gmail.com> wrote:
>>> Nothing in http://develop.github.com/ seems especially useful for
>>> grabbing the git:// URLs of all repos by language - just by user.
>>> The only real list of repos by language seems to be gotten at via
>>> http://github.com/languages/Haskell/updated or
>>> http://github.com/languages/Haskell/created . (You might think
>>> http://github.com/languages/Haskell would be good, but no, it's just a
>>> few random repos by interest and not a full listing.)
>> Github has a REST API for accessing data. Unfortunately it can't give
>> you the wanted
>> breakdown, but I would ask them for it. It is much simpler for you,
> You mean ask for a new feature? (Just a one-time list is no good since
> I intend to repeat it regularly to pick up new repos, just like with
> patch-tag.)
>> and it does not put an extra strain on their servers due to the
>> scraping.
> Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
> downloading of the repos would probably reduce that demand to
> insignificance, especially the first time around when most of the
> repos would need to be downloaded.
>> Usually, the github guys are helpful when you have a
>> question.

Ultimately, they never did anything about it:

So I wrote a TagSoup scraper; then I wrote a long tutorial explaining
how I wrote it, step by step.

1. my tutorial: http://www.gwern.net/haskell/Archiving%20GitHub.html
2. the script itself:
3. Reddit submission of #1 for those who prefer to comment there:

(While writing the tutorial, I tweaked the script code, so I'm not
100% confident that it still works; a full run uses too much GitHub
bandwidth (and local disk space) for me to repeat it just to check.
So if anyone does run it, I would appreciate hearing whether it
succeeds.)
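
For readers who want the general flavor of the approach without opening
the tutorial, here is a minimal sketch of pulling repo links out of a
listing page with TagSoup. It is not the script mentioned above. The
helper names (hrefs, isRepoPath, toGitUrl, repoUrls), the assumption
that repo links look like "/user/repo", the exact clone URL shape, and
the sample repo "gwern/example" are all hypothetical and only for
illustration. Fetching the pages and running git clone are left out.

```haskell
import Data.List (nub)
import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- | Collect the href attribute of every <a> tag in an HTML page.
--   Tags without an href yield "" and are filtered out later.
hrefs :: String -> [String]
hrefs html = [ fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t ]

-- | Heuristic (an assumption, not GitHub's documented layout):
--   a repo link is a path with exactly two non-empty segments, "/user/repo".
--   Other two-segment site links would also match and need extra filtering.
isRepoPath :: String -> Bool
isRepoPath ('/':rest) =
  case break (== '/') rest of
    (user, '/':repo) -> not (null user) && not (null repo) && '/' `notElem` repo
    _                -> False
isRepoPath _ = False

-- | Turn "/user/repo" into a git:// clone URL (assumed URL shape).
toGitUrl :: String -> String
toGitUrl path = "git://github.com" ++ path ++ ".git"

-- | All distinct candidate clone URLs found in one page of HTML.
repoUrls :: String -> [String]
repoUrls = map toGitUrl . nub . filter isRepoPath . hrefs

main :: IO ()
main = mapM_ putStrLn (repoUrls sample)
  where
    -- Hypothetical page fragment: one repo link, one navigation link.
    sample = "<a href=\"/gwern/example\">x</a>"
          ++ "<a href=\"/languages/Haskell/updated\">y</a>"
```

Keeping the extraction pure (String in, list of URLs out) means it can
be checked against a saved page without touching GitHub at all, which
matters given the bandwidth concern above.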


