[Haskell-cafe] Re: String vs ByteString

Wed Aug 18 11:29:02 EDT 2010

On Wed, Aug 18, 2010 at 4:12 AM, wren ng thornton <wren at freegeek.org> wrote:

> There was a study recently on this. They found that there are four main
> parts of the Internet:
>
> * a densely connected core, where from any site you can get to any other
> * an "in cone", from which you can reach the core (but not other in-cone
> members, since then you'd both be in the core)
> * an "out cone", which can be reached from the core (but which cannot reach
> each other)
> * and, unconnected islands
>
> The surprising part is they found that all four parts are approximately the
> same size. I forget the exact numbers, but they're all 25+/-5%.
>
> This implies that an exhaustive crawl of the web would require having about
> 50% of all websites as seeds (the in-cone plus the islands). If we're only
> interested in a representative sample, then we could get by with fewer.
> However, that depends a lot on the definition of "representative". And we
> can't have an accurate definition of representative without doing the entire
> crawl at some point in order to discover the appropriate distributions. Then
> again, distributions change over time...
>
> Thus, I would guess that Google only has 50~75% of the net: the core, the
> out-cone, and a fraction of the islands and in-cone.
>

That's an interesting result.
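
To make the seed argument concrete, here is a minimal sketch of a crawl over a toy graph shaped like the four parts described above. The page names and the link structure are invented for illustration and are not taken from the study; the crawl is a plain breadth-first search.

```python
from collections import deque

def reachable(graph, seeds):
    """Return every page a crawler can reach by following links from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for linked in graph.get(page, ()):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return seen

# Hypothetical toy web: two pages per part, links point from key to values.
graph = {
    "core1": ["core2", "out1"],   # core pages link to each other and to the out cone
    "core2": ["core1", "out2"],
    "in1": ["core1"],             # in-cone pages link into the core only
    "in2": ["core2"],
    "out1": [],                   # out-cone pages link nowhere
    "out2": [],
    "island1": [],                # islands have no links in or out
    "island2": [],
}

all_pages = set(graph)
from_core = reachable(graph, ["core1"])
from_all_seeds = reachable(graph, ["in1", "in2", "island1", "island2"])

print(len(from_core), "of", len(all_pages), "pages from a single core seed")
print(len(from_all_seeds), "of", len(all_pages), "pages from in-cone plus island seeds")
```

In this toy graph a single core seed reaches only the core and the out cone, while seeding every in-cone page and every island reaches everything, which is the "about 50% of all websites as seeds" argument in miniature.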

However, if you weight each page by its page views, you'll probably find
that Google (and other search engines) cover much more than that, since
page views on sites tend to follow a power-law distribution.
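
As a rough illustration of that point, the sketch below assumes page views follow a Zipf-style power law (views proportional to 1 / rank^exponent) and that the indexed sites are exactly the most-viewed ones. Both are simplifying assumptions of mine, and the function name, site count, and exponent are made up for the example.

```python
def view_weighted_coverage(num_sites, indexed_fraction, exponent=1.0):
    """Share of all page views that land on indexed sites.

    Assumes views for the site at a given rank are proportional to
    1 / rank**exponent, and that the index holds the top-ranked sites.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_sites + 1)]
    indexed = int(num_sites * indexed_fraction)
    return sum(weights[:indexed]) / sum(weights)

if __name__ == "__main__":
    for fraction in (0.5, 0.75):
        share = view_weighted_coverage(100_000, fraction)
        print(f"indexing {fraction:.0%} of sites covers {share:.1%} of page views")
```

Under these assumptions, indexing half of the sites already accounts for well over 90% of page views. A real index will not line up perfectly with the most-viewed sites, so treat the figure as an upper bound for this model.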

-- Johan