[Haskell-cafe] Re: String vs ByteString

wren ng thornton wren at freegeek.org
Tue Aug 17 23:31:54 EDT 2010


Ivan Lazar Miljenovic wrote:
> On 18 August 2010 12:12, wren ng thornton <wren at freegeek.org> wrote:
>> Johan Tibell wrote:
>>> To my knowledge the data we have about prevalence of encoding on the web
>>> is accurate. We crawl all pages we can get our hands on, by starting at
>>> some set of seeds and then following all the links. You cannot be sure
>>> that you've reached all web sites as there might be cliques in the web
>>> graph but we try our best to get them all. You're unlikely to get a
>>> better estimate anywhere else. I doubt many organizations have the
>>> machinery required to crawl most of the web.
>> There was a study recently on this. They found that there are four main
>> parts of the Internet:
>>
>> * a densely connected core, where from any site you can get to any other
>> * an "in cone", from which you can reach the core (but not other in-cone
>> members, since then you'd both be in the core)
>> * an "out cone", which can be reached from the core (but which cannot reach
>> each other)
>> * and, unconnected islands
> 
> I'm guessing here that you're referring to what I've heard called the
> "hidden web": databases, etc. that require sign-ins, etc. (as stuff
> that isn't in the core, to differing degrees: some of these databases
> are indexed by google but you can't actually read them without an
> account, etc.) ?

Not so far as I recall. I'd have to find a copy of the paper to be sure 
though. Because the metric used was graph connectivity, if those hidden 
pages have links out into non-hidden pages (e.g., the login page), then 
they'd be counted in the same way as the non-hidden pages reachable from 
them.
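
To make the connectivity point concrete, here is a minimal sketch in
Python. It is purely illustrative: the page names, the classify helper,
and the choice of "home" as a seed assumed to lie in the core are all
made up for this example and are not taken from the paper. A page is
labelled by whether it can reach the seed, be reached from it, both, or
neither.

```python
from collections import deque

def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return the graph with every edge flipped."""
    rev = {}
    for src, dsts in graph.items():
        rev.setdefault(src, [])
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev

def classify(graph, seed):
    """Label each node relative to a seed assumed to be in the core."""
    nodes = set(graph) | {d for dsts in graph.values() for d in dsts}
    forward = reachable(graph, seed)            # pages the seed can reach
    backward = reachable(reverse(graph), seed)  # pages that can reach the seed
    labels = {}
    for node in nodes:
        if node in forward and node in backward:
            labels[node] = "core"
        elif node in backward:
            labels[node] = "in"
        elif node in forward:
            labels[node] = "out"
        else:
            labels[node] = "other"  # islands and anything else
    return labels

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["login", "blog"],
    "login": ["home"],
    "blog": ["home", "pdf"],
    "hidden-db": ["login"],  # sign-in-only page that links out to the login page
    "pdf": [],
    "island": [],
}

labels = classify(links, "home")
```

Here "hidden-db" comes out as "in": nothing links to it, but it links to
the login page, so on connectivity alone it is treated like any other
page that can reach the core.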

-- 
Live well,
~wren

