[Haskell-cafe] Downloading web page in Haskell
Albert Y. C. Lai
trebla at vex.net
Sat Nov 20 19:12:03 EST 2010
On 10-11-20 02:54 PM, José Romildo Malaquias wrote:
> In order to download a given web page, I wrote the attached program. The
> problem is that the page is not being fully downloaded. It is being
> somehow interrupted.
The specific website and URL
http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne
truncates when the web server chooses the identity encoding (as
opposed to compressed ones such as gzip). The server chooses identity
when your request's Accept-Encoding field specifies identity, or when
your request simply has no Accept-Encoding field at all, such as when
you use simpleHTTP (getRequest url), curl, wget, or elinks.
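For concreteness, here is a minimal sketch of the kind of request that
gets truncated, using the HTTP package (an assumption on my part;
José's attached program is presumably similar):

-- A minimal sketch of the failing approach, assuming the HTTP package.
-- getRequest sends no Accept-Encoding field at all, so the server
-- chooses identity and the received body is truncated.
import Network.HTTP

main :: IO ()
main = do
    let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
    body <- getResponseBody =<< simpleHTTP (getRequest url)
    putStr body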
When the server chooses gzip (its favourite), which is when your
Accept-Encoding field includes gzip, the received data is complete (but
then you have to gunzip it yourself). This happens with mainstream
browsers and W3C's validator at validator.w3.org (which destroys the
"you need javascript" hypothesis). I haven't tested other compressed
encodings.
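So one fix on the Haskell side is to ask for gzip and decompress the
body yourself. A sketch, not tested against this server, assuming the
HTTP and zlib packages (a robust version should also check the
response's Content-Encoding field before decompressing, in case the
server chooses identity after all):

-- Request the page with "Accept-Encoding: gzip" and gunzip the body.
import Network.HTTP
import Network.URI (parseURI)
import qualified Data.ByteString.Lazy as L
import qualified Codec.Compression.GZip as GZip

main :: IO ()
main = do
    let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
    uri <- maybe (fail "bad URI") return (parseURI url)
    -- Use a lazy ByteString body so the compressed bytes survive intact.
    let req = insertHeader HdrAcceptEncoding "gzip"
                           (mkRequest GET uri :: Request L.ByteString)
    body <- getResponseBody =<< simpleHTTP req
    L.writeFile "save.html" (GZip.decompress body)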
Methodology
How I discovered and confirmed this is a great lesson in the triumph
of the scientific method (over the prevailing opinionative method, for
example).
The first step is to confirm or deny a Network.HTTP problem. For a
maximally controlled experiment, I enter HTTP by hand using nc:
$ nc www.adorocinema.com 80
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
<blank line>
It still truncates, so at least Network.HTTP is not alone. I also try
elinks. Other people try curl and wget for the same reason, with the
same result.
The second step is to confirm or deny javascript magic. Actually, the
truncation strongly suggests that javascript is not involved: the
truncated data ends with an incomplete end-tag "</". This is abnormal
even for very buggy javascript-heavy web pages. To rule out javascript
magic for certain, I first try Firefox with javascript off (also java
off, flash off, even css off), and then I also ask validator.w3.org to
validate the page. Both receive complete data. Of course the validator
is going to say "many errors", but the point is that if the validator
reports errors at locations way beyond our truncation point, then the
validator sees data we don't see, and the validator doesn't even care
about javascript. The validator may be very sophisticated in parsing
html, but in sending an HTTP request it ought to be very simple-minded.

The third step is to find out what extra thing the validator does to
deserve complete data.
So I try diagonalization: I give this CGI script to the validator:
#! /bin/sh
# CGI script: echo the request's environment variables (which include
# the request's header fields) back as an html page.
echo 'Content-Type: text/html'
echo ''
e=`env`
cat <<EOF
<html><head><title>title</title></head><body><pre>
$e
</pre></body></html>
EOF
If I also tell the validator to "show source", which means display what
html code it sees, then I see the validator's request (assuming the web
server passes the request's header fields through as environment
variables, but modern web servers probably support much more than the
original minimal CGI specification). There are indeed quite a few extra
header fields the validator sends, and I can try to mimic each of them.
Eventually I find this crucial field:
Accept-Encoding: gzip, x-gzip, deflate
Further tests confirm that we just need gzip.
Finally, to confirm the finding with a maximally controlled experiment,
I enter the improved request by hand using nc, but this time I save the
output in a file (so later I can decompress it):
$ nc -q 10 www.adorocinema.com 80 > save.gz
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
Accept-Encoding: gzip
<blank line>
<wait a while>
Now save.gz contains both header and body, and it only makes sense to
uncompress the body. So edit save.gz to delete the header part
(everything up to and including the first blank line). Applying gunzip
to the body will give some "unexpected end of file" error. Don't
despair. Do this instead:
$ zcat save.gz > save.html
It still reports an error, but save.html has meaningful and complete
content. You can examine it. You can load it in a web browser and see.
At least, it is much longer, and it ends with "</html>" rather than
"</".
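If you prefer to do that surgery programmatically, here is a sketch in
Haskell (file names as above; it assumes the standard CRLF CRLF
separator between header and body):

-- Strip the HTTP header (everything up to and including the first
-- blank line) from the nc capture, leaving only the gzip body.
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as C

dropHeader :: L.ByteString -> L.ByteString
dropHeader bs
  | L.null bs                           = bs
  | C.pack "\r\n\r\n" `L.isPrefixOf` bs = L.drop 4 bs
  | otherwise                           = dropHeader (L.drop 1 bs)

main :: IO ()
main = L.readFile "save.gz" >>= L.writeFile "body.gz" . dropHeader

Then zcat body.gz > save.html as before.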