[Haskell-cafe] Testing web-interfacing applications

Mon Jun 30 12:08:37 UTC 2014

Mateusz Kowalczyk <fuuzetsu at fuuzetsu.co.uk> writes:

> Hi,
>
> Whenever I write a program which has to interface with the web
> (scraping, POSTing, whatever), I never know how to properly test it.
> What I have been doing up to date is fetching some pages ahead of time,
> saving locally and running my parsers or whatever it is I'm coding at
> the moment against that.
>
> The problem with this approach is that we can't test a whole lot: if we
> have a crawler, how do we test that it goes to the next page properly?
> Testing things like logging in and such seems close to impossible, we
> can only test if we are making a good POST.
>
> Let's stick to a crawler example. How would you test that it follows
> links? Do people set up local webservers with a few dummy pages they
> download? Do you just inspect that GET and POST ‘look’ correct?
>
> Assume that we don't own the sites so we can't let the program run tests
> in the wild: page content might change (parser tests fail), APIs might
> change (unexpected stuff back), our account might be locked (guess they
> didn't like us logging in 20 times in the last hour during tests) &c.
>
> Of course there is nothing which can prevent upstream changes but I'm
> wondering how we can test the more-or-less static stuff without calling
> out into the world.

I'd use a "Self-Initialising Fake":
http://martinfowler.com/bliki/SelfInitializingFake.html

1) Make sure you're not hard-coding the IO functions you're using,
   i.e. use dependency injection, either via explicit parameters or via a
   typeclass.

2) If you wrote a typeclass or some other fancy abstraction for (1),
   implement it using the real HTTP procedures you want to use.

3) Write another implementation which reads canned responses from files,
   e.g. looking each request up by a filename derived from its hash. If
   no file is found, it should use the real HTTP implementation to get
   the data, store it in an appropriate file, then return it.

Use the implementation from (2) in production (or just the raw
procedures if you didn't abstract them) and use the implementation from
(3) in tests.

This lets you test real responses without having to rely on the
network, without having to worry about hammering other people's
machines, etc.
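
A minimal sketch of steps (1) to (3), using an explicit parameter rather
than a typeclass. Everything named here (Fetch, hashRequest,
selfInitialising, pageLength) is made up for illustration, the hash is a
simple FNV-1a-style stand-in for whatever you'd really use, and it
assumes the directory and filepath packages are available:

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)
import Numeric (showHex)
import System.Directory (createDirectoryIfMissing, doesFileExist)
import System.FilePath ((</>))

-- (1) The injected dependency: any action that turns a URL into a body.
type Fetch = String -> IO String

-- Hypothetical request hash (FNV-1a style), used only to name cache files.
hashRequest :: String -> FilePath
hashRequest = flip showHex "" . foldl' step 14695981039346656037
  where
    step :: Word64 -> Char -> Word64
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- (3) Wrap a real fetcher (2) so that it records its response on a miss
-- and replays the recorded file on a hit.
selfInitialising :: FilePath -> Fetch -> Fetch
selfInitialising dir real url = do
  createDirectoryIfMissing True dir
  let path = dir </> hashRequest url
  cached <- doesFileExist path
  if cached
    then do
      body <- readFile path
      -- Force the lazy read so the file handle is closed promptly.
      length body `seq` return body
    else do
      body <- real url
      writeFile path body
      return body

-- Code under test takes the fetcher as an explicit parameter.
pageLength :: Fetch -> String -> IO Int
pageLength fetch url = length <$> fetch url
```

Production code would call something like pageLength realFetch, while
tests would call pageLength (selfInitialising "test-cache" realFetch),
where realFetch is whichever HTTP client function you already use.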

Since all responses are static, it won't model dynamic server-side
processing, but it sounds like you're OK with that. Note that caching
won't help much with randomised requests, e.g. if your data is coming
from QuickCheck, since most of them will miss the cache. You can either
limit your ranges to ensure more overlap, e.g. using randomInt `mod` 20
instead of randomInt, or write a custom pattern-matching/response-rewriting
layer on top of the cache.
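
To illustrate the range-limiting idea, here is a tiny sketch. The names
boundedPage and pageUrl and the URL scheme are invented for the example;
with QuickCheck itself, a generator like choose (0, 19) should have the
same effect:

```haskell
-- Hypothetical helper: fold any generated Int into the range 0..19,
-- so that at most 20 distinct requests can ever be made.
boundedPage :: Int -> Int
boundedPage n = n `mod` 20

-- Hypothetical URL scheme; example.invalid never resolves.
pageUrl :: Int -> String
pageUrl n = "http://example.invalid/page/" ++ show (boundedPage n)
```

Since Haskell's mod takes the sign of the divisor, negative inputs land
in 0..19 as well, so after at most 20 cache misses every further request
is a hit.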

For extra confidence that your tests are safe, e.g. if you have some
highly-randomised tests which will often miss the cache, you could also
write a pure implementation based around a Data.Map, returning a canned
404 response for any request not in the Map. A simple driver function can
populate the Map from any existing cache files, ensuring that the tests
themselves are always pure.
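
Roughly, and again with made-up names (Response, pureFetch, loadCache),
the pure variant might look like the sketch below. It assumes the cache
directory is flat, that files are named by some key function of the
request, and that only bodies were stored, so every hit is reported as a
200:

```haskell
import qualified Data.Map as Map
import System.Directory (listDirectory)
import System.FilePath ((</>))

-- Hypothetical response type: just a status code and a body.
data Response = Response { status :: Int, body :: String }
  deriving (Eq, Show)

-- The canned answer for anything not in the Map.
notFound :: Response
notFound = Response 404 "Not Found"

-- Pure lookup. `key` is whatever function names the cache files,
-- e.g. the request hash used when the cache was recorded.
pureFetch :: (String -> FilePath) -> Map.Map FilePath String -> String -> Response
pureFetch key cache url =
  maybe notFound (Response 200) (Map.lookup (key url) cache)

-- Driver: read every file in the cache directory once, before the
-- tests run, so the tests themselves never touch IO.
loadCache :: FilePath -> IO (Map.Map FilePath String)
loadCache dir = do
  names <- listDirectory dir
  bodies <- mapM (readStrict . (dir </>)) names
  return (Map.fromList (zip names bodies))
  where
    -- Force each lazy read so file handles are released promptly.
    readStrict path = do
      s <- readFile path
      length s `seq` return s
```

The test suite would call loadCache once in its setup and hand
pureFetch key cache to the code under test, so a cache miss can only
ever produce the canned 404, never a network request.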

Cheers,
Chris