[Haskell] Real life examples

Tue Nov 23 23:50:45 EST 2004

On Mon, Nov 22, 2004 at 05:03:30PM +0100, Benjamin Franksen wrote:
> You have been asked more than once to present a *real-life* example to 
> illustrate that
> 
> (a) global variables are necessary (and not just convenient),
> (b) both above mentioned alternatives are indeed unworkable.

First of all, there are a couple issues here that are getting
mixed up in the discussion. One is that no one is arguing for everyone
to use global variables, their disadvantages are well known. However, we
do see a practical NEED for a mechanism in the language to create
top-level monadic bindings.  The existance of other mechanisms to
achieve the same thing only STRENGTHENS the argument for them, they
won't break anything because we already have them, but will be actually
safe, rather than coincidentally safe due to the peculiarities of the
ghc optimizer and absurdly more efficient.  

Motivated. I decided to do some grepping and bring up some examples.

surprisingly, there is a perfect example in the haskell standard itself:

****
module Random where
setStdGen    :: StdGen -> IO ()
getStdGen    :: IO StdGen     
randomIO  :: Random a => IO a

These get and set the global standard generator for random numbers.
Random number algorithms are easy to write in haskell, generator
splitting routines, the random number generation happens in the IO monad
so it is even okay that it depends on other calls to randomIO and
setStdGen but THIS CANNOT BE IMPLEMENTED IN HASKELL. and that is the
problem.

Should one really have to pass the global standard generator around?
people write monads for the express purpose of HIDING this sort of
thing and IO is a great ubiquitous monad, it would be a shame if a user
couldn't extend it to also pipe around an appropriate random number
seed.

****

stdout,stderr,stdin - people have brought up that these can be
represented by stdout = 1, stdin = 0, and stderr = 2 and treat the
integers as refering to magic built-in constants that refer to handles.

now, what if you wanted to stop relying on magic built-ins and implement
your buffering algorithms in haskell directly exposing them to ghcs
optimizer? You can't without the ability to globally initialize their
buffers. 

****   (some examples from libraries)

Data.Unique

This provides a unique supply of numbers in the IO Monad and illustrates
the efficiency concerns extremely well.

newUnique :: IO Int 

this creates a new unique integer simply by incrementing a number and
returning it. the number is a part of the world. 

now, we can in theory implement this without top level declarations:

newUnique = do
        e <- getEnv "magicUniqueName"
        let n = (read e :: Int)
        putEnv "magicUniqueName" (show $ n + 1)
        return n

(or perhaps something based on files)

note, this does exactly the same thing, but what should have been 3
machine instructions is now thousands and thousands and is much less
safe because someone else might guess the 'magicUniqueName' and
overwrite it, if we had a top-level var, we could just not export the
var and rest assured our invarients are not broken.

****

Atom.hs from ginsu..

This is perhaps the best example, and an incredibly useful piece of code
for anyone struggling with space problems out there.

it provides

data Atom = ... (abstract)

instance Ord Atom
instance Eq Atom
toAtom :: String  -> Atom
fromAtom :: Atom -> String

What it does is hash the strings in a global hash and return an Atom
which internally has an index into a table of strings. This has a couple
of advantages:
comparing Atoms is much much faster, equivalant to comparing Ints, the
strings are hash-consed so all instances of "Foo" will use the same
memory.

internally, Atom has a global hash table of strings -> atoms, note that
externally, Atom is truly purely functional. toAtom and fromAtom
although using internal state inside are real functions. the same
argument always returns the same (externally visible) result. This is
because the actual integer chosen is hidden, there is no way to get at
it outside the module. 

there would be no way to do this without global state without seriously
compromising it's usability.

imagine I decided to get rid of the global state because I (mistakenly)
believed that all global state was inherently evil no matter what. then
my recourse would be to implement something like:

data AtomHash = ... 
data Atom = ... (abstract)

instance Ord Atom
instance Eq Atom
newAtomHash :: IO AtomHash
toAtom :: AtomHash -> String  -> IO Atom
fromAtom :: AtomHash -> Atom -> IO String

note a couple things:

1. The pure functions now are stuck in the IO monad, since I made their
dependence on AtomHash explicit, the fact that they modify AtomHash must
be made explicit by placing them in the IO monad. (it is possible to
come up with other formulations not in the IO monad, but they would have
similar problems) This alone is almost enough to kill the idea, but even
worse is the second 

2. The fundamental property that there is an isomorphism between Atoms
and Strings is broken. because one might create multiple AtomHashs.
Suddenly what was a STATIC COMPILE TIME GUARENTEE becomes a run-time
obscure bug generating probelem. 

furthermore, imagine you carefully avoided ever creating more than one
AtomHash, what purpose does it serve to pass everywhere then? it is
meerly a source of confusion and obfuscation. and someone could come
along to use your library, call 'newAtomHash' and break everything in a
way that would be very tricky to debug.  

This is not a minor performance gain. in ginsu it dropped the memory
usage from > 100megs to 10megs. I would call that vital. when it used
100megs it was not a usable program.

****

Caching. 

in ginsu and a couple other projects, There is the common idiom where
you have something that depends on IO and is very expensive to
calculate.

for the sake of argument, let us pretend processing your configuration
files and command line arguments was very expensive.

what you do is go and write

data Config = ...
getConfig :: IO Config  

now, you want to be able to call getConfig to get the configuration at
various stages, however this is very expensive, it has to read files,
parse them, etc.... 

note that getConfig is in the IO monad, we are doing nothing tricky, it
is perfectly fine for this IO action to call getArgs and read files.

but the problem is we have to do it every time, there is no way to cache
the result of a previous  run. with a global variable, all we need to do
is set the variable and check the modification times of the files and
re-run it if needed. suddenly, we get fast efficient configuration for
the common unchanging case for free without changing the program
semantics at all.

getConfig behaves identically after the change, it is just more
efficient. perhaps drastically so if it was doing something more
complicated than reading config files. 

Why burden the user with this performance hack by making them explicitly
carry the cache around? as far as the user is concerned, getConfig DOES
go and read the config files each time, They see it is in IO, they know
it can be doing anything, and logically, that is how they would like to
think about it when writing their program, a black box, the ability to
abstract away things like this is a great asset for a programming
language. 

****

Haskell ALREADY is great at delegating stateful computation to the IO
monad, this won't change that, it won't break any properties which arn't
already breakable. in fact, it won't even break any properties at all,
everything is STILL in the IO monad, anything that depends on global
state will already HAVE to be in the IO monad, that should be indication
enough to the programmer that this depends on the world, extended by the
programmer in well thought out abstracted ways.
        John

-- 
John Meacham - ⑆repetae.net⑆john⑈