Too Strict?

Tue, 16 Jan 2001 16:11:53 +0000

Dominic,

> What I can't see at the moment is how to keep what I was doing modular. I had
> a module Anonymize, the implementation of which I wanted to change without
> the user of it having to change their code. The initial implementation was a
> state monad which generated a new string every time it needed one but if it
> was a string it had already anonymized then it looked it up in the state. I
> initially used a list but with 100,000+ strings it took a long time. The next
> implementation used FiniteMap which improved things considerably. I only had
> to make three changes in Anonymize and none in Main. Using MD5 is quicker
> still but isn't so good from the program maintenance point of view.

my first stab at the modularity issue was the version _2 in my last message. 
Looking back at the Anonymizable class and instances in your full program, 

  type Anon a = IO a

  class Anonymizable a where
     anonymize :: a -> Anon a

  -- MyString avoids overlapping instances of Strings 
  -- with the [Char]

  data MyString = MyString String
     deriving Show

  instance Anonymizable MyString where
     anonymize (MyString x)
        = do s <- digest x
             return ((MyString . showHex') s)

  instance Anonymizable a => Anonymizable [a] where
     anonymize xs = mapM anonymize xs

the problem is in the Anonymizable instance for [a]: the mapM in anonymize
constructs an IO script, consisting of some IO operation for each list element,
all chained together into a monolithic whole.

As IO a is an abstract type, this is a bit too restrictive to be modular: if I
ever want any of the anonymized Strings, I can only get a script that
anonymizes them all - before executing that script, I don't have any anonymized
Strings, and after executing the script, all of them have been processed.

This forestalls any attempt to interleave the anonymization with some further
per-element processing. Instead, I would prefer to have a list of IO actions,
not yet chained together (after all, in Haskell, they are just data items), but
that doesn't fit the current return type of anonymize.  One approach would be
to change the type of Anon a to [IO a], or to ignore the [a] instance and use
the MyString instance only, but the longer I look at the code, the less I'm
convinced that the overloading is needed at all.

Unless there are other instances of Anonymizable, why not simply have a
function anonymize :: String -> Anon String ? That would still allow you to
hide the implementation decisions you mentioned (even in a separate module),
provided that any extra state you need can be kept in the IO monad.

One would have to write mapM anonymize explicitly where you had simply
anonymize, but it would then become possible to do something else with the list
of IO actions before executing them (in this case, to interleave the printing
with the anonymization).

First, here is the interesting fragment with the un-overloaded anonymize:

  readAndWriteAttrVals =
     do h <- openFile fileout WriteMode
        s <- readFile filename
        a <- mapM anonymize (lines s)
        hPutStr h (unlines a) 

It is now possible to import anonymize from elsewhere and do the interleaving
in the code that uses anonymize:

  readAndWriteAttrVals =
     do h <- openFile fileout WriteMode
        s <- readFile filename
        let action line = do 
                          { a <- anonymize l
                          ; hPutStr h a
                          }
        mapM_ action (lines s)

Would that work for your problem? Alternatively, if some of your implementation
options require initialization or cleanup, your Anonymize module could offer a
function to process all lines, with a hook for per-line processing:

  processLinesWith perLineAction ls =
    do
    { initialize
    ; as <- mapM action ls
    ; cleanup
    ; return as
    }
    where
      action l = do { a <- anonymize l ; perLineAction a } 

Then the code in the client module could simply be:

  readAndWriteAttrVals =
     do h <- openFile fileout WriteMode
        s <- readFile filename
        processLinesWith (hPutStr h) (lines s)
        return ()

Closing the loop, one could now redefine the original, overloaded anonymize to
take a perLineAction, with the obvious instances for MyString and [a], but I
really don't see why every function should have to be called anonymize?-)

Claus

PS The simplified code of the new variant, for observation:

module Main(main) where

import Observe
import IO(openFile,
          hPutStr,
          IOMode(ReadMode,WriteMode,AppendMode))

filename = "ldif1.txt"
fileout  = "ldif.out"

readAndWriteAttrVals =
   do h <- openFile fileout WriteMode
      s <- readFile filename
      let { anonymize s = return (':':s)
          ; action l = do 
                       { a <- anonymize l
                       ; hPutStr h a
                       }
          }
      mapM_ (observe "action" action) (lines s)

main = runO readAndWriteAttrVals