Too Strict?

Steinitz, Dominic J
16 Jan 2001 10:27:32 Z


Thanks for this. I think I can see what the problem was: mapM has to process the whole input before making a value available for hPutStr. Also I can see how to fix the problem.

What I can't see at the moment is how to keep what I was doing modular. I had a module Anonymize, the implementation of which I wanted to change without the user of it having to change their code. The initial implementation was a state monad which generated a new string every time it needed one but if it was a string it had already anonymized then it looked it up in the state. I initially used a list but with 100,000+ strings it took a long time. The next implementation used FiniteMap which improved things considerably. I only had to make three changes in Anonymize and none in Main. Using MD5 is quicker still but isn't so good from the program maintenance point of view.

Dominic. on 15/01/2001 15:02:00
To:	Dominic Steinitz
cc:	haskell
Subject:	Re: Too Strict? 

> Can someone help? The program below works fine with small files but when I
> try to use it on the one I need to (about 3 million lines of data) it
> produces no output. The hard disk is hammered - I assume this is the run time
> system paging. My suspicion is that the program is trying to read in the
> whole file before processing it. Is this correct? If so, how do I make the
> program lazy so that it processes a line at time?

I was about to apply GHood to your program to see whether such a tool could
help to find the problem, so I started to cut down your code. However,
in the simplified version of the program, one can see what is going on
without any graphical tool.. (nevertheless, observation of the simplified
code with GHood confirms your suspicion immediately, and it points out
the spine of the list as the problem, too, so the tool is useful!-) (*)

In effect, your anonymize comes down to a "mapM" over a list of actions
applied to input lines, and all of the resulting IO-actions are placed
before the single "hPutStr". So, even if the results of the individual
actions in the list may not be needed until later, the whole spine of
the list of lines has to be traversed before "hPutStr" can be executed,
meaning that all input is read before any output is produced (and thus
before any computation results are requested, blowing up memory usage).

For the problem at hand, you could simply output each line as it is
processed instead of just returning it into a list for later use (see
variant _1 below). If you would want to keep both the modular program
structure and the explicit line-by-line IO-style, you would need to
interleave the input and output commands somehow (perhaps similar to
variant _2 below?).


(*) Please note that our web-server is being upgraded today..
    (web-pages and GHood download will not be available until
     tomorrow, hence no URL here :-(

PS The simplified code (+ variations) with observations:

module Main(main) where

import Observe
import IO(openFile,

filename = "ldif1.txt"
fileout  = "ldif.out"

readAndWriteAttrVals =
   do h <- openFile fileout WriteMode
      s <- readFile filename
      let action l = return (':':l)
      a <- mapM action (observe "input" (lines s))
      hPutStr h (unlines (observe "output" a))

main = runO readAndWriteAttrVals

readAndWriteAttrVals_1 =
   do h <- openFile fileout WriteMode
      s <- readFile filename
      let action_and_output l = hPutStr h (':':l)
      mapM_ (observe "output" action_and_output) (observe "input" (lines s))

main_1 = runO readAndWriteAttrVals_1

readAndWriteAttrVals_2 =
   do h <- openFile fileout WriteMode
      s <- readFile filename
      let { action l = return (':':l)
          ; as = map action (observe "input" (lines s))
          ; os = repeat (hPutStr h)
      mapM id (observe "output" (zipWith (>>=) as os))

main_2 = runO readAndWriteAttrVals_2

Haskell mailing list

21st century air travel