[Haskell-cafe] maybe IO doesn't suck, but my code does...

Fri Dec 3 16:04:29 EST 2004

Hi Frédéric,

I took a look at your program. To be honest, I have to admit that it is not how I think one should program in Haskell. However, I have changed it to the style I would recommend, and now it runs in constant space.

At first I thought it would be enough to make the fields of your record strict. By the way, this almost never hurts if you have some kind of state transition.

> data Stat = Stat { latest :: !CalendarTime, total :: !Int } deriving (Show)

This was not enough on its own, but it was essential in the final version anyway. So let's think about why this is necessary. Well, in _doParse you write something like:

> Stat (max (time hit) (latest state)) (total state + 1)
>                                       ^^^^^^^^^^^^^^^^

Since Haskell is lazy, it won't evaluate the (+), so it keeps every intermediate Stat alive (as part of a growing chain of unevaluated additions) until the very end, when you actually print the result. The ! annotation in the record definition does not allow closures to be stored in the fields, so each field is evaluated as soon as the record itself is constructed.
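
To make the effect of the ! annotation visible, here is a minimal sketch. The types LazyStat and StrictStat and the helper throwsWhenForced are hypothetical names invented for this illustration; they are not part of the original program. The idea is to put `undefined` into a field and see whether merely constructing the record already touches it.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical record with a lazy field: the field may hold a closure.
data LazyStat = LazyStat { lazyTotal :: Int }

-- Hypothetical record with a strict field: the field is evaluated
-- when the constructor itself is evaluated.
data StrictStat = StrictStat { strictTotal :: !Int }

-- Evaluate a value to weak head normal form and report whether
-- doing so raised an exception.
throwsWhenForced :: a -> IO Bool
throwsWhenForced x = do
  r <- try (evaluate x)
  case r of
    Left e  -> const (return True) (e :: SomeException)
    Right _ -> return False
```

With these definitions, `throwsWhenForced (LazyStat undefined)` should report False, because the lazy field is never looked at, while `throwsWhenForced (StrictStat undefined)` should report True, because building the record forces the field.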

Now to the design!
Please do not use IORefs if you don't really need them. The streaming that you have done by reading line by line by hand can be performed using lazy IO.
The main.hs changes to:

> do args  <- getArgs
>    contents <- readFile (args !! 0)
>    let result = parseLog contents
>    print result
or
>    print $ total result
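
As a rough, self-contained illustration of the lazy IO style (not the code from the attachment), the sketch below counts the lines of a file. The function name countLines is made up for this example. readFile returns its contents lazily, so the file is only read as far as the consumer demands it, and nothing here needs a hand-written read loop or an IORef.

```haskell
-- Count the lines of a file using lazy IO.
-- `readFile` hands back a lazily produced String; `length . lines`
-- consumes it piece by piece, so the consumed prefix can be
-- garbage collected as we go.
countLines :: FilePath -> IO Int
countLines path = do
  contents <- readFile path
  -- ($!) forces the count before we return, so the whole file
  -- has been consumed by the time the caller sees the result.
  return $! length (lines contents)
```

In a Main module this could be used as `getArgs >>= \(file:_) -> countLines file >>= print`, assuming at least one argument is given.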

You don't need IO in the parsing module, believe me.
The ParseLog.hs becomes:

> module ParseLog (parseLog, total) where
> import Hit
> data Stat = Stat { latest :: !CalendarTime, total :: !Int } deriving (Show)
>
> _parseLine :: String -> Stat -> Stat
> _parseLine line state   = let hit = parseHit line
>      in Stat (max (time hit) (latest state)) (total state + 1)
>
> parseLog :: String -> Stat
> parseLog contents         = let initial = Stat epoch 0
>       in foldl (flip _parseLine) initial $ lines contents

The Hit module remains unchanged. Try to use functions like fold(l|r), map ... instead of writing your recursions by hand, and try to make use of laziness where it helps.
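
One caveat on the choice of fold: a plain foldl may still build up a chain of suspended calls before anything is evaluated, depending on the compiler and optimisation level. Data.List.foldl' forces the accumulator at every step, and together with the strict fields that keeps the state small. The following is only a sketch under simplifying assumptions: the timestamp is an Int rather than a CalendarTime, and hitTime is a hypothetical stand-in for `time . parseHit`, since the Hit module is not shown here.

```haskell
import Data.List (foldl')

-- Simplified state: an Int timestamp replaces CalendarTime (assumption).
data Stat = Stat { latest :: !Int, total :: !Int } deriving (Show, Eq)

-- Hypothetical stand-in for `time . parseHit`: read a leading
-- integer timestamp from the line, defaulting to 0 if there is none.
hitTime :: String -> Int
hitTime line = case reads line of
  [(t, _)] -> t
  _        -> 0

-- One state transition per log line.
step :: Stat -> String -> Stat
step st line = Stat (max (hitTime line) (latest st)) (total st + 1)

-- foldl' evaluates each intermediate Stat to weak head normal form,
-- and the strict fields then force both counters.
parseLog :: String -> Stat
parseLog = foldl' step (Stat 0 0) . lines
```

For example, `parseLog "5 a\n9 b\n3 c\n"` gives a Stat whose latest is 9 and whose total is 3.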

Recently there has been some discussion about blockwise IO and similar stuff, but if you don't care too much about speed you can go with the standard library.

Cheers,
  Georg

On Fri, 3 Dec 2004 11:33:34 +0100, Frédéric Gobry <frederic.gobry at epfl.ch> wrote:

> Hello,
>
> I'm a Haskell beginner, and I'm struggling with the following problem:
> I've started writing a simple apache log file analyzer, but I cannot
> get rid of important memory usage problems (in fact, at each attempt, I
> fear I won't be able to unlock my box as my Linux 2.6.9 kernel is on its
> knees, which reminds me of my early days writing C on MMU-less
> processors... not because of the language of course :-))
>
> Enclosed is a sample code, which aborts on a large (> 100000 lines)
> file.
>
> I tried different variations (readFile $ lines, openFile, openFile +
> IORef,...) but with no success...
>
> So, if the enclosed version is not too far, please give me a hint.
> Alternatively, if I took the wrong direction, please refocus my search
> .-)
>
> Thanks in advance,
>
> Frédéric
>

-- 

---- Georg Martius,  Tel: (+49 34297) 89434 ----
------- http://www.flexman.homeip.net ---------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ParseLog_Georg.hs
Type: application/octet-stream
Size: 437 bytes
Desc: not available
Url : http://www.haskell.org//pipermail/haskell-cafe/attachments/20041203/cc748e41/ParseLog_Georg.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Main_Georg.hs
Type: application/octet-stream
Size: 227 bytes
Desc: not available
Url : http://www.haskell.org//pipermail/haskell-cafe/attachments/20041203/cc748e41/Main_Georg.obj