[Haskell] Probably a trivial thing for people knowing Haskell
paul at cogito.org.uk
Sat Oct 18 13:48:21 EDT 2008
> I've written just a few programs in Haskell one in a comparison for a
> task I had "nearly daily".
The first thing I notice is that this is clearly a direct translation
from something like Perl. Thats understandable, but I'd suggest
rewriting it with something like this (untested, uncompiled code)
-- Concatenate all the files into one big string. File reading is lazy,
so this won't take all the memory.
getAllFiles :: [String] -> IO String
getAllFiles paths = do
contents <- mapM getFile paths
return $ concat contents
Then use "lines" to split the result into individual lines and process
them using "filter", "map" and "foldr". Because file reading is lazy,
each line is only read when it is to be processed, and then gets reaped
by the garbage collector. So it all runs in constant memory.
(By the way, putting in the top level type declarations helps a lot when
you make a mistake.)
One thing you are doing right is keeping a (sum, count) pair. A gotcha
with Haskell is to compute an average of a list of numbers like this:
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
The problem with this is that it has to traverse the list twice, which
means that the whole list has to be held in memory. So instead you have
to write something like:
mean xs = let (total, count) = foldr (\x (t, c) -> (t + x, c+1))
(0.0, 0) xs in total / fromIntegral count
This is a pain, but it does only traverse the list once.
See how you get on.
> The code analyzes Apache logs and picks some certain stuff from it and
> after that calculates a bit around with it.
> Here's the code
> module Main where
> import System
> import System.IO
> import System.Directory
> import System.IO.Error
> import Text.Regex
> import Control.Monad
> regexp = mkRegex ("([0-9]+) Windows ex")
> main = do
> files <- show_dir "[0-9].*"
> (sum,count) <- run_on_all_files (0,0) files
> let dd = (fromIntegral (sum::Integer))/ (fromIntegral (count::Int))
> putStr("Download = " ++ show sum ++ " in " ++ show count ++ " days are " ++ show dd ++ " downloads/day\n")
> run_on_all_files (a,b)  = return (a,b)
> run_on_all_files (a,b) (x:xs) = do (s,c) <- run_on(a,b) x
> run_on_all_files (s,c) xs
> run_on (a,b) file_name = do
> handle <- openFile file_name ReadMode
> (sum,count) <- for_each_line (a,b) handle
> hClose handle
> return ((sum,count))
> for_each_line (sum,count) handle = do
> l <- try (hGetLine handle)
> case l of
> Left err
> | isEOFError err -> return(sum,count)
> | otherwise -> ioError err
> Right line -> do
> let (nsum, ncount) = check_line line sum count
> for_each_line (nsum,ncount) handle
> check_line line sum count =
> let match = matchRegex regexp line
> in case match of
> Just strs -> (sum + read (head strs) :: Integer, count + 1)
> Nothing -> (sum, count)
> show_dir regmatch = do
> files <- getDirectoryContents "."
> let reg = mkRegex regmatch in
> return(filter (\file_name -> let fm = matchRegex reg file_name
> in case fm of
> Just strs -> True
> Nothing -> False) files)
> The point is this code works if there are just say a few files
> files to check. But it trashes my machine with around 1751 files.
> It sucks memory as wild and so it does not run as I think it should.
> I think I've overseen something which is bad written. Would you mind
> to tell me where I did "extraordinarily" bad.
> With best regards
> Haskell mailing list
> Haskell at haskell.org
More information about the Haskell