[Haskell-cafe] Space leak

Daniel Fischer daniel.is.fischer at web.de
Wed Mar 10 16:40:11 EST 2010


Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller:
> Hello,
>
> I am learning haskell and I found a space leak that I find difficult to
> solve. I've been asking at #haskell but we could not solve
> the issue.
>
> I want to lazily read a set of 22 files of about 200MB each, filter them
> and then I want to output the result into a unique file.
> If I modify the main function to work only with one input file,  the
> program runs without issues. I will call this version A.
> Version B  uses a mapM_ to iterate over a list of filenames and uses
> appendFile to output the result of filtering each file.
> In this case the memory usage grows sharply and quickly (profiles show
> constant memory growth). In less than a minute, memory
> occupation will make my system hang with swapping.

No work is been done until the end, when all is tried to be done 
simultaneously. Make sure genomeExecute ... input1 has actually finished 
its work before genomeExecute ... input2 starts etc.

One way is to use a stricter version of sequence_,

sequence'_ :: Monad m => [m a] -> m ()
sequence'_ (x:xs) = do
    a <- x
    a `seq` sequence'_ xs
sequence'_ [] = return ()

(nicer with BangPatterns, but not portable), and

mapM'_ f = sequence'_ . map f

Another option is making genomeExecute itself stricter.

>
> This is version B:
>
> ------------------------------- Program B
> ------------------------------------------------------------------------
>-------------------------------------------- import Data.List
> import System.Environment
> import System.Directory
> import Control.Monad
>
>
> -- different types of chromosomes
> data Chromosome =    C1
>
>                 | C2
>                 | C3
>                 | C4
>                 | C5
>                 | C6
>                 | C7
>                 | C8
>                 | C9
>                 | C10
>                 | C11
>                 | C12
>                 | C13
>                 | C14
>                 | C15
>                 | C16
>                 | C17
>                 | C18
>                 | C19
>                 | CX
>                 | CY
>                 | CMT
>
>                   deriving (Show)
> -- define a window
> type Sequence = [Char]
> -- Window data
> data Window = Window { sequen :: Sequence,
>                        chrom :: Chromosome,
>                        pos   :: Int
>                      }
> -- print a window
> instance Show Window where
>     show w =  (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos
> w)
>
> -- Reading fasta files with haskell
>
> -- Initialize the
> main = do
>        -- get the arguments (intput is
>        [input, output, windowSize] <- getArgs
>        -- get directory contents (only names)
>        names <- getDirectoryContents input
>        -- prepend directory
>        let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++
> x) names
>        let wSize = (read windowSize)::Int
>        -- process the directories
>        mapM (genomeExecute output wSize filterWindow)  fullNames
>
>
> -- read the files one by one and write them to the output file
> genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
> genomeExecute  outputFile windowSize f inputFile = do
>   fileData <- readFile inputFile
>   appendFile outputFile $ fastaExtractor fileData windowSize f




More information about the Haskell-Cafe mailing list