[Haskell-cafe] Space leak

Arnoldo Muller arnoldomuller at gmail.com
Wed Mar 10 17:03:40 EST 2010


Hello Justin,

I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?

I do remember that there was a particular option in which the leak would
dissapear (for the same amount of work) and that is why I stopped using the
profiler.

Thanks,

Arnoldo


On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey <jgbailey at gmail.com> wrote:

> Have you use the profiling tools available with GHC?
>
>  http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html
>
>
> On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
> <arnoldomuller at gmail.com> wrote:
> > Hello,
> >
> > I am learning haskell and I found a space leak that I find difficult to
> > solve. I've been asking at #haskell but we could not solve
> > the issue.
> >
> > I want to lazily read a set of 22 files of about 200MB each, filter them
> and
> > then I want to output the result into a unique file.
> > If I modify the main function to work only with one input file,  the
> program
> > runs without issues. I will call this version A.
> > Version B  uses a mapM_ to iterate over a list of filenames and uses
> > appendFile to output the result of filtering each file.
> > In this case the memory usage grows sharply and quickly (profiles show
> > constant memory growth). In less than a minute, memory
> > occupation will make my system hang with swapping.
> >
> > This is version B:
> >
> > ------------------------------- Program B
> >
> --------------------------------------------------------------------------------------------------------------------
> > import Data.List
> > import System.Environment
> > import System.Directory
> > import Control.Monad
> >
> >
> > -- different types of chromosomes
> > data Chromosome =    C1
> >                 | C2
> >                 | C3
> >                 | C4
> >                 | C5
> >                 | C6
> >                 | C7
> >                 | C8
> >                 | C9
> >                 | C10
> >                 | C11
> >                 | C12
> >                 | C13
> >                 | C14
> >                 | C15
> >                 | C16
> >                 | C17
> >                 | C18
> >                 | C19
> >                 | CX
> >                 | CY
> >                 | CMT
> >                   deriving (Show)
> > -- define a window
> > type Sequence = [Char]
> > -- Window data
> > data Window = Window { sequen :: Sequence,
> >                        chrom :: Chromosome,
> >                        pos   :: Int
> >                      }
> > -- print a window
> > instance Show Window where
> >     show w =  (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos
> w)
> >
> > -- Reading fasta files with haskell
> >
> > -- Initialize the
> > main = do
> >        -- get the arguments (intput is
> >        [input, output, windowSize] <- getArgs
> >        -- get directory contents (only names)
> >        names <- getDirectoryContents input
> >        -- prepend directory
> >        let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x)
> > names
> >        let wSize = (read windowSize)::Int
> >        -- process the directories
> >        mapM (genomeExecute output wSize filterWindow)  fullNames
> >
> >
> > -- read the files one by one and write them to the output file
> > genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
> > genomeExecute  outputFile windowSize f inputFile = do
> >   fileData <- readFile inputFile
> >   appendFile outputFile $ fastaExtractor fileData windowSize f
> >
> > --
> > isFastaFile :: String -> Bool
> > isFastaFile fileName = isSuffixOf ".fa" fileName
> >
> >
> > -- fasta extractor (receives a Fasta String and returns a windowed string
> > ready to be sorted)
> > -- an example on how to compose several functions to parse a fasta file
> > fastaExtractor :: String -> Int -> (Window -> Bool) -> String
> > fastaExtractor input wSize f = printWindowList $ filter f $ readFasta
> wSize
> > input
> >
> > -- MAIN FILTER that removes N elements from the strings!
> > filterWindow :: Window -> Bool
> > filterWindow w = not (elem 'N' (sequen w))
> >
> > -- print a window list (the printing makes it ready for output as raw
> data)
> > printWindowList :: [Window] -> String
> > printWindowList l = unlines $ map show l
> >
> > -- read fasta, remove stuff that is not useful from it
> > -- removes the
> > readFasta :: Int -> [Char] -> [Window]
> > readFasta windowSize sequence =
> >     -- get the header
> >     let (header:rest) = lines sequence
> >         chr = parseChromosome header
> >         in
> >
> > -- We now do the following:
> > --      take window                  create counter
> > remove newlines
> >    map (\(i, w) -> Window w chr i) $ zip [0..]  $ slideWindow windowSize
> $
> > filter ( '\n' /= )  $ unlines rest
> >
> >
> > slideWindow :: Int -> [Char] -> [[Char]]
> > slideWindow _ [] = []
> > slideWindow windowSize l@(_:xs)  = take windowSize l : slideWindow
> > windowSize xs
> >
> >
> >
> > -- Parse the chromosome from a fasta comment
> > -- produce a more compact chromosome representation
> > parseChromosome :: [Char] -> Chromosome
> > parseChromosome line
> >     | isInfixOf "chromosome 1," line = C1
> >     | isInfixOf "chromosome 2," line = C2
> >     | isInfixOf "chromosome 3," line = C3
> >     | isInfixOf "chromosome 4," line = C4
> >     | isInfixOf "chromosome 5," line = C5
> >     | isInfixOf "chromosome 6," line = C6
> >     | isInfixOf "chromosome 7," line = C7
> >     | isInfixOf "chromosome 8," line = C9
> >     | isInfixOf "chromosome 10," line = C10
> >     | isInfixOf "chromosome 11," line = C11
> >     | isInfixOf "chromosome 12," line = C12
> >     | isInfixOf "chromosome 13," line = C13
> >     | isInfixOf "chromosome 14," line = C14
> >     | isInfixOf "chromosome 15," line = C15
> >     | isInfixOf "chromosome 16," line = C16
> >     | isInfixOf "chromosome 17"  line = C17
> >     | isInfixOf "chromosome 18"  line = C18
> >     | isInfixOf "chromosome 19"  line = C19
> >     | isInfixOf "chromosome X"   line = CX
> >     | isInfixOf "chromosome Y"   line = CY
> >     | isInfixOf "mitochondrion"  line = CMT
> >     | otherwise = error "BAD header"
> >
> >
> > -------------------------------- End of program B
> >
> ------------------------------------------------------------------------------------------------
> >
> > -------------------------------- Program A
> >
> ---------------------------------------------------------------------------------------------------------
> > If instead of the main function given above I use the following main
> > function to process only one input file, things work OK for even
> > the largest files. Memory usage remains constant in this case.
> >
> > main = do
> >        -- get the arguments
> >        [input, output, windowSize] <- getArgs
> >        -- keep the input stream
> >        inpStr <- readFile input
> >        let wSize = (read windowSize)::Int
> >        writeFile output $ fastaExtractor inpStr wSize filterWindow
> >
> >
> > It is not easy for me to see why is Haskell keeping data in memory. Do
> you
> > have any idea why  program B is
> > not working?
> >
> > Thank you for your help!
> >
> > Arnoldo Muller
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100310/aa8a41c3/attachment.html


More information about the Haskell-Cafe mailing list