[Haskell-cafe] Space leak
Arnoldo Muller
arnoldomuller at gmail.com
Wed Mar 10 17:03:40 EST 2010
Hello Justin,
I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?
I do remember that there was a particular option in which the leak would
dissapear (for the same amount of work) and that is why I stopped using the
profiler.
Thanks,
Arnoldo
On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey <jgbailey at gmail.com> wrote:
> Have you use the profiling tools available with GHC?
>
> http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html
>
>
> On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
> <arnoldomuller at gmail.com> wrote:
> > Hello,
> >
> > I am learning haskell and I found a space leak that I find difficult to
> > solve. I've been asking at #haskell but we could not solve
> > the issue.
> >
> > I want to lazily read a set of 22 files of about 200MB each, filter them
> and
> > then I want to output the result into a unique file.
> > If I modify the main function to work only with one input file, the
> program
> > runs without issues. I will call this version A.
> > Version B uses a mapM_ to iterate over a list of filenames and uses
> > appendFile to output the result of filtering each file.
> > In this case the memory usage grows sharply and quickly (profiles show
> > constant memory growth). In less than a minute, memory
> > occupation will make my system hang with swapping.
> >
> > This is version B:
> >
> > ------------------------------- Program B
> >
> --------------------------------------------------------------------------------------------------------------------
> > import Data.List
> > import System.Environment
> > import System.Directory
> > import Control.Monad
> >
> >
> > -- different types of chromosomes
> > data Chromosome = C1
> > | C2
> > | C3
> > | C4
> > | C5
> > | C6
> > | C7
> > | C8
> > | C9
> > | C10
> > | C11
> > | C12
> > | C13
> > | C14
> > | C15
> > | C16
> > | C17
> > | C18
> > | C19
> > | CX
> > | CY
> > | CMT
> > deriving (Show)
> > -- define a window
> > type Sequence = [Char]
> > -- Window data
> > data Window = Window { sequen :: Sequence,
> > chrom :: Chromosome,
> > pos :: Int
> > }
> > -- print a window
> > instance Show Window where
> > show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos
> w)
> >
> > -- Reading fasta files with haskell
> >
> > -- Initialize the
> > main = do
> > -- get the arguments (intput is
> > [input, output, windowSize] <- getArgs
> > -- get directory contents (only names)
> > names <- getDirectoryContents input
> > -- prepend directory
> > let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x)
> > names
> > let wSize = (read windowSize)::Int
> > -- process the directories
> > mapM (genomeExecute output wSize filterWindow) fullNames
> >
> >
> > -- read the files one by one and write them to the output file
> > genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
> > genomeExecute outputFile windowSize f inputFile = do
> > fileData <- readFile inputFile
> > appendFile outputFile $ fastaExtractor fileData windowSize f
> >
> > --
> > isFastaFile :: String -> Bool
> > isFastaFile fileName = isSuffixOf ".fa" fileName
> >
> >
> > -- fasta extractor (receives a Fasta String and returns a windowed string
> > ready to be sorted)
> > -- an example on how to compose several functions to parse a fasta file
> > fastaExtractor :: String -> Int -> (Window -> Bool) -> String
> > fastaExtractor input wSize f = printWindowList $ filter f $ readFasta
> wSize
> > input
> >
> > -- MAIN FILTER that removes N elements from the strings!
> > filterWindow :: Window -> Bool
> > filterWindow w = not (elem 'N' (sequen w))
> >
> > -- print a window list (the printing makes it ready for output as raw
> data)
> > printWindowList :: [Window] -> String
> > printWindowList l = unlines $ map show l
> >
> > -- read fasta, remove stuff that is not useful from it
> > -- removes the
> > readFasta :: Int -> [Char] -> [Window]
> > readFasta windowSize sequence =
> > -- get the header
> > let (header:rest) = lines sequence
> > chr = parseChromosome header
> > in
> >
> > -- We now do the following:
> > -- take window create counter
> > remove newlines
> > map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize
> $
> > filter ( '\n' /= ) $ unlines rest
> >
> >
> > slideWindow :: Int -> [Char] -> [[Char]]
> > slideWindow _ [] = []
> > slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow
> > windowSize xs
> >
> >
> >
> > -- Parse the chromosome from a fasta comment
> > -- produce a more compact chromosome representation
> > parseChromosome :: [Char] -> Chromosome
> > parseChromosome line
> > | isInfixOf "chromosome 1," line = C1
> > | isInfixOf "chromosome 2," line = C2
> > | isInfixOf "chromosome 3," line = C3
> > | isInfixOf "chromosome 4," line = C4
> > | isInfixOf "chromosome 5," line = C5
> > | isInfixOf "chromosome 6," line = C6
> > | isInfixOf "chromosome 7," line = C7
> > | isInfixOf "chromosome 8," line = C9
> > | isInfixOf "chromosome 10," line = C10
> > | isInfixOf "chromosome 11," line = C11
> > | isInfixOf "chromosome 12," line = C12
> > | isInfixOf "chromosome 13," line = C13
> > | isInfixOf "chromosome 14," line = C14
> > | isInfixOf "chromosome 15," line = C15
> > | isInfixOf "chromosome 16," line = C16
> > | isInfixOf "chromosome 17" line = C17
> > | isInfixOf "chromosome 18" line = C18
> > | isInfixOf "chromosome 19" line = C19
> > | isInfixOf "chromosome X" line = CX
> > | isInfixOf "chromosome Y" line = CY
> > | isInfixOf "mitochondrion" line = CMT
> > | otherwise = error "BAD header"
> >
> >
> > -------------------------------- End of program B
> >
> ------------------------------------------------------------------------------------------------
> >
> > -------------------------------- Program A
> >
> ---------------------------------------------------------------------------------------------------------
> > If instead of the main function given above I use the following main
> > function to process only one input file, things work OK for even
> > the largest files. Memory usage remains constant in this case.
> >
> > main = do
> > -- get the arguments
> > [input, output, windowSize] <- getArgs
> > -- keep the input stream
> > inpStr <- readFile input
> > let wSize = (read windowSize)::Int
> > writeFile output $ fastaExtractor inpStr wSize filterWindow
> >
> >
> > It is not easy for me to see why is Haskell keeping data in memory. Do
> you
> > have any idea why program B is
> > not working?
> >
> > Thank you for your help!
> >
> > Arnoldo Muller
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100310/aa8a41c3/attachment.html
More information about the Haskell-Cafe
mailing list