[Haskell-cafe] Space leak

Daniel Fischer daniel.is.fischer at web.de
Sat Mar 13 21:16:58 EST 2010


Am Sonntag 14 März 2010 00:58:09 schrieb Arnoldo Muller:
> Jason,
>
> I am trying to use haskell in the analysis of bio data. One of the main
> reasons I wanted to use haskell is because lazy I/O allows you to see a
> large bio-sequence as if it was a string in memory.
> In order to achieve the same result in an imperative language I would
> have to write lots of error-prone iterators. I saw lazy I/O as a very
> strong point in favor of Haskell.
>
> Besides the space leaks that can occur and that are a bit difficult to
> find for a newbie like me, are there any other reasons to avoid Lazy
> I/O?

You may be happy to hear that the space leak you encountered had 
__nothing whatsoever__ to do with lazy IO.

It's true that lazy IO offers some pitfalls for the unwary (and some, but 
much fewer, for the wary), but I think the dangers of lazy IO tend to be 
exaggerated.
For your application, readFile and appendFile are absolutely fine, the 
space leak occurred in the pure code.

Below is a variant of your programme that doesn't use file-IO, the one 
readFasta function has the space leak, the currently commented-out one not.
Compile with -O2, run with e.g.

./leak +RTS -s -M400M -RTS 3 10000000 10

one runs in constant space, the other not.

>
> Arnoldo.

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List
import System.Environment (getArgs)

data Chromosome =    C1
                | C2
                | C3
                | C4
                | C5
                | C6
                | C7
                | C8
                | C9
                | C10
                | C11
                | C12
                | C13
                | C14
                | C15
                | C16
                | C17
                | C18
                | C19
                | CX
                | CY
                | CMT
                  deriving (Show)

type Sequence = [Char]

data Window = Window { sequen :: Sequence,
                       chrom :: Chromosome,
                       pos   :: Int
                     }

instance Show Window where
    show w =  (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

main = do
       [ct, len, windowSize] <- getArgs
       let wSize = (read windowSize)::Int
           ln = read len
           inData = [(cn,ln) | cn <- [1 .. read ct]]
       mapM_ (uncurry $ genomeExecute filterWindow wSize)  inData

countLines :: String -> Int
countLines = go 0
      where
        go !acc [] = acc
        go !acc ('\n':cs) = go (acc+1) cs
        go !acc (_:cs ) = go acc cs

genomeExecute :: (Window -> Bool) -> Int -> Int -> Int -> IO ()
genomeExecute flt wSize cn ln =
    print . countLines $ fastaExtractor 
            ("chromosome " ++ show cn ++ ",\n" ++ replicate (cn*ln) 'A') 
                 wSize flt

fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta 
 wSize input

filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))

printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

{-
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
        go i (w:ws) = Window w chr i : go (i+1) ws
        go _ [] = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest
-}

readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
    in map (\(i, w) -> Window w chr i) $ zip [0..]  $ slideWindow 
windowSize  $ filter ( '\n' /= ) rest


slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs)  = take windowSize l : slideWindow 
windowSize xs

parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17"  line = C17
    | isInfixOf "chromosome 18"  line = C18
    | isInfixOf "chromosome 19"  line = C19
    | isInfixOf "chromosome X"   line = CX
    | isInfixOf "chromosome Y"   line = CY
    | isInfixOf "mitochondrion"  line = CMT
    | otherwise = error "BAD header"
----------------------------------------------------------------------


More information about the Haskell-Cafe mailing list