[Haskell-cafe] Space leak
Daniel Fischer
daniel.is.fischer at web.de
Sat Mar 13 21:16:58 EST 2010
Am Sonntag 14 März 2010 00:58:09 schrieb Arnoldo Muller:
> Jason,
>
> I am trying to use haskell in the analysis of bio data. One of the main
> reasons I wanted to use haskell is because lazy I/O allows you to see a
> large bio-sequence as if it was a string in memory.
> In order to achieve the same result in an imperative language I would
> have to write lots of error-prone iterators. I saw lazy I/O as a very
> strong point in favor of Haskell.
>
> Besides the space leaks that can occur and that are a bit difficult to
> find for a newbie like me, are there any other reasons to avoid Lazy
> I/O?
You may be happy to hear that the space leak you encountered had
__nothing whatsoever__ to do with lazy IO.
It's true that lazy IO offers some pitfalls for the unwary (and some, but
much fewer, for the wary), but I think the dangers of lazy IO tend to be
exaggerated.
For your application, readFile and appendFile are absolutely fine, the
space leak occurred in the pure code.
Below is a variant of your programme that doesn't use file-IO, the one
readFasta function has the space leak, the currently commented-out one not.
Compile with -O2, run with e.g.
./leak +RTS -s -M400M -RTS 3 10000000 10
one runs in constant space, the other not.
>
> Arnoldo.
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import Data.List
import System.Environment (getArgs)
data Chromosome = C1
| C2
| C3
| C4
| C5
| C6
| C7
| C8
| C9
| C10
| C11
| C12
| C13
| C14
| C15
| C16
| C17
| C18
| C19
| CX
| CY
| CMT
deriving (Show)
type Sequence = [Char]
data Window = Window { sequen :: Sequence,
chrom :: Chromosome,
pos :: Int
}
instance Show Window where
show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)
main = do
[ct, len, windowSize] <- getArgs
let wSize = (read windowSize)::Int
ln = read len
inData = [(cn,ln) | cn <- [1 .. read ct]]
mapM_ (uncurry $ genomeExecute filterWindow wSize) inData
countLines :: String -> Int
countLines = go 0
where
go !acc [] = acc
go !acc ('\n':cs) = go (acc+1) cs
go !acc (_:cs ) = go acc cs
genomeExecute :: (Window -> Bool) -> Int -> Int -> Int -> IO ()
genomeExecute flt wSize cn ln =
print . countLines $ fastaExtractor
("chromosome " ++ show cn ++ ",\n" ++ replicate (cn*ln) 'A')
wSize flt
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta
wSize input
filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l
{-
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
let (header,rest) = span (/= '\n') sequence
chr = parseChromosome header
go i (w:ws) = Window w chr i : go (i+1) ws
go _ [] = []
in go 0 $ slideWindow windowSize $ filter (/= '\n') rest
-}
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
let (header,rest) = span (/= '\n') sequence
chr = parseChromosome header
in map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow
windowSize $ filter ( '\n' /= ) rest
slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow
windowSize xs
parseChromosome :: [Char] -> Chromosome
parseChromosome line
| isInfixOf "chromosome 1," line = C1
| isInfixOf "chromosome 2," line = C2
| isInfixOf "chromosome 3," line = C3
| isInfixOf "chromosome 4," line = C4
| isInfixOf "chromosome 5," line = C5
| isInfixOf "chromosome 6," line = C6
| isInfixOf "chromosome 7," line = C7
| isInfixOf "chromosome 8," line = C9
| isInfixOf "chromosome 10," line = C10
| isInfixOf "chromosome 11," line = C11
| isInfixOf "chromosome 12," line = C12
| isInfixOf "chromosome 13," line = C13
| isInfixOf "chromosome 14," line = C14
| isInfixOf "chromosome 15," line = C15
| isInfixOf "chromosome 16," line = C16
| isInfixOf "chromosome 17" line = C17
| isInfixOf "chromosome 18" line = C18
| isInfixOf "chromosome 19" line = C19
| isInfixOf "chromosome X" line = CX
| isInfixOf "chromosome Y" line = CY
| isInfixOf "mitochondrion" line = CMT
| otherwise = error "BAD header"
----------------------------------------------------------------------
More information about the Haskell-Cafe
mailing list