[Haskell-cafe] Space leak

Thu Mar 11 09:48:08 EST 2010

Hi Arnoldo

This doesn't address the space leak, but your parseChromosome function
looks very inefficient - isInfixOf is repeatedly checking the prefix
"chromosome" for C1 to CY. If you have a lot of CY's in a file then it
will do a lot of work parsing them.

The cleanest way of handling this would be to use a parser combinator
library with keywords for "chromosome" and "mitochondrion" - however
that might add a performance penalty itself.

Here is a version that should be fairly efficient although a little
ugly due to how it has to match literal chars in prefix of the string:

Add a import for Data.Char to the import list:

> import Data.Char

Add Enum to the deriving clause of the Chromosome data type:

>                | C19
>                | CX
>                | CY
>                | CMT
>                  deriving (Show,Enum)

Replace parseChromosome with the one below.

Note that the derived Enum functions for Chromosome are indexed from
0.. whereas when you read one from the file it is indexed from 1.. so
you have to sub1 before using toEnum:

sub1 :: Int -> Int
sub1 x = x-1

parseChromosome :: [Char] -> Chromosome
parseChromosome ('c':'h':'r':'o':'m':'o':'s':'o':'m':'e':' ':xs) = chro xs
  where
   chro ('X' :_)              = CX
   chro ('Y' :_)              = CY
   chro ( x  : ',' :_)        | isDigit x  = toEnum (sub1 $ digitToInt x)
   chro ('1' :  x  : ',' :_ ) | isDigit x  = toEnum (sub1 $ (10+) $
digitToInt x)
   chro ('1' :  x  :_ )       | isDigit x  = toEnum (sub1 $ (10+) $
digitToInt x)
   chro _                     = error "BAD header"

parseChromosome ('m':'i':'t':'o':'c':'h':'o':'n':'d':'r':'i':'o':'n':_) = CMT
parseChromosome _                 = error "BAD header"

Best wishes

Stephen