[Haskell-cafe] Haskell Speed

Chris Kuklewicz chris at mightyreason.com
Sat Dec 24 06:52:51 EST 2005


Greg Buchholz wrote:
>
>     True.  But there are some tests like "fasta" that appear to have a
> laziness induced space leak that presumably could be fixed.   
> 
> http://shootout.alioth.debian.org/benchmark.php?test=fasta&lang=all
> 

Playing with the space-leaking code did not show any obvious reason to
expect the space leak.  By commenting out the third output in main it
did not leak space.

Could some look at the original code at the link above and let me know
why there is a 300 MB space leak?

But if you never express huge lists, then it won't leak space.  Here is
a tweaked version, called tweak6.hs, that interleaves making and writing
a line of output.  It runs in 2.16 MB of RSIZE instead of over 300 MB.

Still slow, however.  If you want more speed without extra libraries,
then use an unboxed array of Word8 (length of one line: 60 bytes), since
Unicode Char and linked lists are overkill.

The other killer is probably the array of tuple linear lookup to find
the chosen symbol.  Hand coded if/then branches could be used to speed
that up.  For even faster speed, do not use 0.0 to 1.0 probabilities at
all.  Instead of normalize :: Int -> Double and then lookup (Double ->
Base) remove the use of Double and do (Int -> Base) lookup.

But the space leak was the embarrassing part.  So I just fixed that.

-- 
Chris Kuklewicz
---------------

-------------- next part --------------
-- The Great Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Jeff Newbern
-- Modified to tweak6.hs by Chris Kuklewicz

import Control.Monad.Trans
import Control.Monad.State
import System.IO (hFlush,stdout)

{- original
$ time ./fasta.ghc_run 2500000 > /dev/null 

real    1m17.596s
user    0m44.838s
sys     0m3.322s

330+ Megs! Space leaking all over the place.  No idea why.

$ time ./fasta.tweak6 2500000 > /dev/null 

real    0m30.477s
user    0m23.356s
sys     0m0.744s

2.2 Megs RSIZE
-}

-- Uses random generation code derived from Simon Marlow and Einar Karttunen's
-- "random" test entry.

-- Orignal version note: This code has not been optimized *at all*.
-- It is written to be clear and to follow standard Haskell idioms as
-- much as possible (but we have to match the stateful PRNG idiom in
-- the test definition, which is oriented toward an imperative
-- language).  Performance is decent with ghc -O2, but if it can be
-- improved without sacrificing the clarity of the code, by all means
-- go for it!

-- Mondified tweak6.hs version note: Use a StateT around IO to manage
-- the seed.  This makes interleaving the generation and output of the
-- sequence easier.  It is only *minimally* abstracted.

import System(getArgs)

type Base = Char
type Sequence = [Base]

alu :: Sequence  -- predefined sequence
alu = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGG" ++
      "GAGGCCGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCGAGA" ++
      "CCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAAT" ++
      "ACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCA" ++
      "GCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG" ++
      "AGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCC" ++
      "AGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAA"

type BaseFrequency = (Base,Double)

iub :: [BaseFrequency]
iub = [ ('a', 0.27), ('c', 0.12), ('g', 0.12), ('t', 0.27),
        ('B', 0.02), ('D', 0.02), ('H', 0.02), ('K', 0.02), ('M', 0.02), ('N', 0.02),
        ('R', 0.02), ('S', 0.02), ('V', 0.02), ('W', 0.02), ('Y', 0.02) ]

homosapiens :: [BaseFrequency]
homosapiens = [ ('a', 0.3029549426680), ('c', 0.1979883004921),
                ('g', 0.1975473066391), ('t', 0.3015094502008) ]

-- select a base whose interval covers the given double
chooseBase :: [BaseFrequency] -> Double -> Base
chooseBase [(b,_)]    _ = b
chooseBase ((b,f):xs) p | p < f     = b
                        | otherwise = chooseBase xs (p-f)

type Seed = Int
type Pseudo a = StateT Seed IO a

prng :: Pseudo Double
prng = let nextSeed s = (s * ia + ic) `mod` im
           normalize n = (fromIntegral n) * (1.0 / fromIntegral im)
           im = 139968; ia = 3877; ic = 29573
       in do seed <- get
             let seed' = nextSeed seed
             put seed'
             return (normalize seed')

prngN count = sequence $ replicate count prng

-- write a sequence in Fasta format
writeFasta :: String -> String -> Sequence -> IO ()
writeFasta label title sequence =
  do putStrLn $ ">" ++ (label ++ (" " ++ title))
     writeWrapped 60 sequence
  where writeWrapped _   []  = do return ()
        writeWrapped len str = do let (s1,s2) = splitAt len str
                                  putStrLn s1
                                  writeWrapped len s2

writeFastaHeader :: (MonadIO m) => String -> String -> m ()
writeFastaHeader label title = liftIO $ putStrLn $ ">" ++ (label ++ (" " ++ title))

writeWrapped' :: Int -> Int -> (Double->Base) -> Pseudo ()
writeWrapped' wrap total trans = 
    let work c = case c of
                   0 -> return ()
                   n -> do let c' = min wrap n
                               nextC = c - c'
                           s <- liftM (map trans) (prngN c')
                           liftIO $ putStrLn s
                           work nextC
    in work total

writeWrapped = writeWrapped' 60

main = do args <- getArgs
          let n = if (null args) then 1000 else read (head args)
          writeFasta "ONE" "Homo sapiens alu" (take (2*n) (cycle alu))
          writeFastaHeader "TWO" "IUB ambiguity codes"
          seed' <- execStateT (writeWrapped (3*n) (chooseBase iub))  42
          writeFastaHeader "THREE" "Homo sapiens frequency"
          execStateT (writeWrapped (5*n) (chooseBase homosapiens)) seed'


More information about the Haskell-Cafe mailing list