[Haskell-beginners] ICFP 2007

Wed Aug 27 22:05:22 EDT 2008

Am Donnerstag, 28. August 2008 01:46 schrieb Rafael Gustavo da Cunha Pereira 
Pinto:
> First of all, sorry asking too many questions!

That's what this list is for :)

>
> I am using the ICFP 2007 (http://save-endo.cs.uu.nl/) problem as a case for
> learning Haskell.
>
> I started with the prototype, listed below
>
>
> =================Main.hs=======================
> module Main(main) where
>
> import qualified Data.ByteString.Lazy.Char8 as L
> import qualified Data.ByteString.Char8 as S
>
> import Data.ByteString.Search.BoyerMoore as BM
> import Debug.Trace
>
> data DNA=DNA {prefix :: L.ByteString, suffix :: L.ByteString}
>
>
> main::IO ()
> main=do
>     d<-L.readFile "endo.dna"
>     let dna=DNA (L.pack "IIIIIIIIIIICCICCICCICFFIIIIIIIIIIIIIIIICFP")  d
>     let pat=S.pack "ICFP"
>     print $ take 1 $ matchSL pat $ L.append (prefix dna) (suffix dna)
> =================EOF Main.hs===================
>
>
> My question is:
>
> Knowing that the searched pattern is in DNA prefix, will "suffix dna" ever
> be called here?
> I mean, will the "append" suspension ever be fully evaluated?

No. That's one nice thing about lazy ByteStrings. You only use as many chunks 
as you really need.
L.append is lazy enough to survive even L.append (prefix dna) undefined. 
Unfortunately our BM searching algorithm isn't, an optimisation for the case 
of a lazy ByteString consisting of a single chunk requires that the second 
argument of L.append is reduced to WHNF (KMP doesn't), so it needs to read 
one chunk from the file. Hence in this case (suffix dna) is called, but only 
one chunk of it is used.

You can verify this by running your programme with the option "+RTS -sstderr", 
which will print out some time and allocation stats. I did:
dafis at linux:~/Documents/haskell/move> ghc -O2 --make endo
[1 of 1] Compiling Main             ( endo.hs, endo.o )
Linking endo ...
dafis at linux:~/Documents/haskell/move> ./endo +RTS -sstderr
./endo +RTS -sstderr
[38]
     89,388 bytes allocated in the heap
        520 bytes copied during GC (scavenged)
          0 bytes copied during GC (not scavenged)
     57,344 bytes maximum residency (1 sample(s))
<snip 0.00 times>

Searching in L.append (prefix dna) (L.append (L.pack "u") (suffix dna)) gives 
the figures
     56,604 bytes allocated in the heap
        528 bytes copied during GC (scavenged)
          8 bytes copied during GC (not scavenged)
     57,344 bytes maximum residency (1 sample(s))
, so then no chunk (~32K on my system) needs to be read from the file.

>
>
> I am thinking on using a data type like
>
> data DNA=DNA {prefix::L.ByteString, worked::L.ByteString,
> suffix::ByteString}
>
> The invariant is that worked should be L.empty whenever prefix is L.empty.
>
>
> Thanks in advance

Cheers,
Daniel