[Haskell-beginners] State Transformer Confusion

Daniel Fischer daniel.is.fischer at web.de
Wed Feb 24 23:02:39 EST 2010


Auto-replying again :)

On Thursday 25 February 2010 04:26:22, Daniel Fischer wrote:
> On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
> > On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
> > > Hello Haskell-Beginners,
> > >
> > > I'm having a fair amount of difficulty understanding the ST monad.
> > > In my basic experimentation I'm finding that I'm getting more heap
> > > allocated with STUArray than with a regular Array, which is not what
> > > I would expect.
> >
> > And ordinarily you don't. The problem is that the STUArray suffers
> > very badly from profiling. Compiled without -prof, the STUArray code
> > allocates about half as much as the UArray code and takes ~42% of the
> > time. With profiling, STUArray allocates ~40% more and takes ~50%
> > longer than UArray.
>
> Oh, and: in the UArray code you specified the index type as Int, but in
> the STUArray code you didn't, so it defaulted to Integer, which hurts
> performance significantly. Changing sumInPlace1 to
>
> sumInPlace1 :: [Int] -> Int
> sumInPlace1 xs = (! 0) . runSTUArray $ do
>         a <- newArray (0 :: Int, 0) 0
>         forM_ xs $ \x -> do
>             x' <- (readArray a 0)
>             writeArray a 0 (x' + x)
>         return a
>
> makes the STUArray code allocate ~6% of what the UArray code allocates
> and run in ~8% of the time, because now we get a pretty nice loop
> involving only unboxed Ints. If we then replace readArray and writeArray
> with unsafeRead and unsafeWrite, we see that most of the time was spent
> on bounds checking: the STUArray loop now becomes really compact and
> runs a good three times faster, so it's now over forty times faster
> than the UArray version (for which using unsafeAt instead of (!) doesn't
> make a noticeable difference).
>
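
For readers without the original post: a plain UArray analogue of the same
single-cell sum can be written like the sketch below. The name sumViaUArray
and the accumArray-based construction are only an illustration, not Chris's
actual code from the thread.

import Data.Array.Unboxed (UArray, accumArray, (!))

-- Accumulate the whole list into a one-element immutable unboxed array.
sumViaUArray :: [Int] -> Int
sumViaUArray xs = a ! 0
  where
    a :: UArray Int Int
    a = accumArray (+) 0 (0, 0) [(0, x) | x <- xs]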

To illustrate:

With

import Control.Monad (forM_)
import Data.Array.Base (unsafeAt, unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)

sumInPlace1 :: [Int] -> Int
sumInPlace1 xs = (`unsafeAt` 0) $ runSTUArray $ do
        -- single-cell mutable accumulator living at index 0
        a <- newArray (0 :: Int, 0) 0
        forM_ xs $ \x -> do
            x' <- unsafeRead a 0
            unsafeWrite a 0 (x' + x)
        return a

, the non-profiling code compiles to a tight loop with constant allocation 
(*no* allocation for the enumFromTo; everything runs in registers until the 
end). Beautiful and fast. With profiling enabled, GHC can't optimise as 
much, and the result is that the profiled code takes (for a limit of 100 
million) 85 times as long as the non-profiled code. Ouch!
For the UArray code, the profiling version takes only twice as long as the 
non-profiling version.
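
For concreteness, the timings above assume a driver that does nothing but 
sum [1 .. limit]; the exact main isn't shown in the thread, so the details 
in this minimal sketch (the bare print, the hard-coded limit) are mine:

main :: IO ()
main = print (sumInPlace1 [1 .. 100000000])

Building that once with -O2 and once with -O2 -prof gives the two 
executables whose run times are being compared.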

> > Two things:
> > - profiling interacts badly with some optimisations, so what the
> > profiling output says may deviate significantly from what your -O2
> > compiled production code actually does
> > - some datatypes suffer more from profiling than others, so what the
> > profiling output says for different choices of datatype may deviate
> > significantly from how your -O2 compiled production code behaves
> >
> > Normally, -prof doesn't change the behaviour very much, but sometimes
> > it does.
> >
> > > One additional point of confusion for me: when I run either function
> > > with +RTS -hc and use hp2ps I get an empty graph. I've seen these
> > > tools work quite well before, so I suspect I'm doing something wrong
> > > now.
> >
> > Too little actual heap usage and too short a running time. Change the
> > upper limit to 10 million and you should get a graph with actual
> > bands.
> >
> > > Thanks,
> > > Chris
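
For anyone reproducing the heap graph: the workflow below is a sketch with 
assumed names (Main.hs, the default executable name), not Chris's actual 
setup.

-- Build:  ghc --make -O2 -prof -auto-all Main.hs
--         (newer GHCs: -fprof-auto instead of -auto-all, and add -rtsopts)
-- Run:    ./Main +RTS -hc     -- writes a heap profile to Main.hp
-- Plot:   hp2ps -c Main.hp    -- renders it as Main.ps
main :: IO ()
main = print (sumInPlace1 [1 .. 10000000])  -- or whichever variant is profiled

With the upper limit at 10 million, as suggested above, the run allocates 
enough and lasts long enough for the graph to show actual bands.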

