[Haskell-cafe] Re: Need for speed: the Burrows-Wheeler Transform

Sat Jun 23 22:56:12 EDT 2007

Felipe Almeida Lessa wrote:
> On 6/23/07, Bertram Felgenhauer <bertram.felgenhauer at googlemail.com> wrote:
>> "rbwt" implements the corresponding inverse BWT. It's a fun knot tying
>> exercise.
>>
>> > rbwt xs = let
>> >     res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs 
>> res)
>> >   in
>> >     tail . map snd . zip xs $ head res
> 
> Indeed it's very fun =). Although I must admit that, even after
> staring at the code for some time, I still don't see how it works. Is
> it too much to ask for a brief explanation?
> 
> I see that zipWith' creates all the lazy list without requiring any
> knowledge of the second list, which means that the list to be sorted
> is created nicely. But when sortBy needs to compare, say, the first
> with the second items, it forces the thunk of the lazy list. But that
> thunk depends on the sorted list itself!
> 
> What am I missing? ;)
> 
> Thanks in advance,
> 

I just read and understood it.  I'll walk through my version:

> import Data.List
> 
> f `on` g = \x y -> (g x) `f` (g y)
> 
> zipWith' f []     _       = []
> zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
> 
> bwt = map head . sortBy (compare `on` tail) . init . tails . ('\0':)
> 
> rbwt xs = tail $ zipWith' (flip const) xs $ head res
>   where res = sortBy (compare `on` head) (zipWith' (:) xs res)

The key is that sort and sortBy are very very lazy.

'res' is a finite list of infinite strings.

Consider sorting such a finite list of infinite lists without
the recursive definition:

> example =
>   let a = [1,3..]
>       b = [1,2..]
>       c = [2,4..]
>       d = [1,4..]
>       sorted = sort [a,b,c,d]
>       firstFiveOfEach = map (take 5) sorted
>   in mapM_ print firstFiveOfEach

Which in ghci produces:

> *Main> example
> [1,2,3,4,5]
> [1,3,5,7,9]
> [1,4,7,10,13]
> [2,4,6,8,10]

Now for a sidetrack:

> example =
>   let a = [1,3..]
>       b = [1,2..]
>       c = [2,4..]
>       d = [1,4..]
>       sorted = sort [a,b,c,d,c]
>       firstFiveOfEach = map (take 5) sorted
>   in mapM_ print firstFiveOfEach

Produces the first three elements before 'c' ....

> *Main> example
> [1,2,3,4,5]
> [1,3,5,7,9]
> [1,4,7,10,13]

...and then hangs because "compare c c" never terminates.

This lazy sorting is also why "head . sort $ foo" is an O(N) way to get the 
minimum of a list "foo" in Haskell.

So the code for rbwt depends on this lazy sorting behavior.  The recursive 
nature of the definition is the difficult to (directly) invent part.

The result of (zipWith' (:) xs ___) is a list the same length as 'xs' and where 
each item is a list with the head item taken from 'xs'.  So it has "primed the 
pump" of the recursive definition by giving the first Char in each of the strings.

Compare with the much simpler 'ones' and 'odds':

ones = 1 : ones
odds = 1 : map (2+) odds

What about the tails of all these lists?  They come from res.  And res comes 
from sorting the results of zipWith'.  But the sorting is done only by looking 
at the head element, which is the one that comes from xs.  That is the "sortBy 
(\(a:as) (b:bs) -> a `compare` b)" or "sortBy (compare `on` head)".

So the sorting routine does not care about the ___ in (zipWith' (:) xs ___) at 
all, since it only examines the head which comes from xs.