[Haskell-cafe] Re: Friday Afternoon Play: The Burrows Wheeler Transformation

Mon Nov 20 11:34:24 EST 2006

Hello Jón,

sadly, nobody wants to play with you. Most likely, it's because the next
move is too hard: for a faster forward transform, you need (something as
tricky as) suffix trees. See also

     Robert Giegerich, Stefan Kurtz:
    "A Comparison of Imperative and Purely Functional Suffix Tree
Constructions"
    http://citeseer.ist.psu.edu/giegerich95comparison.html

Regards,
apfelmus

Jón Fairbairn wrote:
> So, it's Friday afternoon (well, here anyway), and
> presumably everyone is thinking of leaving for the
> weekend. Is anyone interested in playing a little game?  I'm
> going to make the first move, and then just wait and see if
> anyone makes the next. I'm certainly not promising to make
> any further moves myself, but I'm not promising I won't,
> either.
> 
> Below you'll see a naïve implementation of the Burrows
> Wheeler transformation and its inverse.  That's my
> move. (And don't complain too much about style and such --
> I'm half asleep.)  Further moves are made by transforming
> the programme so that it becomes more efficient.  The first
> target is not raw speed, but to adjust it until the
> computational complexity gets to be what it should be.  I'm
> hoping for elegant moves here.  After that, folk might start
> replacing the lists with more efficient datastructures and
> so on, but still /elegantly/.
> 
> The object of the game isn't ultimate speed -- I've no
> particular wish to see a Haskell version that's as fast as a
> good C version, especially not if it ends up /looking/ like
> a C version.  Really the objective is to think about Haskell
> programming -- and probably come up with some observations
> such as "there really ought to be a class of indexable
> structures, so that we don't have to change (!!) to (!) all
> over the place". And to have fun, anyhow...
> 
> * * *
> 
> Demonstration of the principle of the Burrows Wheeler
> transform. The implementations are simplistic (and have
> worse complexity than is possible).
> 
>> module Burrows_Wheeler_Transform where
>> import List
> 
> This is surely the wrong module to have to find (><) from:
> 
>> import Data.Graph.Inductive.Query.Monad((><))
> 
> The forward transformation. Input a list, output the
> transformed list and possibly an int (Nothing if there were
> no elements in the input -- perhaps the result type should
> be Maybe (NonEmptyList t,Int))
> 
>> slow_bwt:: Ord t => [t] -> ([t], Maybe Int)
> 
> This version works by computing a list of all the rotations
> of the input and sorting it. The transformation is then the
> last element of each line of the sorted rotations. Tag the
> data with an index so that it's possible to find where the
> original form of the unput ends up after sorting. This is
> essentially the specification of the transform coded up
> directly.
> 
>> slow_bwt l
>>     = (map fst tagged_bwt, index_of_original)
>>       where tagged_bwt = map (last><id)
>>                          $ sort
>>                          $ rotations l`zip`[0..]
>>             index_of_original = findIndex ((==0).snd) tagged_bwt
> 
> that looks like worst case n²×log n to me (where n is the
> length of the string): if every element is the same, the
> n×log n comparisons the sort does will each look at every
> element, and that's really more than is necessary.
> 
> 
> 
> The inverse transform. We should have that 
> 
> @ 
> slow_inv_bwt . slow_bwt == id
> @
> 
> Observe that sorting the input gives the first column of the
> sorted rotations (and the input is the last). So after one
> sort we have
> 
> X    ...   c
> Y    ...   b
> Z    ...   a
> ..     .    .
> ..     .    .
> ..     .    .
> 
> 
> Since they are rotations, rotating each row of this gives
> pairs that belong to the original list:
> 
> cX   ...
> by   ...
> aZ   ...
> ..     .
> ..     .
> ..     .
> 
> sorting these gives us the first two columns of the sorted
> rotations, and again we already know the last column, which
> can be rotated into the first and so on.
> 
>> slow_inv_bwt ([],_) = []
>> slow_inv_bwt (l,Just index_of_original)
>>     = iterate catsort (replicate len [])
> 
> After (length input) iterations we have the original sorted
> rotations,
> 
>>       !! len
> 
> and we have the index of where the untransformed string is,
> so we can just pick it back out.
> 
>>       !! index_of_original
> 
>>       where catsort s = sort $ map (uncurry (:)) $ l `zip`s
>>             len = length l
> 
> 
> This version does rather fewer sorts:
> 
>> mark2_inv_bwt ([],_) = []
>> mark2_inv_bwt (l,Just n)
>>     = (transpose $ take (length l) $ iterate (permuteWith perm) l)!!(perm!!n)
>>       where perm = map snd $ sort $ l `zip` [0..]
> 
> Should I simply have left it out and hoped that it would
> turn up after a few moves? I'm not going to explain it,
> anyway.
> 
>> rotations l = init $ zipWith (++) (tails l) (inits l)
> 
>> permuteWith ns l = map (l!!) ns