[Haskell-cafe] Need for speed: the Burrows-Wheeler Transform
Andrew Coppin
andrewcoppin at btinternet.com
Fri Jun 22 16:26:40 EDT 2007
OK, so I *started* writing a request for help, but... well I answered my
own question! See the bottom...
I was reading Wikipedia, and I found this:
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
I decided to sit down and see what that looks like in Haskell. I came up
with this:
module BWT where
import Data.List
rotate1 :: [x] -> [x]
rotate1 [] = []
rotate1 xs = last xs : init xs
rotate :: [x] -> [[x]]
rotate xs = take (length xs) (iterate rotate1 xs)
bwt :: (Ord x) => [x] -> [x]
bwt = map last . sort . rotate
step :: (Ord x) => [x] -> [[x]] -> [[x]]
step xs = zipWith (:) xs . sort
inv_bwt :: (Ord x) => x -> [x] -> [x]
inv_bwt mark xs =
head . filter (\xs -> head xs == mark) .
head . drop ((length xs) - 1) . iterate (step xs) .
map (\x -> [x]) $ xs
My my, isn't that SO much shorter? I love Haskell! :-D
Unfortunately, the resident C++ expert fails to grasp the concept of
"example code", and insists on comparing the efficiency of this program
to the C one on the website.
Fact is, he's translated the presented C into C++, and it can apparently
transform a 145 KB file in 8 seconds using only 3 MB of RAM. The code
above, however, took about 11 seconds to transform 4 KB of text, and
that required about 60 MB of RAM. (I tried larger, but the OS killed the
process for comsuming too much RAM.)
Well anyway, the code was written for simplicity, not efficiency. I've
tried to explain this, but apparently that is beyond his comprehension.
So anyway, it looks like we have a race on. >:-D
The first thing I did was the optimisation mentioned on Wikipedia: you
don't *need* to build a list of lists. You can just throw pointers
around. So I arrived at this:
module BWT2 (bwt) where
import Data.List
rotate :: Int -> [x] -> Int -> [x]
rotate l xs n = (drop (l-n) xs) ++ (take (l-n) xs)
bwt xs =
let l = length xs
ys = rotate l xs
in map (last . rotate l xs) $
sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)]
This is indeed *much* faster. With this, I can transform 52 KB of text
in 9 minutes + 60 MB RAM. The previous version seemed to have quadratic
memory usage, whereas this one seems to be linear. 52 KB would have
taken many months with the first version!
Still, 9 minutes (for a file 3 times smaller) is nowhere near 8 seconds.
So we must try harder... For my next trick, ByteStrings! (Never used
them before BTW... this is my first try!)
module BWT3 (bwt) where
import Data.List
import qualified Data.ByteString as Raw
rotate :: Int -> Raw.ByteString -> Int -> Raw.ByteString
rotate l xs n = (Raw.drop (l-n) xs) `Raw.append` (Raw.take (l-n) xs)
bwt xs =
let l = Raw.length xs
ys = rotate l xs
in Raw.pack $
map (Raw.last . rotate l xs) $
sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)]
Now I can transform 52 KB in 54 seconds + 30 MB RAM. Still nowhere near
C++, but a big improvement none the less.
Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM!
Vast speed increases... Jeepers, I can transform 52 KB so fast I can't
even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2
seconds + 43 MB RAM. Ouch.
More information about the Haskell-Cafe
mailing list