[Haskell-cafe] Fun with ByteStrings [was: A very edgy language]

Sun Jul 8 07:07:53 EDT 2007

Donald Bruce Stewart wrote:
> andrewcoppin:
>   
>> Does anybody have any clue why ByteStrings are actually faster? (And why 
>> this information isn't easily findable anywhere - must shorly be a VFAQ.)
>>
>>     
>
> It's well documented in the API documentation for bytestrings.
>
> Start here:
>     http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString-Lazy.html
>   

I've read the API and still left wondering...

> And then read:
>     http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
>   

Now *that* is far more useful... (And interesting.)

So what you're saying is that whereas "fusion" is usually used to mean 
"that amazing technology that will one day supply us with unlimited 
amounts of cheap clean energy [shame it doesn't actually work]", in the 
context of Haskell it seems "fusion" means "that technology that makes 
all your code 98% faster for free"? ;-)

I guess the question that's really burning in my mind is "if ByteString 
is so much faster than [x], why can't you just do the same optimisations 
to [x]?" In other words, "why should I need to alter my code to get all 
this fusion goodness?"

Now, as I understand it, a ByteString is a kind of unboxed array (= big 
RAM savings + big CPU time savings for not building it + big GC savings 
for not processing millions of list nodes + better cache performance). 
Or at least, a *strict* ByteString is; I'm very very fuzzy on exactly 
how a *lazy* ByteString is any different to a normal list. From my 
reading today, I take it the only real difference is that one is a 
linked list, whereas the other is a (boxed?) array, so it's smaller. (?)

At any rate, currently all my toy compression algorithms run at 
respectable speeds using [Word8], *except* for the BWT, which is 
absurdly slow. I've done everything I can think of to it, and it's still 
too slow. It's no use, I'm going to have to use ByteStrings to get any 
kind of performance out of it. I'm just wondering whether using 
getContents :: [Char] and then packing that into a ByteString is going 
to completely negate all the speed advantages. (I'm really not keen to 
completely mangle the entire toolbox just to make this one algorithm 
hurry up.)

Also, while I'm here... I can see a sorting algorithm implemented in 
Data.List, but I don't see any for other structures. For example, there 
doesn't appear to be any sorting functions for any kind of array. There 
doesn't appear to be anything for ByteString either. I'd like to see 
such a thing in the libraries. Yes, you can sort things using (say) 
Data.Map. But what if you have some data that's already *in* an array or 
a ByteString and you just want to sort it? (I also notice that the 
mutable arrays don't seem to provide an in-place map function. Obviously 
that's only ever going to work for a function that doesn't change the 
value's type, but even so...)

Finally, I don't see anything in the libraries that would efficiently 
sort (large) strings. Data.List.sort and Data.Map.Map both use an Ord 
context, and we have instance (Ord x) => Ord [x], but one would think 
that a [potentially large] list of values could be sorted more 
efficiently using a radix sort than a quicksort...? An implementation of 
Data.Map especially for the (common?) case of the keys being a *list* of 
Ord items would seem useful... (And in my current program, I'm probably 
going to implement a radix sort on lists of ByteStrings.)