[Haskell-cafe] The question of ByteString

Sat Nov 3 05:34:40 EDT 2007

Duncan Coutts wrote:
> On Fri, 2007-11-02 at 21:35 +0000, Andrew Coppin wrote:
>
>   
>> Well OK, maybe I was a little vague. Let me be a bit more specific...
>>
>> If you do text processing using ByteString rather than String, you get 
>> dramatically better performance in time and space. For me, this raises a 
>> number of questions:
>>
>> 1. Why do I have to type "ByteString" in my code? Why isn't the compiler 
>> automatically performing this optimisation for me? (I.e., is there some 
>> observable property that is changed? 
>>     
>
> Yes, the semantics are different. ByteString is stricter. In some
> circumstances you could discover that some list is being used
> sufficiently strictly (spine and element strict) that you could do a
> representation change to use strict arrays. It is something I have
> pondered occasionally and I think that is an interesting avenue for
> research.
>   

OK. So I'm not the first person to wonder about this then... ;-)

> One approach might be to do a more sophisticated strictness analysis
>
> However this is likely to be quite fragile. I usually think that it's
> better to declare the strictness you want up front in one place

Yeah, I can imagine. And I guess then you'd be forever wondering "hey, 
did the compiler do that optimisation or not?" like with certain current 
optimisations now. (E.g., did that type get unboxed?)

>> Currently the answer is yes: the ByteString interface only provides
>> "trancated" Unicode characters. But, in principle, that could be
>> changed.)
>>     
>
> Indeed it could, we could provide a proper Unicode string type.
>   

Likely to happen in the near term? (Not that I imagine this is a huge 
issue for anybody...)

>> 2. ByteString makes text strings faster. But what about other kinds of 
>> collections? Can't we do something similar to them that makes them go 
>> faster?
>>     
>
> There is much less benefit for other collections since the overheads of
> generic structures are smaller for other types.
>
> Note that the NDP parallel arrays stuff uses type functions to calculate
> optimised data representations for arrays of types.
>   

Type... functions...? That sounds pretty scary. :-}

I was thinking more immediately about lists in general, but also perhaps 
binary trees and things. (Although it's probably hard to optimise a 
general binary tree; you would probably optimise a tree designed for a 
specific *purpose* instead...)

>> As I understand it, ByteString is faster due to several factors. First 
>> of all, it's stricter.
>>     
>
> So that's the semantic difference.
>
>   
>> Secondly, it's an "unboxed" structure. 
>>     
>
> Which is the representation optimisation allowed by the semantic change
> of making it stricter.
>   

Indeed.
>> Third, it's implemented as an array that looks like a linked list.
>> Given how ubiquitous lists are in Haskell, "array that looks like a
>> linked list" sounds like one seriously useful data type! Yet
>> ByteString seems to be the only implementation of this concept - and
>> only for lists on unboxed bytes. (Not even unboxed Word16 or anything
>> else, *only* Word8.) If I understand this correctly, a ByteString is
>> actually a linked list of large array chunks. (This presumably yields
>> fastER random access than a plain linked list?) Also, it seems to be
>> possible to create a new array which is merely a subrange of an
>> existing one, without any copying; the standard array API doesn't seem
>> to provide this, yet it sounds damn useful.
>>     
>
> I think the NDP project should get us most of this stuff actually.
>   

Cool.

So in summary, these kinds of things are "on the way", they're just not 
here quite yet?

(BTW, anybody have any clue what's happening with stream fusion? I 
remember reading the paper saying "hey, this is how it works and it's 
cool and we're going to try to replace the whole list library with new 
stream implementations", but that's the last I heard...)