[Haskell-cafe] The question of ByteString

Fri Nov 2 17:35:43 EDT 2007

Tim Chevalier wrote:
> On 11/2/07, Andrew Coppin <andrewcoppin at btinternet.com> wrote:
>   
>> Somewhat related to the discussions about Haskell's performance...
>>
>> String. ByteString. Do we really need both? Can one replace the other?
>>     
>
> You can't get rid of "String" because a String is just a [Char].
> Requiring the element type of a list to be "anything except Char"
> would be silly. In addition, it's useful to have a String that you can
> apply arbitrary list operations to when performance isn't a concern
> (i.e., most of the time). Finally, removing String would break
> existing code.
>
>   
>> Why is one faster? Can't we make *all* lists this fast? [insert further
>> variations here]
>>     
>
> ByteString takes advantage of the fact that the elements are, well,
> bytes. The operations are optimized for reading large amounts of text,
> but not necessarily for other applications. Lists are a parameterized
> type, so the elements of a list are pointers to arbitrary data. So
> that's why the same tricks as ByteString don't apply to general lists.
> That isn't to say that there aren't possible optimizations which
> haven't yet been dreamed of.
>   

Well OK, maybe I was a little vague. Let me be a bit more specific...

If you do text processing using ByteString rather than String, you get 
dramatically better performance in time and space. For me, this raises a 
number of questions:

1. Why do I have to type "ByteString" in my code? Why isn't the compiler 
automatically performing this optimisation for me? (I.e., is there some 
observable property that is changed? Currently the answer is yes: the 
ByteString interface only provides "trancated" Unicode characters. But, 
in principle, that could be changed.)

2. ByteString makes text strings faster. But what about other kinds of 
collections? Can't we do something similar to them that makes them go 
faster?

As I understand it, ByteString is faster due to several factors. First 
of all, it's stricter. Secondly, it's an "unboxed" structure (so you 
eliminate layers of indirection and there's less GC load). Third, it's 
implemented as an array that looks like a linked list. Given how 
ubiquitous lists are in Haskell, "array that looks like a linked list" 
sounds like one seriously useful data type! Yet ByteString seems to be 
the only implementation of this concept - and only for lists on unboxed 
bytes. (Not even unboxed Word16 or anything else, *only* Word8.) If I 
understand this correctly, a ByteString is actually a linked list of 
large array chunks. (This presumably yields fastER random access than a 
plain linked list?) Also, it seems to be possible to create a new array 
which is merely a subrange of an existing one, without any copying; the 
standard array API doesn't seem to provide this, yet it sounds damn useful.

These are the things I'm thinking about. Is there some deep theoretical 
reason why things are the way they are? Or is it merely that nobody has 
yet had time to make something better? ByteString solves the problem of 
text strings (and raw binary data) very nicely, it's just a pitty we 
can't apply some of that know-how more widely...