[Haskell-cafe] Bytestrings and [Char]

Tue Mar 23 13:11:03 EDT 2010

On Tue, Mar 23, 2010 at 08:51:16AM -0700, John Millikin wrote:
> On Tue, Mar 23, 2010 at 00:27, Johann Höchtl <johann.hoechtl at gmail.com> wrote:
> > How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I
> > read bytestrings over a socket which happens to be UTF16-LE encoded and
> > identify a fitting function in Data.Text, I guess I have to transcode them
> > with Data.Text.Encoding to make the type System happy?
> >
> There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is
> just a more efficient encoding of [Word8], just as Text is a more
> efficient encoding of [Char]. If the file format you're parsing
> specifies that some series of bytes is text encoded as UTF16-LE, then
> you can use the Text decoders to convert to Text.
> 
> Poor separation between bytes and characters has caused problems in
> many major languages (C, C++, PHP, Ruby, Python) -- let's not abandon
> the advantages of correctness to chase a few percentage points of
> performance.
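
For reference, the decoding step described above can be sketched roughly
as follows. This is only an illustration, assuming the bytes are already
in a strict ByteString and are valid UTF-16LE; sampleBytes and decodeWire
are made-up names, and decodeUtf16LE throws an exception on malformed
input.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical payload: the two characters 'h' and 'é' encoded as
-- UTF-16LE, i.e. 4 bytes for 2 characters.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0x68, 0x00, 0xE9, 0x00]

-- Hypothetical helper: turn raw UTF-16LE bytes into Text.
-- Assumes well-formed input; decodeUtf16LE raises an exception otherwise.
decodeWire :: B.ByteString -> T.Text
decodeWire = TE.decodeUtf16LE
```

Here B.length sampleBytes counts bytes while T.length (decodeWire
sampleBytes) counts characters, which is exactly the bytes/characters
distinction being discussed.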

I agree with the principle of correctness, but let's be honest - the
gap between ByteString on one side and String and Text on the other is
(many) orders of magnitude, not just a few percentage points…

I've been struggling with this problem too and it's not nice. Every time
one uses the system readFile & friends (anything that doesn't read via
ByteStrings), it is hellishly slow.

Test: read a file and compute its size in chars. The input text file is
~40MB in size and has one non-ASCII char. The test might seem stupid, but
it is a simple one. Tested with ghc 6.12.1.

Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds,
incorrect length (as expected).

Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11
seconds, correct length.

Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.

String (system readFile + length) - ~1 second, correct length.

For the record:

python2.6 (str type) -  ~60ms, incorrect length.
python3.1 (unicode)  - ~310ms, correct length.

If anyone has a solution for fast text (unicode) transformations (but
not a 1:1 pipeline where fusion can work nicely), I'd be glad to hear
it.
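
One combination not measured above is reading the raw bytes with the
bytestring readFile and decoding them to Text afterwards, instead of
going through the system readFile and pack. A minimal sketch, assuming
the file is valid UTF-8 and fits in memory (countChars and
countCharsInFile are hypothetical names, and no timing is claimed for
it):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Count characters (not bytes) in UTF-8 encoded input.
-- Assumes valid UTF-8; decodeUtf8 raises an exception otherwise.
countChars :: B.ByteString -> Int
countChars = T.length . TE.decodeUtf8

-- Read the whole file strictly as bytes, then decode and count.
-- The strict read holds the entire file in memory at once.
countCharsInFile :: FilePath -> IO Int
countCharsInFile path = fmap countChars (B.readFile path)
```

This keeps the I/O on the ByteString side and only pays for decoding
once, but whether it helps for non-pipeline transformations would still
need measuring.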

iustin