[Haskell-cafe] Bytestrings and [Char]

Joachim Breitner mail at joachim-breitner.de
Tue Mar 23 12:53:12 EDT 2010


Hi,

On Tuesday, 23.03.2010, at 08:51 -0700, John Millikin wrote:
> On Tue, Mar 23, 2010 at 00:27, Johann Höchtl <johann.hoechtl at gmail.com> wrote:
> > How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I
> > read bytestrings over a socket which happens to be UTF16-LE encoded and
> > identify a fitting function in Data.Text, I guess I have to transcode them
> > with Data.Text.Encoding to make the type System happy?
> >
> There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is
> just a more efficient encoding of [Word8], just as Text is a more
> efficient encoding of [Char]. If the file format you're parsing
> specifies that some series of bytes is text encoded as UTF16-LE, then
> you can use the Text decoders to convert to Text.

It would still be useful to have an alternative to Data.Text that
internally stores strings as UTF8 encoded bytestrings. I tried to switch
from String to Data.Text in arbtt (which mostly calls pcre-light, which
expects and returns UTF8-encoded C-strings), and it became slower! No
surprise, considering that the program has to re-encode the strings all
the time.
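
To make that round trip concrete, here is a minimal sketch using
Data.Text.Encoding from the text package. The names fromSocket and
forRegexLibrary are hypothetical, chosen only for illustration; they are
not part of arbtt or pcre-light.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode bytes that are known to be UTF-16LE (for example, read from a
-- socket) into Text. decodeUtf16LE throws an exception on invalid input.
-- (fromSocket is a hypothetical name.)
fromSocket :: B.ByteString -> T.Text
fromSocket = TE.decodeUtf16LE

-- Re-encode Text as UTF-8 bytes for a C library that expects UTF-8.
-- This conversion is the kind of step that has to run on every call
-- into such a library. (forRegexLibrary is a hypothetical name.)
forRegexLibrary :: T.Text -> B.ByteString
forRegexLibrary = TE.encodeUtf8
```

Every string passed to the C side goes through forRegexLibrary, and every
result coming back has to be decoded again.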

Using a
> newtype Text = Text ByteString
with an interface akin to Data.Text, but using UTF8-encoded ByteStrings
internally gave the same performance as String, at half the memory
footprint. This is in an internal module¹ but I would find it handy to
have this available as a common type in a well-supported library.
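
A rough sketch of what such a type could look like follows. This is not
the code from the module linked below; the type name Utf8Text and all
function names are assumptions made for illustration, and the encoding
work is delegated to Data.Text.Encoding.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical wrapper. Invariant: the bytes are always valid UTF-8.
newtype Utf8Text = Utf8Text B.ByteString
  deriving (Eq, Ord)

-- Encode a String to UTF-8 once, at the boundary.
pack :: String -> Utf8Text
pack = Utf8Text . TE.encodeUtf8 . T.pack

-- Decode back to a String, e.g. for display.
unpack :: Utf8Text -> String
unpack (Utf8Text b) = T.unpack (TE.decodeUtf8 b)

-- Hand the raw UTF-8 bytes to a library that expects them.
-- No re-encoding happens here.
toUtf8 :: Utf8Text -> B.ByteString
toUtf8 (Utf8Text b) = b

-- Wrap bytes returned by such a library, rejecting invalid UTF-8.
fromUtf8 :: B.ByteString -> Maybe Utf8Text
fromUtf8 b = case TE.decodeUtf8' b of
  Left _  -> Nothing
  Right _ -> Just (Utf8Text b)

-- Concatenating two valid UTF-8 sequences is again valid UTF-8,
-- so this can work on the bytes directly.
append :: Utf8Text -> Utf8Text -> Utf8Text
append (Utf8Text a) (Utf8Text b) = Utf8Text (B.append a b)

-- Number of code points: count every byte that is not a UTF-8
-- continuation byte (bit pattern 10xxxxxx).
lengthChars :: Utf8Text -> Int
lengthChars (Utf8Text b) = B.length (B.filter isLeadByte b)
  where
    isLeadByte w = (w .&. 0xC0) /= 0x80
```

With this layout, a call into a UTF8-based C binding only needs toUtf8
and fromUtf8, so the per-call transcoding goes away, apart from the
validity check in fromUtf8.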

Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs


-- 
Joachim "nomeata" Breitner
  mail: mail at joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C
  JID: nomeata at joachim-breitner.de | http://www.joachim-breitner.de/
  Debian Developer: nomeata at debian.org