[Haskell-cafe] Re: Bytestrings and [Char]

Tue Mar 23 18:01:05 EDT 2010

> If you read the source code, length does not read the data, which is
> why it is so fast. That cannot be done for UTF-8 strings.

I think at this point most of the amazement is directed at Data.Text
being slower than good old [Char] (at least for this operation; we
should probably look at more than one operation).

> Hey, normal String is way faster than GNU wc!

No - you need to perform a fair comparison.  Try "wc -m" to count only
characters (not lines and words too); note that "wc -c" counts bytes.
I'd provide numbers, but my wc doesn't seem to support UTF-8 and I'm
not sure which package contains a Unicode-aware wc.
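
To make the byte/character distinction concrete, here is a rough
sketch (not taken from any library; byteCount, charCount and sample
are names made up for illustration).  It assumes the input is valid
UTF-8 and counts every byte that is not a continuation byte, which
should roughly match what a UTF-8-aware wc reports for well-formed
input:

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Byte count: O(1) for a strict ByteString, because the length is
-- stored alongside the buffer and the data is never inspected.
byteCount :: B.ByteString -> Int
byteCount = B.length

-- Character count: every byte has to be inspected.  A byte of the
-- form 10xxxxxx is a continuation byte and does not start a new
-- character.  Assumption: the input is valid UTF-8 (no validation).
charCount :: B.ByteString -> Int
charCount = B.foldl' step 0
  where
    step :: Int -> Word8 -> Int
    step n w
      | w .&. 0xC0 == 0x80 = n      -- continuation byte
      | otherwise          = n + 1  -- ASCII byte or lead byte

-- "héllo" encoded as UTF-8: 'é' takes two bytes (0xC3 0xA9).
sample :: B.ByteString
sample = B.pack [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]
```

For sample, byteCount gives 6 while charCount gives 5, which is why
the two kinds of count should not be compared against each other.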

> readChar :: L.ByteString -> Maybe Int64
> readChar bs = do (c,_) <- L.uncons bs
>                 return (choose (fromEnum c))
>  where
>  choose :: Int -> Int64
>  choose c
>    | c < 0xc0  = 1
>    | c < 0xe0  = 2
>    | c < 0xf0  = 3
>    | c < 0xf8  = 4
>    | otherwise = 1
>
> inspired by Data.ByteString.Lazy.UTF8, same performance as GNU wc (it
> is cheating because it does not check the validity of the multibyte
> char).

Ah, interesting and a worthwhile cheat.
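
For anyone following along, this is how I'd guess the quoted readChar
gets used to count characters.  The countChars driver is my own
hypothetical addition, not something from the original post: it asks
readChar for the width of the next character and drops that many
bytes, without ever looking at the continuation bytes.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Width in bytes of the character at the head of the input, judged
-- from the lead byte alone (same logic as the quoted readChar).
readChar :: L.ByteString -> Maybe Int64
readChar bs = do
  (c, _) <- L.uncons bs
  return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c
      | c < 0xc0  = 1   -- ASCII, or a stray continuation byte
      | c < 0xe0  = 2
      | c < 0xf0  = 3
      | c < 0xf8  = 4
      | otherwise = 1   -- invalid lead byte, counted as one character

-- Hypothetical driver: count characters by skipping each one's width.
-- No validation is done, so malformed input yields a meaningless count.
countChars :: L.ByteString -> Int64
countChars = go 0
  where
    go n bs = n `seq` case readChar bs of
      Nothing -> n
      Just w  -> go (n + 1) (L.drop w bs)
```

The "cheat" is visible in go: a truncated or malformed sequence is
skipped just like a well-formed one, so the result is only meaningful
for input that is already known to be valid UTF-8.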

Thomas