[Haskell-cafe] Re: Bytestrings and [Char]
Thomas DuBuisson
thomas.dubuisson at gmail.com
Tue Mar 23 18:01:05 EDT 2010
> If you read the source code, length does not read the data, that's why
> it is so fast. It cannot be done for UTF-8 strings.
I think at this point most of the amazement is directed at Data.Text
being slower than good old [Char] (at least for this operation - we
should probably expand our view to more than one operation).
> Hey, normal string way faster than GNU wc!
No - you need to perform a fair comparison. Try "wc -m" to count only
characters (not lines and words too; note that "wc -c" counts bytes).
I'd provide numbers, but my wc doesn't seem to support UTF-8 and I'm not
sure which package contains a Unicode-aware wc.
> readChar :: L.ByteString -> Maybe Int64
> readChar bs = do (c,_) <- L.uncons bs
>                  return (choose (fromEnum c))
>   where
>     choose :: Int -> Int64
>     choose c
>       | c < 0xc0  = 1
>       | c < 0xe0  = 2
>       | c < 0xf0  = 3
>       | c < 0xf8  = 4
>       | otherwise = 1
>
> inspired by Data.ByteString.Lazy.UTF8, same performance as GNU wc (it
> is cheating because it does not check the validity of the multibyte char).
Ah, interesting and a worthwhile cheat.
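For anyone following along, here is a rough sketch of how the quoted
readChar could drive a character count. The countChars driver is my own
hypothetical addition, not something from the original post, and I am
assuming L is Data.ByteString.Lazy. Like the quoted code, it trusts the
lead byte and never validates continuation bytes.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Width in bytes of the UTF-8 sequence at the head of the input, judged
-- only from the lead byte (no validation), as in the quoted code.
readChar :: L.ByteString -> Maybe Int64
readChar bs = do
    (c, _) <- L.uncons bs
    return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c
      | c < 0xc0  = 1
      | c < 0xe0  = 2
      | c < 0xf0  = 3
      | c < 0xf8  = 4
      | otherwise = 1

-- Hypothetical driver (an assumption, not from the post): count
-- characters by hopping from one lead byte to the next.
countChars :: L.ByteString -> Int64
countChars = go 0
  where
    go n bs = n `seq` case readChar bs of
        Nothing -> n                          -- end of input
        Just w  -> go (n + 1) (L.drop w bs)   -- skip the whole sequence
```

On the six bytes 0x61 0xc3 0xa9 0xe2 0x82 0xac (an ASCII letter, a
two-byte sequence and a three-byte sequence) this counts 3 characters.
On malformed input it will silently miscount, which is the cheat being
discussed.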
Thomas