[Haskell-cafe] Re: Bytestrings and [Char]

TeXitoi texitoi at texitoi.eu
Tue Mar 23 17:42:09 EDT 2010


Iustin Pop <iusty at k1024.org> writes:

> On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
>
> > So that's a 30% reduction in throughput.  I'd say that's a lot worse
> > than a few percentage points, but certainly not orders of magnitude.
> 
> Because you're possibly benchmarking the disk also. With a 400MB
> file on tmpfs, lazy bytestring readfile + length takes on my machine
> ~150ms, which is way faster than 8 seconds…

If you read the source code, you see that length never looks at the
bytes themselves, only at the chunk lengths; that is why it is so
fast. The same trick cannot work for UTF-8 strings, where every byte
has to be examined.

From Data.ByteString.Lazy:

-- | /O(n\/c)/ 'length' returns the length of a ByteString as an 'Int64'
length :: ByteString -> Int64
length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs
{-# INLINE length #-}
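
For concreteness, a minimal sketch of the difference (the file name is
a placeholder, and Data.ByteString.Lazy.UTF8 comes from the
utf8-string package):

import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as U

main :: IO ()
main = do
  bytes <- L.readFile "haskell-utf8.txt"   -- placeholder path
  print (L.length bytes)    -- byte count: only sums the chunk sizes, O(n/c)
  chars <- L.readFile "haskell-utf8.txt"   -- re-read so the chunks are not retained
  print (U.length chars)    -- character count: has to decode every byte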


> > On the other hand, using Data.ByteString.Lazy.readFile and
> > Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with
> > approximately 5% of that time spent in GC, which is rather worse than
> > the Prelude.  Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are
> > even worse, with results of around 25 *seconds* (!!) and 2% of that time
> > spent in GC.
> > 
> > GNU wc computes the correct answer as quickly as lazy bytestrings
> > compute the wrong answer.  With perl 5.8, slurping the entire file as
> > UTF-8 computes the correct answer just as slowly as Prelude.  In my
> > first ever Python program (with python 2.6), I tried to read the entire
> > file as a unicode string and it quickly crashes due to running out of
> > memory (yikes!), so it earns a DNF.
> > 
> > So, for computing the right answer with this simple test, it looks like
> > the Prelude is the best option.  We tie with Perl and lose only to GNU
> > wc (which is written in C).  Really, though, it would be nice to close
> > that gap.
> 
> Totally agreed :)
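
For reference, the Data.Text.Lazy measurement discussed above (and
repeated in the numbers below) presumably boils down to something like
this sketch; the file name is a placeholder and it is not the exact
benchmark code:

import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
  txt <- TIO.readFile "haskell-utf8.txt"   -- reads and decodes lazily
  print (T.length txt)                     -- counts Chars, so every byte is decoded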

texitoi@flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt
Normal String + System.IO "60452700": 5.575169s
Data.ByteString.Lazy "61965200": 0.088136s
Data.ByteString.Lazy.UTF8 "60452700": 13.953714s
Cheating a little bit "60452700": 9.307322s
Data.Text.Lazy "60452700": 15.608354s
texitoi@flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt
 1329900  8065200 61965200 /dev/shm/haskell-utf8.txt

real    0m9.303s
user    0m9.089s
sys     0m0.152s
texitoi@flyeeeng:~$

Hey, the normal String is way faster than GNU wc!
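
The "Normal String + System.IO" line above corresponds to something
like the following sketch (the path and the explicit utf8 encoding are
assumptions about the harness, which is not shown here):

import System.IO

main :: IO ()
main = do
  h <- openFile "/dev/shm/haskell-utf8.txt" ReadMode
  hSetEncoding h utf8      -- decode the file as UTF-8 into [Char]
  s <- hGetContents h
  print (length s)         -- Prelude.length walks the whole list of Chars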

The "Cheating a little bit" version, using Data.ByteString.Lazy:

import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Count UTF-8 characters by looking only at each leading byte and
-- skipping the continuation bytes it announces.
myLength :: L.ByteString -> Int
myLength b = loop 0 b
  where loop n xs = case readChar xs of
                      Just m ->
                          let n' = n+1
                          in n' `seq` loop n' (L.drop m xs)
                      Nothing -> n

-- Byte length of the character starting at the head of the string
-- (Nothing at end of input); the sequence is not validated.
readChar :: L.ByteString -> Maybe Int64
readChar bs = do (c,_) <- L.uncons bs
                 return (choose (fromEnum c))
  where
  choose :: Int -> Int64
  choose c
    | c < 0xc0  = 1   -- ASCII (or a stray continuation byte)
    | c < 0xe0  = 2   -- 2-byte sequence
    | c < 0xf0  = 3   -- 3-byte sequence
    | c < 0xf8  = 4   -- 4-byte sequence
    | otherwise = 1

It is inspired by Data.ByteString.Lazy.UTF8 and gets the same
performance as GNU wc (it is cheating because it does not check the
validity of the multibyte characters).
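
A possible driver for it, assuming the time package for the timing
(the actual wc-utf8 harness is not shown here):

import qualified Data.ByteString.Lazy as L
import Data.Time.Clock (getCurrentTime, diffUTCTime)

main :: IO ()
main = do
  t0 <- getCurrentTime
  bs <- L.readFile "/dev/shm/haskell-utf8.txt"
  n  <- return $! myLength bs    -- force the count before stopping the clock
  t1 <- getCurrentTime
  putStrLn (show n ++ " in " ++ show (diffUTCTime t1 t0))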

Using Debian testing, ghc 6.12.1 on an Atom N270 @ 1.6GHz. The test
file is a UTF-8 LaTeX file repeated until it reaches about 60MB.

-- 
Guillaume Pinot               http://www.irccyn.ec-nantes.fr/~pinot/

"Grown-ups never understand anything by themselves, and it is
tiresome for children to be always and forever explaining things to
them..." -- Antoine de Saint-Exupéry, Le Petit Prince

()  ASCII ribbon campaign      -- Against HTML e-mail
/\  http://www.asciiribbon.org -- Against proprietary attachments


