[Haskell-cafe] Re: Bytestrings and [Char]
TeXitoi
texitoi at texitoi.eu
Tue Mar 23 17:42:09 EDT 2010
Iustin Pop <iusty at k1024.org> writes:
> On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
>
> > So that's a 30% reduction in throughput. I'd say that's a lot worse
> > than a few percentage points, but certainly not orders of magnitude.
>
> Because you're possibly benchmarking the disk also. With a 400MB
> file on tmpfs, lazy bytestring readfile + length takes on my machine
> ~150ms, which is way faster than 8 seconds…
If you read the source code, length do not read the data, that's why
it is so fast. It cannot be done for UTF-8 strings.
>From Data.ByteString.Lazy:
-- | /O(n\/c)/ 'length' returns the length of a ByteString as an
-- | 'Int64'
length :: ByteString -> Int64
length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs
{-# INLINE length #-}
> > On the other hand, using Data.ByteString.Lazy.readFile and
> > Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with
> > approximately 5% of that time spent in GC, which is rather worse than
> > the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are
> > even worse, with results of around 25 *seconds* (!!) and 2% of that time
> > spent in GC.
> >
> > GNU wc computes the correct answer as quickly as lazy bytestrings
> > compute the wrong answer. With perl 5.8, slurping the entire file as
> > UTF-8 computes the correct answer just as slowly as Prelude. In my
> > first ever Python program (with python 2.6), I tried to read the entire
> > file as a unicode string and it quickly crashes due to running out of
> > memory (yikes!), so it earns a DNF.
> >
> > So, for computing the right answer with this simple test, it looks like
> > the Prelude is the best option. We tie with Perl and lose only to GNU
> > wc (which is written in C). Really, though, it would be nice to close
> > that gap.
>
> Totally agreed :)
texitoi at flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt
Normal String + System.IO "60452700": 5.575169s
Data.ByteString.Lazy "61965200": 0.088136s
Data.ByteString.Lazy.UTF8 "60452700": 13.953714s
Cheating a little bit "60452700": 9.307322s
Data.Text.Lazy "60452700": 15.608354s
texitoi at flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt
1329900 8065200 61965200 /dev/shm/haskell-utf8.txt
real 0m9.303s
user 0m9.089s
sys 0m0.152s
texitoi at flyeeeng:~$
Hey, normal string way faster than GNU wc!
Cheat sheet, using Data.ByteString.Lazy:
myLength :: U.ByteString -> Int
myLength b = loop 0 b
where loop n xs = case readChar xs of
Just m ->
let n' = n+1
in n' `seq` loop n' (L.drop m xs)
Nothing -> n
readChar :: L.ByteString -> Maybe Int64
readChar bs = do (c,_) <- L.uncons bs
return (choose (fromEnum c))
where
choose :: Int -> Int64
choose c
| c < 0xc0 = 1
| c < 0xe0 = 2
| c < 0xf0 = 3
| c < 0xf8 = 4
| otherwise = 1
inspired by Data.ByteString.Lazy.UTF8, same performances as GNU wc (it
is cheating because it do not check the validity of the multibyte char).
Using Debian testing, ghc 6.12.1 on Atom N270 @ 1.6GHz. The file is a
repeated LaTeX UTF8 file of about 60MB.
--
Guillaume Pinot http://www.irccyn.ec-nantes.fr/~pinot/
« Les grandes personnes ne comprennent jamais rien toutes seules, et
c'est fatigant, pour les enfants, de toujours leur donner des
explications... » -- Antoine de Saint-Exupéry, Le Petit Prince
() ASCII ribbon campaign -- Against HTML e-mail
/\ http://www.asciiribbon.org -- Against proprietary attachments
More information about the Haskell-Cafe
mailing list