[Haskell-cafe] Bytestrings and [Char]

Nick Bowler nbowler at elliptictech.com
Tue Mar 23 15:31:33 EDT 2010


On 18:25 Tue 23 Mar, Iustin Pop wrote:
> On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
> > On 18:11 Tue 23 Mar, Iustin Pop wrote:
> > > I agree with the principle of correctness, but let's be honest - the
> > > difference between ByteString and String and Text is (many) orders of
> > > magnitude, not just a few percentage points…
> > > 
> > > I've been struggling with this problem too and it's not nice. Every time
> > > one uses the system readFile & friends (anything that doesn't read via
> > > ByteStrings), it's hellishly slow.
> > > 
> > > Test: read a file and compute its size in chars. The input text file is
> > > ~40MB in size and has one non-ASCII char. The test might seem stupid, but
> > > it is a simple one. ghc 6.12.1.
> > > 
> > > Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds,
> > > incorrect length (as expected).
> > > 
> > > Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11
> > > seconds, correct length.
> > > 
> > > Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
> > > 
> > > String (system readFile + length) - ~1 second, correct length.
> > 
> > Is this a mistake?  Your own report shows String & readFile being an
> > order of magnitude faster than everything else, contrary to your earlier
> > claim.
> 
> No, it's not a mistake. String is faster than packing to Text and taking the
> length, but it's 100 times slower than ByteString.

Only if you don't care about obtaining the correct answer, in which case
you may as well just say const 42 or some such, which is even faster.

> My whole point is that the difference between byte processing and char
> processing in Haskell is not a few percentage points, but an order of
> magnitude. I would really like to have only the 6x penalty that Python
> shows, for example.

Hang on a second... less than 10 milliseconds to read 40 megabytes from
disk?  Something's fishy here: no disk delivers 40MB in under 10ms, so
that file must have already been sitting in the page cache.

I ran my own tests with a 400M file (419430400 bytes) consisting almost
exclusively of the letter 'a', with two UTF-8-encoded Japanese
characters placed at every multiple of 40 megabytes.
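For concreteness, the programs under test look roughly like this; the
file name is illustrative and the timing is just done from the shell:

  import qualified Data.ByteString.Lazy as BL

  -- Prelude variant: decodes the input according to the locale and
  -- counts Chars, so it gets the right answer on UTF-8 input.
  charLength :: IO ()
  charLength = readFile "test.txt" >>= print . length

  -- Lazy ByteString variant: counts raw bytes, so each multi-byte
  -- character is overcounted.
  byteLength :: IO ()
  byteLength = BL.readFile "test.txt" >>= print . BL.length

  main :: IO ()
  main = charLength  -- swap in byteLength for the other measurement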

With Prelude.readFile/length and 5 runs, I see

  10145ms, 10087ms, 10223ms, 10321ms, 10216ms.

with approximately 10% of that time spent performing GC each run.

With Data.ByteString.Lazy.readFile/length and 5 runs, I see

  8223ms, 8192ms, 8077ms, 8091ms, 8174ms.

with approximately 20% of that time spent performing GC each run.
Maybe there are some magic command-line options to tune the GC for our
purposes, but I only managed to make things slower.  Thus, I'll handwave
a bit and just shave off the GC time from each result.
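For anyone repeating the tuning experiment: GHC executables accept
runtime options between +RTS and -RTS, where -s prints GC statistics
and -A sets the allocation area size.  The program name and flag value
here are illustrative:

  $ ./bench +RTS -s -A64m -RTS test.txt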

Prelude: 9178ms mean with a standard deviation of 159ms.
Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms.

Therefore, we managed a throughput of 43 MB/s with the Prelude (and got
the right answer), while we managed 61 MB/s with lazy ByteStrings (and
got the wrong answer).  My disk won't go much, if at all, faster than
the second result, so that's good.
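(Arithmetic: 419430400 bytes is 400 * 2^20, so 400 MB / 9.178 s comes
to about 43.6 MB/s, and 400 MB / 6.521 s to about 61.3 MB/s.)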

So that's a 30% reduction in throughput.  I'd say that's a lot worse
than a few percentage points, but certainly not orders of magnitude.

On the other hand, using Data.ByteString.Lazy.readFile and
Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with
approximately 5% of that time spent in GC, which is rather worse than
the Prelude.  Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are
even worse, with results of around 25 *seconds* (!!) and 2% of that time
spent in GC.
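In code, those two variants look roughly like the following; UTF8 is
Data.ByteString.Lazy.UTF8 from the utf8-string package, and the file
name is again illustrative:

  import qualified Data.ByteString.Lazy as BL
  import qualified Data.ByteString.Lazy.UTF8 as UTF8
  import qualified Data.Text.Lazy as TL
  import qualified Data.Text.Lazy.IO as TIO

  -- Count characters by decoding the raw bytes as UTF-8.
  utf8Length :: IO ()
  utf8Length = BL.readFile "test.txt" >>= print . UTF8.length

  -- Decode to lazy Text and count characters there.
  textLength :: IO ()
  textLength = TIO.readFile "test.txt" >>= print . TL.length

  main :: IO ()
  main = utf8Length  -- or textLength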

GNU wc computes the correct answer as quickly as lazy bytestrings
compute the wrong answer.  With perl 5.8, slurping the entire file as
UTF-8 computes the correct answer just as slowly as Prelude.  In my
first ever Python program (with python 2.6), I tried to read the entire
file as a unicode string and it quickly crashed due to running out of
memory (yikes!), so it earns a DNF.
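For reference, the character-counting mode of GNU wc, which respects
the locale (a UTF-8 locale is assumed here), is:

  $ wc -m test.txt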

So, for computing the right answer with this simple test, it looks like
the Prelude is the best option.  We tie with Perl and lose only to GNU
wc (which is written in C).  Really, though, it would be nice to close
that gap.

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

