[Haskell-cafe] Bytestrings and [Char]
iusty at k1024.org
Tue Mar 23 15:50:47 EDT 2010
On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
> On 18:25 Tue 23 Mar , Iustin Pop wrote:
> > On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
> > > On 18:11 Tue 23 Mar , Iustin Pop wrote:
> > > > I agree with the principle of correctness, but let's be honest - it's
> > > > (many) orders of magnitude between ByteString and String and Text, not
> > > > just a few percentage points…
> > > >
> > > > I've been struggling with this problem too and it's not nice. Every time
> > > > one uses the system readFile & friends (anything that doesn't read via
> > > > ByteStrings), it's hellishly slow.
> > > >
> > > > Test: read a file and compute its size in chars. Input text file is
> > > > ~40MB in size, has one non-ASCII char. The test might seem stupid but it
> > > > is a simple one. ghc 6.12.1.
> > > >
> > > > Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds,
> > > > incorrect length (as expected).
> > > >
> > > > Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11
> > > > seconds, correct length.
> > > >
> > > > Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
> > > >
> > > > String (system readFile + length) - ~1 second, correct length.
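The two cheapest of the measurements above can be sketched as follows; this is a minimal illustration, not the original benchmark code, and the file name in `main` is an assumption.

```haskell
-- Sketch of two of the measurements above: byte count via lazy
-- ByteString vs Char count via String. Only boot packages are used.
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

-- Return (byte count, Char count) for a file.
countBoth :: FilePath -> IO (Int64, Int)
countBoth path = do
  bytes <- BL.readFile path   -- raw bytes, no decoding
  s     <- readFile path      -- decoded via the locale encoding
  return (BL.length bytes, length s)

main :: IO ()
main = countBoth "input.txt" >>= print
```

For a pure-ASCII file the two numbers agree; any multi-byte character makes the byte count exceed the Char count, which is exactly the "incorrect length" noted for plain ByteString above.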
> > >
> > > Is this a mistake? Your own report shows String & readFile being an
> > > order of magnitude faster than everything else, contrary to your earlier
> > > claim.
> > No, it's not a mistake. String is faster than packing to Text and taking the
> > length, but it's 100 times slower than ByteString.
> Only if you don't care about obtaining the correct answer, in which case
> you may as well just say const 42 or somesuch, which is even faster.
> > My whole point is that the difference between byte processing and char
> > processing in Haskell is not a few percentage points, but orders of magnitude.
> > I would really like to have only the 6x penalty that Python shows, for example.
> Hang on a second... less than 10 milliseconds to read 40 megabytes from
> disk? Something's fishy here.
Of course I don't want to benchmark the disk, and therefore the source file is
fully cached.
> I ran my own tests with a 400M file (419430400 bytes) consisting almost
> exclusively of the letter 'a' with two Japanese characters placed at
> every multiple of 40 megabytes (UTF-8 encoded).
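A test input like the one described above could be generated with something along these lines; the function name, file name, and the (scaled-down) sizes are illustrative assumptions, and '日' stands in for the Japanese characters.

```haskell
-- Sketch: write runs of 'a' with a multi-byte character at fixed
-- intervals, UTF-8 encoded. Sizes are scaled down for illustration.
import System.IO

writeTestFile :: FilePath -> Int -> Int -> IO ()
writeTestFile path blockSize nBlocks = do
  h <- openFile path WriteMode
  hSetEncoding h utf8                              -- force UTF-8 output
  let block = replicate blockSize 'a' ++ "\x65e5"  -- '日' (3 bytes in UTF-8)
  mapM_ (hPutStr h) (replicate nBlocks block)
  hClose h

main :: IO ()
main = writeTestFile "test-input.txt" (40 * 1024) 10
</imports>
```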
> With Prelude.readFile/length and 5 runs, I see
> 10145ms, 10087ms, 10223ms, 10321ms, 10216ms.
> with approximately 10% of that time spent performing GC each run.
> With Data.ByteString.Lazy.readFile/length and 5 runs, I see
> 8223ms, 8192ms, 8077ms, 8091ms, 8174ms.
> with approximately 20% of that time spent performing GC each run.
> Maybe there's some magic command line options to tune the GC for our
> purposes, but I only managed to make things slower. Thus, I'll handwave
> a bit and just shave off the GC time from each result.
> Prelude: 9178ms mean with a standard deviation of 159ms.
> Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms.
> Therefore, we managed a throughput of 43 MB/s with the Prelude (and got
> the right answer), while we managed 61 MB/s with lazy ByteStrings (and
> got the wrong answer). My disk won't go much, if at all, faster than
> the second result, so that's good.
I'll bet that for a 400MB file, if you have more than 2GB of RAM, most of it
will be cached. If you want to check Haskell performance, just copy it to a
tmpfs filesystem so that the disk is out of the equation.
> So that's a 30% reduction in throughput. I'd say that's a lot worse
> than a few percentage points, but certainly not orders of magnitude.
Because you're possibly benchmarking the disk as well. With a 400MB file on
tmpfs, lazy ByteString readFile + length takes ~150ms on my machine, which is
far faster than 8 seconds…
> On the other hand, using Data.ByteString.Lazy.readFile and
> Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with
> approximately 5% of that time spent in GC, which is rather worse than
> the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are
> even worse, with results of around 25 *seconds* (!!) and 2% of that time
> spent in GC.
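The two slower correct-answer variants measured above can be sketched as below; this assumes the utf8-string and text packages are installed, and the file name is illustrative.

```haskell
-- Sketch: Char counts via UTF-8-aware ByteString length and via Text.
-- Both give the correct answer for multi-byte input, at a cost.
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BLU  -- utf8-string package
import qualified Data.Text.Lazy as TL              -- text package
import qualified Data.Text.Lazy.IO as TLIO

main :: IO ()
main = do
  bytes <- BL.readFile "input.txt"
  print (BLU.length bytes)       -- Char count, decoding UTF-8 on the fly
  t <- TLIO.readFile "input.txt"
  print (TL.length t)            -- Char count via Text
```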
> GNU wc computes the correct answer as quickly as lazy bytestrings
> compute the wrong answer. With perl 5.8, slurping the entire file as
> UTF-8 computes the correct answer just as slowly as Prelude. In my
> first ever Python program (with python 2.6), I tried to read the entire
> file as a unicode string and it quickly crashes due to running out of
> memory (yikes!), so it earns a DNF.
> So, for computing the right answer with this simple test, it looks like
> the Prelude is the best option. We tie with Perl and lose only to GNU
> wc (which is written in C). Really, though, it would be nice to close
> that gap.
Totally agreed :)