Performance of `count`ing a char in a ByteString

Sat Feb 8 22:34:17 UTC 2020

Hi there,

tl;dr: I've noticed ByteString's `count`'s performance can be improved
by a factor between 2.5 and 8 by explicitly using certain SIMD
extensions, so what's the policy on those?

Results of a small C benchmark (since the corresponding BS routine is
implemented in C anyway) are available here (
https://github.com/0xd34df00d/counting-chars ), as well as some sample
code to give the idea of what's happening (which might be
incorrect/suboptimal, my SSE/AVX is rusty). I'll also put a sample
input file if I manage to get git-lfs going.

This change only really makes sense for large strings, but that's a
somewhat common use case at least in my applications, so I wanted to
gather some initial feedback, as well as ask a couple of questions and
emphasize a couple of points.

1. How are SIMD extensions generally controlled?

2. Is depending on gcc OK? If so, we could leverage gcc's support of
automatically generating a dispatching function that'll check cpuid
for the supported extensions and call the right implementation.

3. Does BS' allocator give any guarantees on the alignment? It'd be
nice to get rid of that "prefix" part that makes sure the SIMD code
churns through 64-byte-aligned data.

4. Note that compiling with `-O3 -march=native` makes code effectively
non-portable (at least, to CPUs lacking extensions the compiler
decides to use), while the code with `-O2`, explicit SIMD versions and
a dispatching function will get the best performance using the SIMD
extensions available, degrading gracefully on older hardware. So,
personally, when I look at that table with the results I compare the
SSE4.2/AVX2 implementations against the `-O2` column.

5. Thanks Gregory Collins for mentioning PCMPESTRM somewhere — that
remark sparked the desire to do some intrinsics stuff.

-- 
  Georg Rudoy