[Haskell-cafe] Fast sorting with Bytestring

Stefan O'Rear stefanor at cox.net
Wed Jun 18 14:23:00 EDT 2008


On Wed, Jun 18, 2008 at 08:19:10PM +0200, Georg Sauthoff wrote:
> Hi,
> 
> I played a bit around with the nice bytestring package. At some point I
> implemented a simple sorting program, because I needed line-sorting a file
> with a custom line-compare function. I was a bit surprised, that the
> resulting code is very fast. A lot of faster than sorting via GNU sort
> (with the standard line-compare function).
> 
> Then I got suspicious and tested standard GNU sort against a trivial
> standard sort implementation in Haskell using bytestring. The Haskell
> implementation looks like:
> 
[snip]
> Ok, strane ... Well, let's test with some 'normal' text:
> 
> time ./sort < bible > /dev/null       # ~  0.4   s
> time sort < bible > /dev/null         # ~  0.56  s
> 
> Ok, not that different. But with Haskell you often expect to get very
> slow code compared to an implementation in C.
> 
> And I am surprised, that the Haskell is fast _and_ nice to read - because
> for example the ultra fast 'wc -l' Haskell implementation from the
> Haskell-Wiki uses some insider-knowledge about bytestring, and looks to
> a beginner not that intuitive, I guess.
> 
> ./sort is the shown Haskell implementation. I used ghc 6.8.2, installed
> bytestring 0.9.1.0 as a user-local package (don't now I this superseeds
> the global bytestring package), compiled via ghc -O2. The tests are run at a
> Pentium M 1.3 GHz computer. As GNU sort I tested the version from
> coreutils 6.9.
> 
> You can get the test files from:
> http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/brackets.gz
> http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/bible.gz
> (obviously, you have to gunzip them after downloading ...)
> 
> Of course, the naive Haskell implementation doesn't care about locales
> (i.e. collating locale sequences), but this shouldn't explain the
> observed differences sorting the brackets file.

GNU 'sort' uses an external sort algorithm.  You can, with 200M of
memory, give it a 50G input file, and it will work.  This might explain
the difference..

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20080618/2f6ee2ad/attachment.bin


More information about the Haskell-Cafe mailing list