[Haskell-cafe] Fast sorting with Bytestring
Stefan O'Rear
stefanor at cox.net
Wed Jun 18 14:23:00 EDT 2008
On Wed, Jun 18, 2008 at 08:19:10PM +0200, Georg Sauthoff wrote:
> Hi,
>
> I played a bit around with the nice bytestring package. At some point I
> implemented a simple sorting program, because I needed line-sorting a file
> with a custom line-compare function. I was a bit surprised, that the
> resulting code is very fast. A lot of faster than sorting via GNU sort
> (with the standard line-compare function).
>
> Then I got suspicious and tested standard GNU sort against a trivial
> standard sort implementation in Haskell using bytestring. The Haskell
> implementation looks like:
>
[snip]
> Ok, strane ... Well, let's test with some 'normal' text:
>
> time ./sort < bible > /dev/null # ~ 0.4 s
> time sort < bible > /dev/null # ~ 0.56 s
>
> Ok, not that different. But with Haskell you often expect to get very
> slow code compared to an implementation in C.
>
> And I am surprised, that the Haskell is fast _and_ nice to read - because
> for example the ultra fast 'wc -l' Haskell implementation from the
> Haskell-Wiki uses some insider-knowledge about bytestring, and looks to
> a beginner not that intuitive, I guess.
>
> ./sort is the shown Haskell implementation. I used ghc 6.8.2, installed
> bytestring 0.9.1.0 as a user-local package (don't now I this superseeds
> the global bytestring package), compiled via ghc -O2. The tests are run at a
> Pentium M 1.3 GHz computer. As GNU sort I tested the version from
> coreutils 6.9.
>
> You can get the test files from:
> http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/brackets.gz
> http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/bible.gz
> (obviously, you have to gunzip them after downloading ...)
>
> Of course, the naive Haskell implementation doesn't care about locales
> (i.e. collating locale sequences), but this shouldn't explain the
> observed differences sorting the brackets file.
GNU 'sort' uses an external sort algorithm. You can, with 200M of
memory, give it a 50G input file, and it will work. This might explain
the difference..
Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20080618/2f6ee2ad/attachment.bin
More information about the Haskell-Cafe
mailing list