[Haskell-cafe] Parallel weirdness

Murray Gross mgross21 at verizon.net
Sat Apr 19 11:53:03 EDT 2008

I can't offer definite answers to your questions, but I can suggest a few 
issues you should consider:

1. Merge sort doesn't parallelize all that well: when the blocks are 
small, the parallelization overhead is large compared with the useful 
work to be done, and when the blocks get large, there is little 
parallelism left to exploit. Quicksort and quickersort, of course, 
suffer from the same issue. The end result is that your timings will 
depend heavily on your hardware, your software, and the properties of 
the particular data set you use for testing.
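
To make that overhead concrete, here is a minimal sketch (my own code, 
not Andrew's) of the two-way split he describes, using par and pseq 
from Control.Parallel. Note that par and pseq only evaluate to weak 
head normal form, so each sorted half is forced through its spine 
before the merge, or the spark would do almost no work:

    import Control.Parallel (par, pseq)

    -- Plain sequential merge sort.
    msort :: Ord a => [a] -> [a]
    msort []  = []
    msort [x] = [x]
    msort xs  = merge (msort as) (msort bs)
      where (as, bs) = splitAt (length xs `div` 2) xs

    merge :: Ord a => [a] -> [a] -> [a]
    merge [] ys = ys
    merge xs [] = xs
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys

    -- Force the spine of a list, so the spark does real sorting work
    -- instead of stopping at the first cons cell.
    forceSpine :: [a] -> ()
    forceSpine xs = length xs `seq` ()

    -- Sort the two halves in parallel, then merge them sequentially.
    pmsort :: Ord a => [a] -> [a]
    pmsort xs = forceSpine left `par` (forceSpine right `pseq` merge left right)
      where
        (as, bs) = splitAt (length xs `div` 2) xs
        left  = msort as
        right = msort bs

Even in this best case, the final merge is sequential and touches every 
element, which bounds the speedup at well under 2x on two cores.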

2. You need to account for I/O buffering (not only by your operating 
system in RAM, but by your disk controller): after the first set of I/O 
operations, your data may be sitting in buffers, so subsequent runs may 
read from those buffers rather than from the disk itself. Similarly, 
you have to take paging and cache effects into account, which could 
make the first run much slower than immediately subsequent runs.

3. A better benchmark would be provided by a counting sort, which does 
parallelize well: each element's rank (the number of elements smaller 
than it) can be computed independently of all the others, giving 
O(n * (n/k)) = O(n^2/k) time, where k is the number of processors and n 
is the number of elements to be sorted. A major advantage of using a 
counting sort for benchmarking is that it runs slowly enough to make 
sequential and parallel timings relatively easy to compare.
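
For illustration, here is a rough sketch of that counting (rank) sort, 
again my own code, assuming distinct keys, with a two-way parallel 
variant built from the same par/pseq combinators:

    import Control.Parallel (par, pseq)
    import Data.Array (array, elems)

    -- Each element's final position is the number of elements smaller
    -- than it.  Total work is O(n^2), but the n rank computations are
    -- completely independent of one another.
    countingSort :: Ord a => [a] -> [a]
    countingSort xs = elems (array (0, n - 1) (zip (map rank xs) xs))
      where
        n      = length xs
        rank x = length [ y | y <- xs, y < x ]

    -- Spark the ranks of one half while the main thread computes the
    -- other half; with k threads, each does O(n * (n/k)) comparisons.
    parCountingSort :: Ord a => [a] -> [a]
    parCountingSort xs =
        forceInts ras `par` (forceInts rbs `pseq`
          elems (array (0, n - 1) (zip (ras ++ rbs) xs)))
      where
        n        = length xs
        (as, bs) = splitAt (n `div` 2) xs
        ras      = map rank as
        rbs      = map rank bs
        rank x   = length [ y | y <- xs, y < x ]
        forceInts rs = sum rs `seq` ()

Because every rank costs O(n) no matter where the split falls, the load 
balances evenly, unlike the merge sort above.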

4. Depending on your system defaults, there may also be memory 
allocation issues to take into account (these, too, could easily make 
the first run considerably slower than runs made immediately after it).
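
On points 2 and 4: if you are using GHC (the -threaded and -N2 flags in 
your message suggest you are), the runtime's statistics output is a 
quick way to separate allocation and GC costs from the sort itself. For 
example (MergeSort.hs is just a placeholder name here):

    $ ghc -O2 --make -threaded MergeSort.hs
    $ ./MergeSort +RTS -N2 -sstderr

The -sstderr report shows total bytes allocated, GC time, and mutator 
time for the run, which should tell you whether the slow first pass is 
spent allocating or computing.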



Murray Gross
Brooklyn College



On Sat, 19 Apr 2008, Andrew Coppin wrote:

> OK, so just for fun, I decided to try implementing a parallel merge sort 
> using the seq and par combinators. My plan was to generate some pseudo-random 
> data and time how long it takes to sort it. To try to account for lazy 
> evaluation, what the program actually does is this:
>
> 1. Write the input data to disk without any sorting. (This ought to force it 
> to be fully evaluated.)
> 2. Sort and save the data to disk 8 times. (So I can average the runtimes.)
>
> This is done with two data sets - one with 1 million items, and another with 
> 2 million items. Each data set is run through both the purely sequential 
> algorithm and a simple parallel one. (Split the list in half, merge-sort each 
> half in parallel, and then merge them.)
>
> The results of this little benchmark utterly defy comprehension. Allow me to 
> enumerate:
>
> Weird thing #1: The first time you sort the data, it takes a few seconds. The 
> other 7 times, it takes a split second - roughly 100x faster. Wuh?
>
> Weird thing #2: The parallel version runs *faster* than the sequential one in 
> all cases - even with SMP disabled! (We're only talking a few percent faster, 
> but still.)
>
> Weird thing #3: Adding the "-threaded" compiler option makes *everything* run 
> a few percent faster. Even with only 1 OS thread.
>
> Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In 
> particular, Task Manager shows only one CPU core in use.
>
> Adding more than 2 OS threads makes everything slow down even further - but 
> that's hardly surprising.
>
> Can anybody explain any of this behaviour? I have no idea what I'm 
> benchmarking, but it certainly doesn't appear to be the performance of a 
> parallel merge sort!
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

