[Haskell-cafe] External Sort and unsafeInterleaveIO

Tue Jul 17 23:46:44 EDT 2007

hi --

thanks for the useful comments!  i will definitely go through them
carefully.  unfortunately for this code (but fortunately for me) i
defend my dissertation on monday so i'm a little distracted right
now.....

i'm more than happy to donate this code or whatever improvements
happen to it.  actually, hGetContentsWithCursor seems like a candidate
for inclusion with Data.ByteStrings or Data.Binary or something -- it
seems like it might find other uses.  (i think you liked that bit of
code because i ripped it off of you guys!  it's very short hamming
distance from the original.)  anyhow, all that will have to wait a
couple weeks or so.  also i've never cabalized anything so i may come
begging for help.

at some point i thought i saw how to do recursive external sort, to
keep memory usage truly constant, but with my current lack of sleep i
have lost that illusion.  i'm also curious about the performance
characteristics of this vs Prelude sort vs the version using the
tournament mergesort apfelmus suggested.  i need to find a computer
with a lot more RAM than my weakling laptop.  finally, it would be
good to be able to have the blocksize controlled by Kb of RAM rather
than # of elements, not sure how to get that information.

ultimately this was part of my project to write lucene for haskell.  i
think with this out of the way, plus all the Data.Binary / ByteString
goodness, it shouldn't take too long.  keep writing good libraries for
me!

thanks and take care, Ben

On 7/17/07, Donald Bruce Stewart <dons at cse.unsw.edu.au> wrote:
> midfield:
> > hi folks --
> >
> > a haskell newbie here, searching for comments and wisdom on my code.
> >
> > i had a project to try to implement "external sort" in haskell as a
> > learning exercise.  (external sort is sorting a list that is too large
> > to fit in main memory, by sorting in chunks, spooling to disk, and
> > then merging.  more properly there probably should be multiple stages,
> > but for simplicity i'm doing a one-stage external sort.)
> >
> > the trick is the number of files can quickly grow very large, so it is
> > best to use one large file and seek inside it around.  however as one
> > can imagine the order-of-IO-operations becomes a bit tricky, if you're
> > seeking file handles around underneath Data.ByteString.Lazy's nose.
> > but late this night after not thinking about it for a while i had a
> > brainstorm: rewrite hGetContents to keep the handle position in the
> > right place!  it's all about judicious use of unsafeInterleaveIO.....
> >
> > it seems to be rather fast, strangely faster than the usual "sort" at
> > times.  it also seems to have nice memory characteristics, though not
> > perfect.  it's hard to test because the normal "sort" function takes
> > too much RAM on large lists, making my computer swap like mad.
>
> I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly
> what the ByteString team hoped people would be able to write
> ByteStrings: "Zen of Haskell" code, where you win by working at a high
> level, rather than a low level.
>
> Thanks!
>
> I've inserted some small comments though the  source:
>
> > >module ExternalSort where
> >
> > Sort a list of Ords "offline."  We're doing this to be able to sort
> > things without taking up too much memory (for example sorting lists
> > too large to fit in RAM.)  Laziness is imperative, as is the
> > order-of-operations.
> >
> > >import Control.Monad
> > >import Data.List
> > >import qualified Data.Binary as Bin
> > >import qualified Data.ByteString.Lazy as B
> > >import qualified Data.ByteString as P (hGetNonBlocking, null)
> > >import Data.ByteString.Base (LazyByteString(LPS))
> > >import Foreign.Storable (sizeOf)
> > >import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput,
> > >                  Handle, IOMode(ReadMode, WriteMode),
> > >                  SeekMode(AbsoluteSeek))
> > >import System.IO.Unsafe (unsafeInterleaveIO)
> > >
> > >import qualified Data.Edison.Seq.ListSeq as LS
> > >import qualified Data.Edison.Coll.SplayHeap as Splay
> >
> > Conceptually, we sort a list in blocks, spool blocks to disk, then
> > merge back.  However for IO performance it is better to read off
> > chunks of elements off the sorted blocks from disk instead of
> > elements-at-a-time.
> >
> > It would be better if these were in KBytes instead of # of elements.
> >
> > >blocksize :: Int
> > >blocksize = 10000
> >
> > Turn a list into a list of chunks.
> >
> > >slice :: Int -> [a] -> [[a]]
> > >slice _ [] = []
> > >slice size l = (take size l) : (slice size $ drop size l)
>
> That's unnecessary parenthesis, and I'd probably use splitAt here:
>
>     myslice :: Int -> [a] -> [[a]]
>     myslice _ [] = []
>     myslice n xs = a : myslice n b  where (a,b) = splitAt n xs
>
> And just to check:
>
>     *M> :m + Test.QuickCheck
>     *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs)
>     OK, passed 100 tests.
>
> >
> > Turn a list into a list of blocks, each of which is sorted.
> >
> > >blockify :: (Ord a) => Int -> [a] -> [[a]]
> > >blockify bsize l = map sort $ slice bsize l
>
> Possibly you could drop the 'l' parameter:
>
>     blockify n = map sort . slice n
>
> >
> > Serialize a block, returning the (absolute) position of the start.
> >
> > >dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer
> > >dumpBlock h b = do
> > >  start <- hTell h
> > >  B.hPut h $ Bin.encode b
> > >  return start
> >
> > The actual sorting function.  We blockify the list, turning it into a
> > list of sorted blocks, and spool to disk, keeping track of offsets.
> > We then read back the blocks (lazily!), and merge them.
> >
> > >externalSort [] = do return []
> > >externalSort l = do
> > >  h <- openFile "ExternalSort.bin" WriteMode
> > >  idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
>
>     idx <- mapM (dumpBlock h) (blockify blocksize l)
>
> > >  hClose h
> > >  h <- openFile "ExternalSort.bin" ReadMode
> > >  blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x;
> > >                            return $ Bin.decode bs}) idx
>
>     Possibly
>
>         forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x
>
>
> > >  return (kMerge $ blocks)
> >
> > Merging chunks.  K-way merge (and in fact external sort in general) is
> > detailed in Knuth, where he recommends tournament trees.  The easiest
> > thing is to probably use one of Okasaki's heaps.  I'll use splay
> > heaps, because I don't know any better.
> >
> > It would be better if I changed Ord for blocks to only check the first
> > element.
> >
> > >kMerge :: (Ord a) => [[a]] -> [a]
> > >kMerge [] = []
> > >kMerge l =
> > >    let h = Splay.fromSeq l in
> > >    kM (Splay.minElem h) (Splay.deleteMin h)
> > >    where
> > >    kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
> > >    kM l h
> > >        | h == Splay.empty = l
> > >        | otherwise =
> > >            let next = Splay.minElem h
> > >                (f, b) = span (\x -> x <= head next) l
> > >            in
> > >            f ++ (kM next (if null b then Splay.deleteMin h
> > >                           else (Splay.insert b $ Splay.deleteMin h)))
> > >
> > >kMergeSort :: (Ord a) => [a] -> [a]
> > >kMergeSort l = kMerge $ blockify blocksize l
> >
> > This is a version of hGetContents which resets its handle position
> > between reads, so is safe to use with interleaved handle seeking.
> >
> > >hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString
> > >hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
> > >
> > >hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString
> > >hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS
> > >  where
> > >    lazyRead start = unsafeInterleaveIO $ loop start
> > >
> > >    loop start = do
> > >        hSeek h AbsoluteSeek start
> > >        ps <- P.hGetNonBlocking h k
> > >        --TODO: I think this should distinguish EOF from no data available
> > >        -- the otherlying POSIX call makes this distincion, returning
> > >        either
> > >        -- 0 or EAGAIN
> > >        if P.null ps
> > >          then do eof <- hIsEOF h
> > >                  if eof then return []
> > >                         else hWaitForInput h (-1)
> > >                           >> (loop start)
> > >           else do
> > >              pos <- hTell h
> > >              pss <- lazyRead pos
> > >              return (ps : pss)
>
> Very nice!
>
> > >
> > >defaultChunkSize :: Int
> > >defaultChunkSize = 32 * k - overhead
> > >   where k = 1024
> > >         overhead = 2 * sizeOf (undefined :: Int)
>
> We'll export this value in bytestring 1.0.
>
>
> I like this code. Would you consider cabalising it, and uploading it to
> hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or
> something?
>
> Cheers,
>   Don
>