[Haskell-cafe] External Sort and unsafeInterleaveIO
Donald Bruce Stewart
dons at cse.unsw.edu.au
Tue Jul 17 23:17:26 EDT 2007
midfield:
> hi folks --
>
> a haskell newbie here, searching for comments and wisdom on my code.
>
> i had a project to try to implement "external sort" in haskell as a
> learning exercise. (external sort is sorting a list that is too large
> to fit in main memory, by sorting in chunks, spooling to disk, and
> then merging. more properly there probably should be multiple stages,
> but for simplicity i'm doing a one-stage external sort.)
>
> the trick is the number of files can quickly grow very large, so it is
> best to use one large file and seek inside it around. however as one
> can imagine the order-of-IO-operations becomes a bit tricky, if you're
> seeking file handles around underneath Data.ByteString.Lazy's nose.
> but late this night after not thinking about it for a while i had a
> brainstorm: rewrite hGetContents to keep the handle position in the
> right place! it's all about judicious use of unsafeInterleaveIO.....
>
> it seems to be rather fast, strangely faster than the usual "sort" at
> times. it also seems to have nice memory characteristics, though not
> perfect. it's hard to test because the normal "sort" function takes
> too much RAM on large lists, making my computer swap like mad.
I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly
what the ByteString team hoped people would be able to write
ByteStrings: "Zen of Haskell" code, where you win by working at a high
level, rather than a low level.
Thanks!
I've inserted some small comments though the source:
> >module ExternalSort where
>
> Sort a list of Ords "offline." We're doing this to be able to sort
> things without taking up too much memory (for example sorting lists
> too large to fit in RAM.) Laziness is imperative, as is the
> order-of-operations.
>
> >import Control.Monad
> >import Data.List
> >import qualified Data.Binary as Bin
> >import qualified Data.ByteString.Lazy as B
> >import qualified Data.ByteString as P (hGetNonBlocking, null)
> >import Data.ByteString.Base (LazyByteString(LPS))
> >import Foreign.Storable (sizeOf)
> >import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput,
> > Handle, IOMode(ReadMode, WriteMode),
> > SeekMode(AbsoluteSeek))
> >import System.IO.Unsafe (unsafeInterleaveIO)
> >
> >import qualified Data.Edison.Seq.ListSeq as LS
> >import qualified Data.Edison.Coll.SplayHeap as Splay
>
> Conceptually, we sort a list in blocks, spool blocks to disk, then
> merge back. However for IO performance it is better to read off
> chunks of elements off the sorted blocks from disk instead of
> elements-at-a-time.
>
> It would be better if these were in KBytes instead of # of elements.
>
> >blocksize :: Int
> >blocksize = 10000
>
> Turn a list into a list of chunks.
>
> >slice :: Int -> [a] -> [[a]]
> >slice _ [] = []
> >slice size l = (take size l) : (slice size $ drop size l)
That's unnecessary parenthesis, and I'd probably use splitAt here:
myslice :: Int -> [a] -> [[a]]
myslice _ [] = []
myslice n xs = a : myslice n b where (a,b) = splitAt n xs
And just to check:
*M> :m + Test.QuickCheck
*M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs)
OK, passed 100 tests.
>
> Turn a list into a list of blocks, each of which is sorted.
>
> >blockify :: (Ord a) => Int -> [a] -> [[a]]
> >blockify bsize l = map sort $ slice bsize l
Possibly you could drop the 'l' parameter:
blockify n = map sort . slice n
>
> Serialize a block, returning the (absolute) position of the start.
>
> >dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer
> >dumpBlock h b = do
> > start <- hTell h
> > B.hPut h $ Bin.encode b
> > return start
>
> The actual sorting function. We blockify the list, turning it into a
> list of sorted blocks, and spool to disk, keeping track of offsets.
> We then read back the blocks (lazily!), and merge them.
>
> >externalSort [] = do return []
> >externalSort l = do
> > h <- openFile "ExternalSort.bin" WriteMode
> > idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
idx <- mapM (dumpBlock h) (blockify blocksize l)
> > hClose h
> > h <- openFile "ExternalSort.bin" ReadMode
> > blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x;
> > return $ Bin.decode bs}) idx
Possibly
forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x
> > return (kMerge $ blocks)
>
> Merging chunks. K-way merge (and in fact external sort in general) is
> detailed in Knuth, where he recommends tournament trees. The easiest
> thing is to probably use one of Okasaki's heaps. I'll use splay
> heaps, because I don't know any better.
>
> It would be better if I changed Ord for blocks to only check the first
> element.
>
> >kMerge :: (Ord a) => [[a]] -> [a]
> >kMerge [] = []
> >kMerge l =
> > let h = Splay.fromSeq l in
> > kM (Splay.minElem h) (Splay.deleteMin h)
> > where
> > kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
> > kM l h
> > | h == Splay.empty = l
> > | otherwise =
> > let next = Splay.minElem h
> > (f, b) = span (\x -> x <= head next) l
> > in
> > f ++ (kM next (if null b then Splay.deleteMin h
> > else (Splay.insert b $ Splay.deleteMin h)))
> >
> >kMergeSort :: (Ord a) => [a] -> [a]
> >kMergeSort l = kMerge $ blockify blocksize l
>
> This is a version of hGetContents which resets its handle position
> between reads, so is safe to use with interleaved handle seeking.
>
> >hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString
> >hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
> >
> >hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString
> >hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS
> > where
> > lazyRead start = unsafeInterleaveIO $ loop start
> >
> > loop start = do
> > hSeek h AbsoluteSeek start
> > ps <- P.hGetNonBlocking h k
> > --TODO: I think this should distinguish EOF from no data available
> > -- the otherlying POSIX call makes this distincion, returning
> > either
> > -- 0 or EAGAIN
> > if P.null ps
> > then do eof <- hIsEOF h
> > if eof then return []
> > else hWaitForInput h (-1)
> > >> (loop start)
> > else do
> > pos <- hTell h
> > pss <- lazyRead pos
> > return (ps : pss)
Very nice!
> >
> >defaultChunkSize :: Int
> >defaultChunkSize = 32 * k - overhead
> > where k = 1024
> > overhead = 2 * sizeOf (undefined :: Int)
We'll export this value in bytestring 1.0.
I like this code. Would you consider cabalising it, and uploading it to
hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or
something?
Cheers,
Don
More information about the Haskell-Cafe
mailing list