[Haskell-cafe] External Sort and unsafeInterleaveIO
Ben
midfield at gmail.com
Tue Jul 17 23:46:44 EDT 2007
hi --
thanks for the useful comments! i will definitely go through them
carefully. unfortunately for this code (but fortunately for me) i
defend my dissertation on monday so i'm a little distracted right
now.....
i'm more than happy to donate this code or whatever improvements
happen to it. actually, hGetContentsWithCursor seems like a candidate
for inclusion with Data.ByteStrings or Data.Binary or something -- it
seems like it might find other uses. (i think you liked that bit of
code because i ripped it off of you guys! it's very short hamming
distance from the original.) anyhow, all that will have to wait a
couple weeks or so. also i've never cabalized anything so i may come
begging for help.
at some point i thought i saw how to do recursive external sort, to
keep memory usage truly constant, but with my current lack of sleep i
have lost that illusion. i'm also curious about the performance
characteristics of this vs Prelude sort vs the version using the
tournament mergesort apfelmus suggested. i need to find a computer
with a lot more RAM than my weakling laptop. finally, it would be
good to be able to have the blocksize controlled by Kb of RAM rather
than # of elements, not sure how to get that information.
ultimately this was part of my project to write lucene for haskell. i
think with this out of the way, plus all the Data.Binary / ByteString
goodness, it shouldn't take too long. keep writing good libraries for
me!
thanks and take care, Ben
On 7/17/07, Donald Bruce Stewart <dons at cse.unsw.edu.au> wrote:
> midfield:
> > hi folks --
> >
> > a haskell newbie here, searching for comments and wisdom on my code.
> >
> > i had a project to try to implement "external sort" in haskell as a
> > learning exercise. (external sort is sorting a list that is too large
> > to fit in main memory, by sorting in chunks, spooling to disk, and
> > then merging. more properly there probably should be multiple stages,
> > but for simplicity i'm doing a one-stage external sort.)
> >
> > the trick is the number of files can quickly grow very large, so it is
> > best to use one large file and seek inside it around. however as one
> > can imagine the order-of-IO-operations becomes a bit tricky, if you're
> > seeking file handles around underneath Data.ByteString.Lazy's nose.
> > but late this night after not thinking about it for a while i had a
> > brainstorm: rewrite hGetContents to keep the handle position in the
> > right place! it's all about judicious use of unsafeInterleaveIO.....
> >
> > it seems to be rather fast, strangely faster than the usual "sort" at
> > times. it also seems to have nice memory characteristics, though not
> > perfect. it's hard to test because the normal "sort" function takes
> > too much RAM on large lists, making my computer swap like mad.
>
> I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly
> what the ByteString team hoped people would be able to write
> ByteStrings: "Zen of Haskell" code, where you win by working at a high
> level, rather than a low level.
>
> Thanks!
>
> I've inserted some small comments though the source:
>
> > >module ExternalSort where
> >
> > Sort a list of Ords "offline." We're doing this to be able to sort
> > things without taking up too much memory (for example sorting lists
> > too large to fit in RAM.) Laziness is imperative, as is the
> > order-of-operations.
> >
> > >import Control.Monad
> > >import Data.List
> > >import qualified Data.Binary as Bin
> > >import qualified Data.ByteString.Lazy as B
> > >import qualified Data.ByteString as P (hGetNonBlocking, null)
> > >import Data.ByteString.Base (LazyByteString(LPS))
> > >import Foreign.Storable (sizeOf)
> > >import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput,
> > > Handle, IOMode(ReadMode, WriteMode),
> > > SeekMode(AbsoluteSeek))
> > >import System.IO.Unsafe (unsafeInterleaveIO)
> > >
> > >import qualified Data.Edison.Seq.ListSeq as LS
> > >import qualified Data.Edison.Coll.SplayHeap as Splay
> >
> > Conceptually, we sort a list in blocks, spool blocks to disk, then
> > merge back. However for IO performance it is better to read off
> > chunks of elements off the sorted blocks from disk instead of
> > elements-at-a-time.
> >
> > It would be better if these were in KBytes instead of # of elements.
> >
> > >blocksize :: Int
> > >blocksize = 10000
> >
> > Turn a list into a list of chunks.
> >
> > >slice :: Int -> [a] -> [[a]]
> > >slice _ [] = []
> > >slice size l = (take size l) : (slice size $ drop size l)
>
> That's unnecessary parenthesis, and I'd probably use splitAt here:
>
> myslice :: Int -> [a] -> [[a]]
> myslice _ [] = []
> myslice n xs = a : myslice n b where (a,b) = splitAt n xs
>
> And just to check:
>
> *M> :m + Test.QuickCheck
> *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs)
> OK, passed 100 tests.
>
> >
> > Turn a list into a list of blocks, each of which is sorted.
> >
> > >blockify :: (Ord a) => Int -> [a] -> [[a]]
> > >blockify bsize l = map sort $ slice bsize l
>
> Possibly you could drop the 'l' parameter:
>
> blockify n = map sort . slice n
>
> >
> > Serialize a block, returning the (absolute) position of the start.
> >
> > >dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer
> > >dumpBlock h b = do
> > > start <- hTell h
> > > B.hPut h $ Bin.encode b
> > > return start
> >
> > The actual sorting function. We blockify the list, turning it into a
> > list of sorted blocks, and spool to disk, keeping track of offsets.
> > We then read back the blocks (lazily!), and merge them.
> >
> > >externalSort [] = do return []
> > >externalSort l = do
> > > h <- openFile "ExternalSort.bin" WriteMode
> > > idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
>
> idx <- mapM (dumpBlock h) (blockify blocksize l)
>
> > > hClose h
> > > h <- openFile "ExternalSort.bin" ReadMode
> > > blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x;
> > > return $ Bin.decode bs}) idx
>
> Possibly
>
> forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x
>
>
> > > return (kMerge $ blocks)
> >
> > Merging chunks. K-way merge (and in fact external sort in general) is
> > detailed in Knuth, where he recommends tournament trees. The easiest
> > thing is to probably use one of Okasaki's heaps. I'll use splay
> > heaps, because I don't know any better.
> >
> > It would be better if I changed Ord for blocks to only check the first
> > element.
> >
> > >kMerge :: (Ord a) => [[a]] -> [a]
> > >kMerge [] = []
> > >kMerge l =
> > > let h = Splay.fromSeq l in
> > > kM (Splay.minElem h) (Splay.deleteMin h)
> > > where
> > > kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
> > > kM l h
> > > | h == Splay.empty = l
> > > | otherwise =
> > > let next = Splay.minElem h
> > > (f, b) = span (\x -> x <= head next) l
> > > in
> > > f ++ (kM next (if null b then Splay.deleteMin h
> > > else (Splay.insert b $ Splay.deleteMin h)))
> > >
> > >kMergeSort :: (Ord a) => [a] -> [a]
> > >kMergeSort l = kMerge $ blockify blocksize l
> >
> > This is a version of hGetContents which resets its handle position
> > between reads, so is safe to use with interleaved handle seeking.
> >
> > >hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString
> > >hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
> > >
> > >hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString
> > >hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS
> > > where
> > > lazyRead start = unsafeInterleaveIO $ loop start
> > >
> > > loop start = do
> > > hSeek h AbsoluteSeek start
> > > ps <- P.hGetNonBlocking h k
> > > --TODO: I think this should distinguish EOF from no data available
> > > -- the otherlying POSIX call makes this distincion, returning
> > > either
> > > -- 0 or EAGAIN
> > > if P.null ps
> > > then do eof <- hIsEOF h
> > > if eof then return []
> > > else hWaitForInput h (-1)
> > > >> (loop start)
> > > else do
> > > pos <- hTell h
> > > pss <- lazyRead pos
> > > return (ps : pss)
>
> Very nice!
>
> > >
> > >defaultChunkSize :: Int
> > >defaultChunkSize = 32 * k - overhead
> > > where k = 1024
> > > overhead = 2 * sizeOf (undefined :: Int)
>
> We'll export this value in bytestring 1.0.
>
>
> I like this code. Would you consider cabalising it, and uploading it to
> hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or
> something?
>
> Cheers,
> Don
>
More information about the Haskell-Cafe
mailing list