[Haskell-cafe] Re: Collections

Thu Jun 21 19:38:49 EDT 2007

On 6/20/07, apfelmus <apfelmus at quantentunnel.de> wrote:
> Eh, why not a simple mergesort that also deletes duplicates?

I had to sit down and think about this, and while for the simple case
that I showed, your equivalent code is definitely simpler, and
probably more efficient.

The actual case that I'm dealing with, where I believe Data.Map (or
similar, incl finger trees) has a benefit is one in which it's not
simply a case of lists of items, yielding a list of items. I'm
manipulating an on-disk inverted index, so rather than a simple list
of items, the code is actually monadic, doing IO to retrieve the items
off disk, and the cost of creating the intermediate lists is
unwearable. The key problem is that you loose the laziness because of
the IO monad, so if you're not careful, you end up trying to store the
complete intermediate lists.

If you can assume you can hold enough stuff in memory, then nice
elegant lazy algorithms work beautifully. I'm doing external
algorithms which have to work on Gbs of data, which I can't hold in
memory. So the type signature of my merge is approximately:

type Reader t = IO (Maybe t)
type Writer t = t -> IO ()
merge :: [Reader t] -> Writer t -> IO ()

Actually, the scope of the problem is such that I could *almost*
finesse the problem by reading the globs of data as ByteStrings, then
use lazy evaluation internally, and write the outputs

merge blobLocators writer = do
    lists <- fmap decode $ mapM readOffDisk blobLocators
    mapM_ writer (mergesort lists)

but I need to keep a firm lid on the resource usage, so I can't. <sigh>

cheers,
T.
-- 
Dr Thomas Conway
drtomc at gmail.com

Silence is the perfectest herald of joy:
I were but little happy, if I could say how much.