[Haskell-cafe] Re: Collections

Fri Jun 22 05:10:02 EDT 2007

Thomas Conway wrote:
> On 6/22/07, Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
>> You might find that lazy IO is helpful in this case. The primitive that
>> implements lazy IO is unsafeInterleaveIO :: IO a -> IO a
> 
> Personally, unsafeInterleaveIO is so horribly evil, that even just
> having typed the name, I'll have to put the keyboard through the
> dishwasher (see http://www.coudal.com/keywasher.php).

:D :D
Finally someone who fully understands the true meaning of the prefix
"unsafe"  ;)

>> Note that using a Map will probably not help since it needs to
>> read all the keys to be able to construct it so that'd pull
>> in all the data from disk.
>
> Well, in the case I'm dealing with, the map can contain the current
> key from each postings vector, and the closure for reading the
> remainder of the vector. E.g. Map Key ([IO (Maybe Key)]).

In any case, you have to store as many keys as you have lists to sort,
but lazy mergesort will not hold on more than (length xs + 1) keys in
memory at a single moment in time and only force one new key per
retrieval. No lingering intermediate lists :)

In this situation, unsafeInterleaveIO is an easy way to carry this
behavior over to the IO-case:

 type Reader t = IO (Maybe t)
 type Writer t = t -> IO ()

 readList :: Reader t -> IO [t]
 readList m = unsafeInterleaveIO $ do
    mx <- m
    case mx of
       Just x  -> liftM (x:) $ readList m
       Nothing -> return []

 mergesortIO :: Ord t => [Reader t] -> Writer t -> IO ()
 mergesortIO xs f = do
    ys <- mapM readList xs
    mapM_ f $ mergesort ys

Here, readList creates only as many list elements as you demand,
similarly to getContents. Of course, it has the same problem as
getContents, namely that you can accidentally close the file before
having read all data. But this is applies to any on-demand approach be
it with IO or without.

Also, you can make the heap in mergesort explicit and obtain something
similar to your current approach with Data.Map. The observation is that
while mergesort does create a heap, its shape does not change and is
determined solely by (length xs).

    -- convenient invariant:
    --   the smaller element comes from the left child
  data Ord b => Heap m b = Leaf m b | Branch b (Tree a b) (Tree a b)

    -- smart constructor
  branch :: Ord b => Tree m b -> Tree m b -> Tree m b
  branch x y
      | gx <= gy  = Branch gx x y
      | otherwise = Branch gy y x
      where
      (gx,gy) = (getMin x, getMin y)

    -- fromList is the only way to "insert" elements into a heap
  fromList :: Ord b => [(m,b)] -> Heap m b
  fromList = foldtree1 branch . map (uncurry Leaf)

  getMin :: Heap m b -> b
  getMin (Leaf _ b)      = b
  getMin (Branch b _ _ ) = b

  deleteMin :: Heap (Reader b) b -> IO (Maybe (Heap (Reader b) b))
  deleteMin (Leaf m _)     = m >>= return . fmap (Leaf m)
  deleteMin (Branch _ x y) = do
    mx' <- deleteMin x
    return . Just $ case mx' of
       Just x' -> branch x' y
       Nothing -> y

  mergesortIO :: Ord t => [Reader t] -> Writer t -> IO ()
  mergesortIO xs f = ...

> Also, I need to support concurrent querying and updates,
> and trying to manage the locking is quite hard enough as it is,
> without trying to keep track of which postings vectors have closures
> pointing to them!

I guess you have considered Software Transactional Memory for atomic
operations?
   http://research.microsoft.com/~simonpj/papers/stm/index.htm

Also, write-once-read-many data structures (like lazy evaluation uses
them all the time) are probably very easy to get locked correctly.

Regards,
apfelmus