From enumerators to cursors: turning the left fold inside out
oleg at pobox.com
oleg at pobox.com
Wed Sep 24 00:59:45 EDT 2003
There are two basic approaches of accessing a collection, be it a
file, a database, or a generating function. One approach relies on
enumerators (aka folds); the other is based on lazy lists (aka
cursors, streams). Given a stream, we can always construct the
corresponding enumerator. It's perhaps less appreciated that given a
left fold enumerator, we can always construct the corresponding stream
[1]: we can mechanically invert an enumerator inside out. It has been
argued [2] that the enumerator approach is superior, especially from
the point of view of managing scarce resources such as database
connections. When we design the interface to a collection, it seems a
good idea to choose enumerator as a primitive. We can always obtain a
stream if we really need it.
The mechanical inversion procedure presented in [2] had a catch: it
relies on shift/reset (or call/cc plus a mutable cell, which is the
same thing). How can we do such an inversion in Haskell? Well, we can
introduce a right fold enumerator, which is more amenable to such
transformation. Or we can use a continuation monad and emulate
shift/reset. The present article demonstrates a third approach: a
half-baked, non-recursive left-fold. We argue that such a left fold is
the best interface for a collection. Indeed, given the half-baked
left-fold we can:
- instantiate it into the ordinary left fold
- instantiate in into a stream
If we turn two enumerators into streams, we can *safely* interleave
these streams.
The rest of the article demonstrates the inversion procedure. We use a
file as a sample "collection", a collection of characters. Haskell
provides a stream interface to that collection: hGetChar. We implement
a left fold enumerator. We then show how to turn that enumerator back
to a stream: how to implement a function mygetchar only in terms of
the left fold enumerator. The approach is general and uses no monadic
heavy lifting.
The following code is Haskell98. No unsafe operations are used.
> import IO
The desired left-fold enumerator over a file:
> hfold_left:: FileName -> Title
> -> Iteratee seed
> -> seed -- the initial seed
> -> IO seed
> type FileName = String
> type Title = String -- just an identifying string for debug printing
> type Iteratee seed = seed -> Char -> Either seed seed
The function (seed -> Char -> Either seed seed) is an iteratee: it
takes the current seed and the current character and returns a new
seed. To be more precise, if it returns Right seed', the iteration
continues with seed' as the new seed. If the iteratee returns Left
seed'', the enumerator immediately stops further iterations,
frees all the resources, and returns seed'' as the final result. So
our left fold enumerator supports premature termination of iterations,
a quite useful feature.
As me mentioned above, we actually need a half-baked, non-recursive
enumerator hfold_left'
> make_hfold_left' filename title = do
> h <- openFile filename ReadMode
> let hfold_left' self iteratee seed = do
> mc <- try $ hGetChar h
> case mc of
> Left exc -> hClose h >> return seed
> Right c -> do
> putStrLn $ " reading from " ++ title ++ ". Got " ++ (show c)
> case iteratee seed c of
> Left seed -> hClose h >> return seed
> Right seed -> self iteratee seed
> return hfold_left'
We can easily obtain the desired hfold_left as an instantiation of
hfold_left':
> hfold_left filename title iteratee seed = do
> hfold_left' <- make_hfold_left' filename title
> let g = hfold_left' g
> hfold_left' g iteratee seed
We should note that the Handle is an internal resource and is never
leaked. The statement
putStrLn $ " reading from " ++ title ...
is for debugging. We want to be sure that we don't read from a
stream too much. This will be important for interleaving.
Let us create a test file
> tf = "/tmp/a"
> make_test_file fname = do
> h <- openFile fname WriteMode
> mapM_ ((hPutStr h) . show) [0..9]
> putStrLn $ "Wrote " ++ fname
> hClose h
and test our left fold enumerator
> test_read fname n = do
> (count,acc) <- hfold_left fname "handle" reader (0,[])
> putStrLn $ "Have read: " ++ (show acc)
> where reader (count,acc) c =
> let count' = count + 1
> acc' = c:acc
> in (if count' >= n then Left else Right) (count',acc')
The first test checks the premature termination. We read only five
characters, and then terminate the iteration and dispose of its
resources, including the (invisible) handle.
*Main> test_read tf 5
reading from handle. Got '0'
reading from handle. Got '1'
reading from handle. Got '2'
reading from handle. Got '3'
reading from handle. Got '4'
Have read: "43210"
The following tests the exhaustive iteration.
*Main> test_read tf 20
reading from handle. Got '0'
reading from handle. Got '1'
reading from handle. Got '2'
reading from handle. Got '3'
reading from handle. Got '4'
reading from handle. Got '5'
reading from handle. Got '6'
reading from handle. Got '7'
reading from handle. Got '8'
reading from handle. Got '9'
Have read: "9876543210"
Let us suppose that make_hfold_left' is the only way to access
files. Can we recover ordinary i/o streams from it? The answer is
yes. We can get the familiar open, getchar and eof functions
from the half-baked enumerator. Furthermore, the derivation of these
functions is *independent* of the precise nature of the enumerator. We
will never need (nor be given) access to the handle.
> data MyStream = MyNil (Maybe Char) | MyCons Char (IO MyStream)
>
> myopen:: FileName -> Title -> IO MyStream
>
> myopen filename title = do
> hfold_left' <- make_hfold_left' filename title
> let k fn (MyNil Nothing) = return $ MyNil Nothing
> k fn (MyNil (Just c))
> = return $ MyCons c (hfold_left' k fn (MyNil Nothing))
> hfold_left' k (\_ c -> Right $ MyNil $ Just c) (MyNil Nothing)
> mygetchar:: MyStream -> IO (Char,MyStream)
> mygetchar (MyCons c k) = k >>= (return . ((,) c))
>
> myiseof:: MyStream -> Bool
> myiseof (MyNil Nothing) = True
> myiseof _ = False
We can now test reading from two streams with interleaving:
> test_interleave fname = do
> stream1 <- myopen fname "stream1"
> stream2 <- myopen fname "stream2"
>
> putStrLn "\nReading 2 chars from one stream"
> (c1,stream1) <- mygetchar stream1
> (c2,stream1) <- mygetchar stream1
> putStrLn $ "Read: " ++ (show [c1,c2])
>
> putStrLn "\nReading 3 chars from the second stream"
> (c1,stream2) <- mygetchar stream2
> (c2,stream2) <- mygetchar stream2
> (c3,stream2) <- mygetchar stream2
> putStrLn $ "Read: " ++ (show [c1,c2,c3])
>
> putStrLn "\nReading again 2 chars from the first stream"
> (c1,stream1) <- mygetchar stream1
> (c2,stream1) <- mygetchar stream1
> putStrLn $ "Read: " ++ (show [c1,c2])
*Main> test_interleave tf
reading from stream1. Got '0'
reading from stream2. Got '0'
Reading 2 chars from one stream
reading from stream1. Got '1'
reading from stream1. Got '2'
Read: "01"
Reading 3 chars from the second stream
reading from stream2. Got '1'
reading from stream2. Got '2'
reading from stream2. Got '3'
Read: "012"
Reading again 2 chars from the first stream
reading from stream1. Got '3'
reading from stream1. Got '4'
Read: "23"
Note that we read ahead by exactly one character. If we want to detect
EOF, that is inevitable (we need to attempt reading if we want to
detect EOF).
We should point out the relationship between hfold_left, hfold_left' and
myopen: hfold_left is the fixpoint of hfold_left'. OTH, myopen
captures the continuation of hfold_left' in an IO action. This
relationship once again illustrates that call/cc and Y are indeed two
sides of the same coin [3].
[1] General ways to traverse collections
http://pobox.com/~oleg/ftp/Scheme/enumerators-callcc.html
[2] An argument against cursors
http://srfi.schemers.org/srfi-44/mail-archive/maillist.html
[3] Self-application as the fixpoint of call/cc
http://google.com/groups?selm=7eb8ac3e.0309182239.5a64b3b1%40posting.google.com
More information about the Haskell
mailing list