[Haskell-cafe] Copying Arrays

Johan Tibell johan.tibell at gmail.com
Fri May 30 05:40:34 EDT 2008


Hi!

On Fri, May 30, 2008 at 10:38 AM, Ketil Malde <ketil at malde.org> wrote:
> "Johan Tibell" <johan.tibell at gmail.com> writes:
>> The intent of the not-yet-existing Unicode string is to represent
>> text not bytes.
>
> Right, so this will replace the .Char8 modules as well?  What confused
> me was my misunderstanding Duncan to mean that Unicode text would
> somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.

Yes.

>> To give just one example, short (Unicode) strings are common as keys
>> in associative data structures like maps
>
> I guess typically, you'd break things down to words, so strings of
> length 4-10 or so.  BS uses three words and LBS four (IIRC), so the
> cost of sharing typically outweighs the benefit.

I'm not sure if you would have much sharing in a map as the keys will be unique.

>> Can I also here insert a plea for keeping lazy I/O out of the new
>> Unicode module?
>
> I use ByteString.Lazy almost exclusively.  I realize there's a
> penalty in time and space, but the ability to write applications that
> stream over multi-Gb files is essential.

Lazy I/O comes with a penalty in terms of correctness! Pretending that
I/O and the underlying resource allocations (e.g. file handles) aren't
observable is bad. Lazy I/O is kinda, maybe usable for small scripts
that read a file or two and spit out a result, but for servers it
doesn't work at all. Lazy I/O requires unsafe* functions (e.g.
unsafeInterleaveIO) and is therefore unsafe. The finalizers required
can be arbitrarily complex depending on what kind of resources need to
be allocated. The simple case is a file handle, but there's no reason
we might not need sockets, locks, etc. to create the lazy ByteString.
Here are two possible interfaces for safe I/O. One is stream based
with an explicit close and the other is fold based (i.e. inversion of
control):

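> import Prelude hiding (read)  -- the class below defines its own 'read'
> import qualified Data.ByteString as S
> import Data.Word (Word8)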
>
> -- Stream based I/O.
> class InputStream s where
>   read :: s -> IO Word8
>   readN :: s -> Int -> IO S.ByteString  -- efficient block reads
>   close :: s -> IO ()
>
> openBinaryFile :: InputStream s => FilePath -> IO s

or a left fold over the file's content. The 'foldBytes' function can
close the file at EOF.

> -- Left fold/callback based I/O.
> foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed
> -- Efficient block reads.
> foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed

On top of this you might want monadic versions of the above two
functions. The case for a Unicode type is analogous.
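
To make the fold interface a bit more concrete, here is a rough sketch
of how the two functions could be written on top of strict ByteString
reads. This is only one possible reading of the signatures above, under
assumptions that are mine: Left means "stop early and return this
seed", Right means "keep going", and the 32 KB chunk size is arbitrary.
The file handle never escapes, and withBinaryFile closes it at EOF, on
early exit, and on exceptions.

```haskell
import qualified Data.ByteString as S
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Assumed convention (not fixed by the signatures themselves):
--   Left  seed  = stop early and return this seed
--   Right seed  = continue with this seed
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
foldChunks path step seed0 =
  withBinaryFile path ReadMode $ \h ->
    let loop seed = do
          chunk <- S.hGet h chunkSize
          if S.null chunk
            then return seed                    -- EOF: hGet returned no bytes
            else case step seed chunk of
              Left done  -> return done         -- callback asked to stop
              Right next -> next `seq` loop next
    in loop seed0
  where
    chunkSize = 32 * 1024                       -- arbitrary block size

-- Byte-at-a-time fold, expressed in terms of the chunked one.
foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed
foldBytes path step = foldChunks path stepChunk
  where
    stepChunk seed chunk = go seed (S.unpack chunk)
    go seed []       = Right seed
    go seed (b : bs) = case step seed b of
      Left done  -> Left done
      Right next -> next `seq` go next bs

-- Example: total size of a file, always continuing until EOF.
countBytes :: FilePath -> IO Int
countBytes path = foldChunks path (\n chunk -> Right (n + S.length chunk)) 0

-- Example: number of bytes before the first zero byte, stopping early.
bytesBeforeZero :: FilePath -> IO Int
bytesBeforeZero path =
  foldBytes path (\n b -> if b == 0 then Left n else Right (n + 1)) 0
```

Note that neither example ever sees a Handle, so there is nothing for
the caller to forget to close.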

> Of course, these applications couldn't care less about Unicode, so
> perhaps the usage is different.

The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,

Johan

