using ghc with make

Thu Apr 20 04:06:06 EDT 2006

Hello John,

Wednesday, April 19, 2006, 3:27:49 AM, you wrote:

>> if that is due to the time of reading .hi files, my alternative Binary
>> library should help in some future

> Interesting, A big bottleneck

big bottleneck? ;)

> in jhc right now is reading the (quite
> large) binary ho and hl files on startup. a few things I have wanted out
> of a binary library are: 

> I was going to get around to writing this sometime, but perhaps there is
> room for a collaborative project in there. Is your code available
> somewhere bulat?

http://freearc.narod.ru/Streams.tar.gz
http://haskell.org/haskellwiki/Library/Streams

but this doc don't contain info about Binary part that is now
discussed. i attached to the letter my unfinished docs about this part
of library 

now about your requirements:

>  * mmap based reading.

my Streams library mainly consists of two parts - Streams and
AltBinary. The streams part implements Handle-like interface
(including such functions as vGetChar, vGetByte, vPutBuf, vSeek and so
on) for various data sources - files, memory buffers, pipes, strings.
m/m files support is planned but now has just preliminary
implementation

AltBinary part works via the Streams part. basically, it just
implements various ways to convert data structure to the sequence of
vPutByte operations (and vice versa), with support for lists, arrays
and all other "simpler" datatypes that Haskell/GHC provides to us.
Binary instances for other datatypes can be autogenerated via DrIFT or TH

>  * being able to jump over unneeded data, as in go directly to the 112th
>    record, or the third field in a data structure without having to
>    slurp through everything that came before it.

what should be the user interface? the lib (its Streams part) supports
vSeek/vTell operations. skipping to 112th record without knowing it's
exact location will be impossible if each record can have different
size

the following things imho should not be a part of Binary library
itself, but a higher-level client code

>  * the ability to create a hash of the structure of the underlying data
>    type, to verify you are reading data in the right format.

you mean that using signature is not enough, or to be exact - that
library should generate this signature itself? interesting. i think
that for jhc (and potentially ghc) this should be implemented via
DrIFT?

>  * extensible type-indexed sets (implemented hackily in Info.Binary in
>    jhc)

by creating hash of structure we can reduce this task to just ordinary
hash-like database?

>  * VSDB[1] style ACID updates as an option.
>  * VSDB style write-time optimized constant hash table. I don't mind
>    spending extra time when writing library files to speed up their
>    usage.

i don't understand second thing. but anyway you already implemented
VSDB database. you already has the way to autogenerate Binary
instances. my lib can help by making serialization faster and providing
uniform access to various media (files, buffers, m/m files). i can
also work on hash-of-structure implementation using DrIFT or TH

-- 
Best regards,
 Bulat                            mailto:Bulat.Ziganshin at gmail.com
-------------- next part --------------
AltBinary library features:

- interface compatibility with Binary library
- compatibility both with Hugs and GHC, with GHC-specific speed
optimizations
- (de)serialization speed of 20-60 mb/sec on 1GHz CPU
- support for byte-aligned and bit-aligned, low-endian and big-endian
serialization using the same interface
- free intermixing of text and binary i/o on the same handle
- UTF8 encoding for strings, variable-length encoding for
Int/Word/Integer (including encoding of list/array bounds)
- fixed-size integral values use network byte order; any integral value
can be saved with explicitly specified size using functions "putBits
bh bits value" and "getBits bh bits" (putBit/putWord8...putWord64 is
also supported for all integral types)
- implementation of Binary interface for all Bounded Enum types
- implementation of Binary interface for all array types
- ability to serialize data to any Stream

the following is the guide to using library, containing the following
parts:

1. emulation of Binary interface
2. AltBinary interface
2.1 byte-aligned and bit-aligned streams
2.2 getBits/putBits; Binary instances for Bool, Maybe, Either
2.3 putWord8..putWord64; Binary instances for Int8..Word64; putWord##le
2.4 putBounded; Binary instances for Bounded Enum types
2.5 putUnsigned/putInteger/putLength; Binary instances for Int/Integer/Word
2.6 Binary instances for Char and String
2.7 lists support
2.8 arrays support
2.9 putGhcInteger and putByteArray - GHC-specific routines
2.10 defining Binary instances for custom serialization formats
3. Stream interface

>1. emulation of Binary interface

This library implements 2 interfaces: Binary and AltBinary. First
interface allows to use this library as drop-in replacement for the
well-known Binary and NewBinary libs. all you need to do is to replace
"import Data.Binary" statement with either

 import Data.Binary.ByteAligned
or
 import Data.Binary.BitAligned

depending on what type of access you need. in the first case
representation of any data value will be written/read as the whole
number of bytes, in the second case data values may cross byte
boundaries and, for example, Bools will be packed 8 values per byte.
please draw attention that despite interface emulation this library
and original Binary lib use different representations for most of the
data types

>2. AltBinary interface

          let s = encode ("11",123::Int,[1..10::Int])
          print (decode s::(String,Int,[Int]))

>2.1 4 types of binary streams

AltBinary is "native" interface of this library to (de)serialize data.
it provides the same operations `get` and `put_` to read/write data,
but allows to use them directly on Handles and any other streams:

 import Data.AltBinary

 h <- openBinaryFile "test" WriteMode
 put_ h [1..100::Int]
 hClose h

 h <- openBinaryFile "test" ReadMode
 x <- get h :: IO [Int]
 print x

if you need bit-aligned serialization, use the `openBitAligned` stream
transformer:

 h <- openBinaryFile "test" WriteMode
        >>= openBitAligned
 put_ h "string"
 put_ h True
 vClose h

of course, to read these data you also need to use `openBitAligned`:

 h <- openBinaryFile "test" ReadMode
        >>= openBitAligned
 x <- get h :: IO String
 y <- get h :: IO Bool
 print (x,y)

The above code writes data in big-endian format, if you need to use
low-endian formats, use the following transformers:

 h <- openBinaryFile "test" WriteMode
        >>= openByteAlignedLE
and
 h <- openBinaryFile "test" WriteMode
        >>= openBitAlignedLE

for the byte-aligned and bit-aligned access, respectively.

You can also mix the binary and text i/o at the same stream, with only
one requirement: use "flushBits h" after you used stream for some
bit-aligned I/O:

 h <- openBinaryFile "test" WriteMode
        >>= openBitAligned
 put_ h True
 flushBits h
 vPutStr h "string"
 vClose h

it's also possible to use different types of binary atreams on top of
one Stream:

 h <- openBinaryFile "test" WriteMode
 bh <- openBitAligned h
 put_ bh True
 flushBits bh
 bh <- openByteAlignedLE h
 vPutStr bh "string"
 vClose h

... if you will ever need this :)

>2.2 getBits/putBits; Binary instances for Bool, Maybe, Either

`get` and `put_` operations are just enough if you need only to save
some values in Stream and then restore them. but to assemble/parse
data in some particular format, you will need some more low-level
functions, such as `getBits` and `putBits`, which transfers just the
specified number of bits:

 putBits 32 h (123::Int)
 x <- getBits 32 h :: IO Int

if you call on byte-aligned stream putBits with number of bits, what
is not divisible by 8, the whole number of bytes are occupied. in
particular, putBit on byte-aligned streams occupies entire byte

this makes possible to use the same (de)serialization code and in
particular the same definitions of Binary instances both for
byte-aligned and bit-aligned streams! for example, the following
definition:

 instance Binary Bool where
     put_ h x = putBit h $! (fromEnum x)
     get  h   = do x <- getBit h; return $! (toEnum x)

allows to encode Bool values with just one bit in bit-aligned streams,
but uses the whole byte in byte-aligned ones. further, serialization
code for Maybe types uses Bool values:

 instance Binary a => Binary (Maybe a) where
     put_ bh (Just a) = do put_ bh True; put_ bh a
     put_ bh Nothing  = do put_ bh False
     get  bh          = do flag <- get bh
                           if flag then do a <- get bh; return (Just a)
                                   else return Nothing

as a result, representation of `Maybe a` uses just one more bit than
representation of type `a` in bit-aligned streams, and whole extra
byte otherwise. the same story is for Either types

>2.3 getWord8..putWord64; Binary instances for Int8..Word64

most widespread uses of getBits/putBits is for 1/8/16/32/64 bits, and
so there are specialized (and sometimes more efficient) versions of
these functions, called putBit, putWord8...putWord64 (and of course
their get... counterparts). please draw attention that all these
functions accept arguments (or return values) of any Integral type
(i.e. types what are instances of Integral class - Int, Integer, Word,
Int8..Word64), so you don't need to convert types if you want, for
example, encode Int as 8-bit value:

 putWord8 h (length "test")

these fixed-bits routines used in definitions of Binary instances for
types with fixed sizes - Int8...Word64. types Int, Word and Integer
by default uses variable-sized representation, which would be
described later. if you need to read or write values of these types
using fixed-size representation, use appropriate fixed-bits procedures
instead of get/put_:

 putWord16 h (1::Int)
 putWord32 h (2::Word)
 putWord64 h (3::Integer)

the same rule applies if you need to write fixed-size value with
non-default number of bits:

 putWord8 h (4::Int32)

functions putWord16..putWord64 uses big-endian representation, also
known as network byte order - it is the order of bytes, used natively
on PowerPC/Sparc processors. in this format, representation of value
started fom most significant bytes. if you use bit-aligned stream,
high bits of each byte are also filled first. if you need
little-endian (native for Intel processors) formats,
putWord16le..putWord64le is at your service

>2.4 putBounded; Binary instances for Bounded Enum types

next pair of functions uses mininal possible number of bits to encode
values in given range [min..max]:

 putBounded min max h x
 x <- getBounded min max h

they also support values of any Integral type. These functions are used
to provide default Binary instances for all Bounded Enum types (i.e.
types which support both Bounded and Enum interfaces). for example,
you can declare:

 data Color = Red | Green | Blue  deriving (Bounded, Enum)

and now you can use get/put_ on Colors; Color values would be encoded
using 2 bits in bit-aligned streams (of course, whole byte would be
used in byte-aligned streams)

>2.5 putUnsigned/putInteger/putLength; Binary instances for Int/Integer/Word

putUnsigned provides variable-sized encoding, what can be used to
represent any non-negative Integral value using minimal possible
number of bytes. it uses 7+1 encoding, i.e. 7 bits in each byte
represents bits of actual value, and higher bit used to distinguish
last byte in sequence. so, values in range 0..127 would be encoded
using one byte, values in range 128..2^14-1 - using two bytes and so on

putInteger is about the same, but allows to encode also negative
values, so -64..63 encoded with one byte, -2^13..2^13-1 - with two
bytes...

putLength is synonym for putUnsigned, just used to represent lengths
of various containers - strings, lists, arrays and so on

put_ uses putInteger to encode Int and Integer, and putUnsigned to
encode Word; i don't used fixed-size representation for Int and Word
because that will produce data incompatible between 32-bit and 64-bit
platforms. i also don't use internal GHC's representation of Integer
to speed up (de)serialization because that will produce data
incompatible with other Haskell compilers. but if you need to
(de)serialize large number of Integers quickly, you should use
putGhcInteger/getGhcInteger procedures, described later. of course,
this way your program will become compatible only with the GHC
compiler.

>2.6 Binary instances for Char and String

>2.7 lists support

>2.8 arrays support

This library supports (de)serialization for all array types, included
in standard hierarchical libraries plus PArr arrays, supported only by
GHC. Immutable array types can be (de)serialized to any Stream (just
like lists); mutable arrays can be (de)serialized only in the
corresponding monad (where this array can be read/modified), i.e.
IOArray can be get/put only to Stream belonging to IO monad, STArray
can be get/put only to Stream belonging to the same state monad. all
that is done automatically, just use put_ or get operation on the
corresponding array

if you read an array, you may need (or don't need, depending on the
surrounding code) to specify its type explicitly, say:

 arr <- get h :: IO (Array Int Int32)

besides of automatic support for all array types in put_/get
operations, there are also huge number of "low-level" array
(de)serialization routines. first, there are routines

 putIArray h arr
 putMArray h arr

what can be used to write to the Stream any array that is instance of
IArray or MArray class, correspondingly (the first class contains all
immutable arrays: Array, UArray, DiffArray, DiffUArray; the second -
all other, mutable arrays - IOArray, IOUArray, STArray, STUArray,
StorableArray). corresponding operations to read these arrays require
to explicitly pass them bounds of array read:

 arr <- getIArray h bounds
 arr <- getMArray h bounds

note that this operations is not full analogues of put_/get ones, which
are write and read array bounds automatically. these operations are
more low-level - they are read/write only the array elements. also
note that just like `get` operation, you may need to specify type of
the array read:

 arr <- getIArray h (0,9) :: IO (Array Int Int32)

second, you can read/write array elements with explicitly pointed
(de)serialization procedure for array elements isstead of default
ones, provided by the Binary class. to achive this, add `With` suffix
to routine name and specify procedure to read or write array elements
as the first argument:

 putIArrayWith putUnsigned  h arr
 putMArrayWith (putBits 15) h arr
 arr <- getIArrayWith getWord8 h bounds
 arr <- getMArrayWith (getBounded 1 5) h bounds

of course, you can also provide your own read/write procedures, if
they have the same types as standard get/put_ functions.

there are also variants of all get operations, which uses `size` parameter
instead of `bounds`, and creates arrays with bounds (0,size-1::Int).
they have names with `N` at the end of of procedure name, but before
`With`:

 arr <- getIArrayN h 10 :: IO (Array Int Int32)
 arr <- getMArrayNWith getWord32 h 10 :: IO (IOArray Int Int)

these operations in some way dubs the similar list procedures

at last, part of the `get` operations have versions, specialized to
specific type constructors. for example, `getMArrayN` have
`getIOArrayN` and `getIOUArrayN` variants which can read only the
IOArray/IOUArray, accordingly. it's just a trick to avoid necessity to
specify array types in `get` operations, say instead of:

 arr <- getIArrayN h 10 :: IO (Array Int Int32)

one can write

 arr <- getArrayN h 10

it is nothing more than handy shortcuts. the only exclusion is
operations to read `UArray`, what is not specializations of
corresponding `IArray` operations, but use some faster algorithm and
work only in IO monad. if you need to read `UArray` in any other monad
- please use general operations on the `IArray` instead (anyway the
compiler will ensure proper use via the typechecking)

so far i don't say anything about specific operations for
(de)serialization of parallel arrays (available only in GHC via
the module GHC.PArr).

2.9 putGhcInteger and putGhcByteArray - GHC-specific routines
2.10 defining Binary instances for custom serialization formats

-------------- next part --------------
In AltBinary library there 4 methods of binary I/O builded on top of
each other:

- Byte I/O              (vGetByte and vPutByte)
- Integral values I/O   (getWordXX and putWordXX)
- Data structures I/O   (over 100 operations :) )
- Serialization API     (get and put_)

We will study them all sequentially, starting from the lowest level.

* Byte I/O

Lowest level, the byte I/O, isn't differ significantly from the Char I/O.
All Streams support vGetByte and vPutByte operations, either directly
or via buffering transformer. These operations has rather generalized
types:

vGetByte :: (Stream m h, Enum a) => h -> m a
vPutByte :: (Stream m h, Enum a) => h -> a -> m ()

This allows to read/write any integral and enumeration values without
additional type conversions (of course, these values should belong to
the 0..255 range)

Together with other Stream operations, such as vIsEOF, vTell/vSeek,
vGetBuf/vPutBuf, this allows to write any programs that operate upon
binary data. You can freely mix byte and text I/O on one Stream:

main = do vPutByte stdout (1::Int)
          vPutStrLn stdout "text"
          vPutBuf stdout buf bufsize

* Integral values / bit sequences I/O

The core of this API is two generalized operations:

getBits bits h
putBits bits h value

`getBits` reads certain number of bits from given BinaryStream and
returns it as value of any integral type (Int, Word8, Integer and so on).
`putBits` writes given value as a certain number of bits. The `value`,
again, may be of any integral type.

These two operations can be implemented in one of 4 ways, depending on
the answers on two questions:
- whether integral values written as big- or little-endian?
- whether values written are bit-aligned or byte-aligned?

The library allows you to select any answers on these questions. The
`h` parameter in this operation represents BinaryStream and there are
4 methods to open BinaryStream on top of plain Stream:

binaryStream <- openByteAligned stream      -- big-endian
binaryStream <- openByteAlignedLE stream    -- little-endian
binaryStream <- openBitAligned stream       -- big-endian
binaryStream <- openBitAlignedLE stream     -- little-endian

Moreover, to simplify your work, Stream by itself can also be used as
BinaryStream - in this case byte-aligned big-endian representation used.
So, you can write, for example:
   putBits 16 stdout (0::Int)
or
   bh <- openByteAlignedLE stdout
   putBits 16 bh (0::Int)

There is also operation `flushBits h` what aligns BinaryStream on the
byte boundary. It fills the rest of pyte with zero bits on output and
skip the rest of bits in current bytes on input. Of course, this
operation does nothing on byte-aligned BinaryStreams.

There are also "shortcut" operations what read/write some number of bits:

getBit h
getWord8 h
getWord16 h
getWord32 h
getWord64 h
putBit h value
putWord8 h value
putWord16 h value
putWord32 h value
putWord64 h value

Although these operations seems like just shortcuts for partial
application of getBits/putBits, they are works somewhat faster.
In contrast to other binary I/O libraries, each of these operations
can accept/return values of any integral type.

You can freely mix text I/O, byte I/O and bits I/O as long as you
don't forget to make `flushBits` after bit-aligned chunks of I/O:

main = do putWord32 stdout (1::Int)  -- byte-aligned big-endian

          stdoutLE <- openByteAlignedLE stdout
          putWord32  stdoutLE (1::Int)  -- byte-aligned little-endian
          putBits 15 stdoutLE (1::Int)  -- byte-aligned little-endian

          stdoutBitsLE <- openBitAlignedLE stdout
          putBit     stdoutBitsLE (1::Int)  -- bit-aligned little-endian
          putBits 15 stdoutBitsLE (1::Int)  -- bit-aligned little-endian
          flushBits stdoutBitsLE

          vPutStrLn stdout "text"

          stdoutBits <- openBitAligned stdout
          putBit     stdoutBits (1::Int)  -- bit-aligned big-endian
          putBits 15 stdoutBits (1::Int)  -- bit-aligned big-endian
          flushBits stdoutBit

When you request to write, say, 15 bits to byte-aligned BinaryStream,
the whole number of bytes are written. In particular, each `putBit`
operation on byte-aligned BinaryStream writes the whole byte to the
stream while the same operation on bit-aligned streams fills one bit at
a time.

But that is not yet the whole story! There are also operations that
allow to intermix little-endian and big-endian I/O:

getWord16le h
getWord32le h
getWord64le h
putWord16le h value
putWord32le h value
putWord64le h value
getWord16be h
getWord32be h
getWord64be h
putWord16be h value
putWord32be h value
putWord64be h value

For example, you can write:

main = do putWord32le stdout (1::Int)  -- byte-aligned little-endian
          putWord16be stdout (1::Int)  -- byte-aligned big-endian

Please note that `h` in these operations is a Stream, not
BinaryStream. Actually, these operations just perform several fixed
vGetByte or vPutByte operations and, strictly speaking, they should be
noted in previous section.

There are also combinator versions of `open*` operations, that
automatically perform `flushBits` at the finish:

    withBitAlignedLE stdout $ \h -> do
        putBit     h (1::Int)  -- bit-aligned little-endian
        putBits 15 h (1::Int)  -- bit-aligned little-endian

I also should say that you can perform all the Stream operations on
any BinaryStream, and bit-aligned streams will flush themselves before
performing any I/O and seeking operations. For example:

    h <- openBitAligned stdout
    vPutStr h "text"
    putBit h (1::Int)
    vPutByte h (1::Int)     -- `flushBits` will be automatically
                            --   called before this operation
    putWord16le h (1::Int)  -- little-endian format will be used here despite
                            --   big-endiannes of the BinaryStream itself

* Serialization API

This part is a really small! :) There are just two operations:

get h
put_ h a

where `h` is a BinaryStream. These operations read and write binary
representation of any value belonging to the class Binary.