[Haskell-cafe] Idiomatic way to parse bytecodes

Mon Aug 31 03:43:46 UTC 2015

On 31/08/2015, at 10:35 am, Gautier DI FOLCO <gautier.difolco at gmail.com> wrote:
> Thanks for your answer.
> I think I have misled you in the expression of my needs.
> I don't have any representation issues, I'm just looking for a good design for the parsing phase.

Depending on what you mean by "parsing",
a very simple answer may suffice.

Use a TABLE-DRIVEN approach.

Let me take a Smalltalk byte code as an example.
Instructions fall into a number of groups.
For example, we have
    push_local <number>
where <number> might be encoded in 2 bytes,
in 1 byte, or it might be implied in the opcode.
So we'd have something like

    data Operator
       = Push_Local
       | Store_Local
       | ...
    data Operand
       = None
       | Implied Int
       | One_Byte
       | Two_Byte
       | ...
    decode :: Array Int (Operator, Operand)
    decode = array (0,255) [
       (0, (Push_Local, Implied 0)),
       (1, (Push_Local, Implied 1)),
       ...
       (8, (Push_Local, One_Byte)),
       (9, (Push_Local, Two_Byte)),
       ...

Or even make it a function:

    decode :: Int -> (Operator,Operand)
    decode 0 = ...
    ...
    decode 255 = ...

then for example

    extract :: Operand -> Bytes -> Int -> (Simple_Operand, Int)
    extract None        b i = (No_Operand, i)
    extract (Implied n) b i = (Int_Operand n, i)
    extract (One_Byte)  b i = (Int_Operand (byte b i), i+1)
    extract (Two_Byte)  b i = (Int_Operand (byte b i * 256 + byte b (i+1)),i+2)
    ...

Ideally, you'd switch to a different variant of the instruction set
by simply changing the table.

When you have "portmanteau instructions", it gets a bit trickier.
For example, Smalltalk systems typically have

    Store_Local <index>   -- store the TOS in that variable
    Pop                   -- remove the TOS

combined in

    Pop_Into_Local <index> -- store TOS into variable and remove it

> PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary byte that will be the most likely to decrease my code's size.

There's a Xerox blue-and-white report.  Ah, here we go.
http://www.textfiles.com/bitsavers/pdf/xerox/parc/techReports/CSL-82-2_An_Analysis_of_a_Mesa_Instruction_Set.pdf
Mesa was an "industrial strength Pascal" running on the Alto and
the D-machines.  The author collected dynamic frequencies of
byte pairs.  "It is hard to learn much of general interest by
examining individual instruction frequencies of a small set of
programs, since those statistics may highlight idiosyncratic
behavior of the programs. For example, only three of the top eight
most frequently executed instructions in both VlsiCheck and the
compiler are the same."

There's also the companion paper from ASPLOS'82,
"Static Analysis of the Mesa Instruction Set". R.Sweet, J.Sandman.

The obvious question is WHY you want to compress your code size
in this way.  That report (if only I could remember title or
author, sigh, pointed out that STATIC instruction (and operand)
frequencies and DYNAMIC instruction (and operand) frequencies
can be quite different.

A fair bit of fiddling has gone on in the Smalltalk world, so that
although practically everyone started with something like the Blue
Book model (which is still probably *the* reference for how byte
code systems work).  Those people are trying to balance *compact*
encoding against *efficient* decoding.

If you just want to reduce size you might think in terms of
just compressing everything, with decompression of code that's
actually executed as "Level 0 JIT".