[Haskell-cafe] Idiomatic way to parse bytecodes

Sun Aug 30 22:35:26 UTC 2015

2015-08-30 5:46 GMT+02:00 Kim-Ee Yeoh <ky3 at atamo.com>:

>
> On Sun, Aug 30, 2015 at 6:12 AM, Gautier DI FOLCO <
> gautier.difolco at gmail.com> wrote:
>
>
>> I have two main goals:
>>  * print the signification of each bytecodes (a more explicit one than
>> the hexadecimal value of the bytecode)
>>  * Do some analysis on it, spot patterns, apply markov model on it and so
>> on.
>>
>
> Work backwards, a piece at a time.
>
> The output of parsing is a piece of structured data. So you need to design
> your data type. Start with the bare minimum. Write the pretty printing
> code. What do the type signatures of your analysis functions look like?
>
> E.g. "Apply markov model" is way too vague.
>
> The key thing is to iterate and improve on the design of your data type.
>
> Once the data type looks good, the implementation of parsing becomes clear
> as day.
>
> -- Kim-Ee
>

Hello,

Thanks for your answer.
I think I have misled you in the expression of my needs.
I don't have any representation issues, I'm just looking for a good design
for the parsing phase.
Here's some naive iterative tries:
0. We only have one bytecode per instruction a simple HashMap is enough:
import Data.Word
import Data.Maybe
import qualified Data.Map as H
type Byte = Word8

data ByteCodeSimpleInstruction =
      Astore_0 -- 0x4B
    | Astore_1 -- 0x4C
    | Astore_2 -- 0x4D
    | Astore_3 -- 0x4E
    deriving (Eq, Show)

sampleSI :: [Byte]
sampleSI = [0x4B, 0x4C, 0x4D, 0x4E]

convertTableSI :: H.Map Byte ByteCodeSimpleInstruction
convertTableSI = H.fromList $ [
      (0x4B, Astore_0),
      (0x4C, Astore_1),
      (0x4D, Astore_2),
      (0x4E, Astore_3)
    ]

parserSI :: [Byte] -> [ByteCodeSimpleInstruction]
parserSI = map $ fromJust . flip H.lookup convertTableSI

1. We add instruction with different lengths, I'm "forced" to introduce
pattern matching and it makes me sad:
data ByteCodeVariableLengthInstruction =
      Astore' Byte -- 0x19
    | Astore_0' -- 0x4B
    | Astore_1' -- 0x4C
    | Astore_2' -- 0x4D
    | Astore_3' -- 0x4E
    deriving (Eq, Show)

sampleVLI :: [Byte]
sampleVLI = [0x4B, 0x4C, 0x19, 0x2A, 0x4D, 0x4E]

parserVLI :: [Byte] -> [ByteCodeVariableLengthInstruction]
parserVLI b = case b of
    (0x19:p:n) -> (Astore' p):parserVLI n
    (0x4B:n)   -> Astore_0':parserVLI n
    (0x4C:n)   -> Astore_1':parserVLI n
    (0x4D:n)   -> Astore_2':parserVLI n
    (0x4E:n)   -> Astore_3':parserVLI n
    []         -> []

2. We add instructions that change the next instruction's length, we are
"forced" to add different parsing strategies:
data ByteCodeVariableLengthParameterInstruction =
      Astore'' ByteParameter -- 0x19
    | Astore_0'' -- 0x4B
    | Astore_1'' -- 0x4C
    | Astore_2'' -- 0x4D
    | Astore_3'' -- 0x4E
    | Wide'' -- 0xDD
    deriving (Eq, Show)

data ByteParameter =
      Simple Byte
    | Double Byte Byte
    deriving (Eq, Show)

sampleVLPI :: [Byte]
sampleVLPI = [0x4B, 0x4C, 0xDD, 0x19, 0xA2, 0x2A, 0x4D, 0x4E]

parserVLPI :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPI = parserVLPISimple

parserVLPISimple :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPISimple b = case b of
    (0x19:p:n) -> (Astore'' (Simple p)):parserVLPISimple n
    (0x4B:n)   -> Astore_0'':parserVLPISimple n
    (0x4C:n)   -> Astore_1'':parserVLPISimple n
    (0x4D:n)   -> Astore_2'':parserVLPISimple n
    (0x4E:n)   -> Astore_3'':parserVLPISimple n
    (0xDD:n)   -> Wide'':parserVLPIDouble n
    []         -> []

parserVLPIDouble :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPIDouble b = case b of
    (0x19:p:q:n) -> (Astore'' (Double p q)):parserVLPISimple n
    _            -> parserVLPISimple b

I feel that are too "ad-hoc" tactics and I'm looking for an higher-order
way to do it.
I don't know if I'm clearer now.

Thanks in advance for you help.
Regards.

PS: For the Markov part, Markov models will help me to spot the most common
sequences of bytecodes. Then I'll be able to create proprietary byte that
will be the most likely to decrease my code's size.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150831/ddc6f0c3/attachment.html>