[Haskell-cafe] Very simple parser

Mon Jul 2 18:12:40 EDT 2007

vim: set ft=lhaskell:

On Mon, Jul 02, 2007 at 02:25:57PM -0700, Gregory Propf wrote:
| As a programming exercise I'm trying to use the State monad to create
| a simple parser.  It's for a very simple assembly language for a
| simple virtual machine.  The state is a string of instructions.  I
| want to be able to call something like getNextInstruction to pull out
| the next instruction and then update the state (string).  I know I can
| do this non-monadically by just passing the string explicitly each
| time but I'd like to learn more about the State monad.  I also know
| about Parsec and Happy and so forth but this is just an exercise so I
| want to do it this way.  Any ideas?  I can't seem to get anything to
| work.  I've tried different things but I suspect I'm just missing
| something basic.  Can someone post a simple prototype for this?  Just
| assume the instructions are integers.

For an example, here is the simplest parser type I know of; LL(1) a la
Crenshaw.

Our parser is simply a stateful computation using the rest of the input.

> import Char
> import Control.Monad.State

> type P = State [Char]

Simple primitives.  We need to be able to see what the next character
is.  Notice that we return Maybe because there might not be a next
character.  Also note the use of the State data constructor to modify
the value and return at the same time.

> look :: P (Maybe Char)
> look = State $ \ st -> case st of
>     []      -> (Nothing,  [])
>     (c:cs)  -> (Just c,   c:cs)

We need to do tests occasionally on the values.

> isDigit' = maybe False isDigit
> digitToInt' = maybe 0 digitToInt

getc is similar, but it removes the character.  This is typically done
after making a decision based on look.

> getc :: P (Maybe Char)
> getc = State $ \ st -> case st of
>     []      -> (Nothing,  [])
>     (c:cs)  -> (Just c,   cs)

If we find inconsistent input, we signal a fatal error using the fail
function already defined for State.  A more featureful monad such as
ErrorT (State [Char]) could be used to make error conditions non-fatal
for the program.

Often, we know what the lookahead will be; we can use this for better
error messages.  (We should also store the text position in the state,
but for pedagogical reasons I will ignore that).

For instance, if we are expecting something but don't find it, we use
expected:

> expected str = do
>     context <- gets (show . take 20)
>     fail $ str ++ " expected; found " ++ context ++ " instead"

Skipping whitespace is very important to handle our lines.  Note that
we also handle #-comments.

> white = look >>= \ch ->
>     case ch of  Just '#'   -> line  >> white
>                 Just ' '   -> getc  >> white
>                 Just '\t'  -> getc  >> white
>                 _          -> return ()
>
> line = look >>= \ch ->
>     case ch of  Just '\n'  -> return ()
>                 Nothing    -> return ()
>                 _          -> getc >> line

Another common pattern is to skip over a noise character while
verifying that it was correct.  Notice the use of expected to handle
error message formatting.

Also note that we skip whitespace afterward - from here on we will
maintain the invariant that the cursor does not point to spaces.

> match ch = look >>= \realch ->
>     if (realch == Just ch)
>         then getc
>         else expected (show ch)

Parsing integers is one of the problem requirements; it is handled
by dispatching on the first character, then parsing a natural.

> number = look >>= \ch -> case ch of
>     Just '+'  -> match '+'  >> natural 0
>     Just '-'  -> match '-'  >> negate `fmap` natural 0
>     _         -> natural 0

Naturals can be handled using a simple loop.  Notice that we check
lookahead at the *end*, this is necessary to avoid parsing eg "xx" as
a natural 0.  We use an accumulating parameter.

> natural acc = look >>= \ch -> do
>     unless (isDigit' ch) $ expected "number"
>
>     getc
>     let acc' = digitToInt' ch + acc * 10
>
>     look >>= \lk -> if isDigit' lk
>         then  natural acc'
>         else  white >> return acc'

A line of input to our assembler consists of horizontal whitespace, an
optional number, more horizontal whitespace, and a newline.

> inputLine = do
>     white
>     look >>= \ch -> if isDigit' ch
>         then  do  number
>                   white
>         else  return ()
>     match '\n'

All our input consists of input lines for as long as there is more.

> input = look >>= maybe (return ()) (\_ -> inputLine >> input)

> main = interact $ show . runState input

Stefan