[Haskell-cafe] Very simple parser
Stefan O'Rear
stefanor at cox.net
Mon Jul 2 18:12:40 EDT 2007
vim: set ft=lhaskell:
On Mon, Jul 02, 2007 at 02:25:57PM -0700, Gregory Propf wrote:
| As a programming exercise I'm trying to use the State monad to create
| a simple parser. It's for a very simple assembly language for a
| simple virtual machine. The state is a string of instructions. I
| want to be able to call something like getNextInstruction to pull out
| the next instruction and then update the state (string). I know I can
| do this non-monadically by just passing the string explicitly each
| time but I'd like to learn more about the State monad. I also know
| about Parsec and Happy and so forth but this is just an exercise so I
| want to do it this way. Any ideas? I can't seem to get anything to
| work. I've tried different things but I suspect I'm just missing
| something basic. Can someone post a simple prototype for this? Just
| assume the instructions are integers.
For an example, here is the simplest parser type I know of; LL(1) a la
Crenshaw.
Our parser is simply a stateful computation using the rest of the input.
> import Char
> import Control.Monad.State
> type P = State [Char]
Simple primitives. We need to be able to see what the next character
is. Notice that we return Maybe because there might not be a next
character. Also note the use of the State data constructor to modify
the value and return at the same time.
> look :: P (Maybe Char)
> look = State $ \ st -> case st of
> [] -> (Nothing, [])
> (c:cs) -> (Just c, c:cs)
We need to do tests occasionally on the values.
> isDigit' = maybe False isDigit
> digitToInt' = maybe 0 digitToInt
getc is similar, but it removes the character. This is typically done
after making a decision based on look.
> getc :: P (Maybe Char)
> getc = State $ \ st -> case st of
> [] -> (Nothing, [])
> (c:cs) -> (Just c, cs)
If we find inconsistent input, we signal a fatal error using the fail
function already defined for State. A more featureful monad such as
ErrorT (State [Char]) could be used to make error conditions non-fatal
for the program.
Often, we know what the lookahead will be; we can use this for better
error messages. (We should also store the text position in the state,
but for pedagogical reasons I will ignore that).
For instance, if we are expecting something but don't find it, we use
expected:
> expected str = do
> context <- gets (show . take 20)
> fail $ str ++ " expected; found " ++ context ++ " instead"
Skipping whitespace is very important to handle our lines. Note that
we also handle #-comments.
> white = look >>= \ch ->
> case ch of Just '#' -> line >> white
> Just ' ' -> getc >> white
> Just '\t' -> getc >> white
> _ -> return ()
>
> line = look >>= \ch ->
> case ch of Just '\n' -> return ()
> Nothing -> return ()
> _ -> getc >> line
Another common pattern is to skip over a noise character while
verifying that it was correct. Notice the use of expected to handle
error message formatting.
Also note that we skip whitespace afterward - from here on we will
maintain the invariant that the cursor does not point to spaces.
> match ch = look >>= \realch ->
> if (realch == Just ch)
> then getc
> else expected (show ch)
Parsing integers is one of the problem requirements; it is handled
by dispatching on the first character, then parsing a natural.
> number = look >>= \ch -> case ch of
> Just '+' -> match '+' >> natural 0
> Just '-' -> match '-' >> negate `fmap` natural 0
> _ -> natural 0
Naturals can be handled using a simple loop. Notice that we check
lookahead at the *end*, this is necessary to avoid parsing eg "xx" as
a natural 0. We use an accumulating parameter.
> natural acc = look >>= \ch -> do
> unless (isDigit' ch) $ expected "number"
>
> getc
> let acc' = digitToInt' ch + acc * 10
>
> look >>= \lk -> if isDigit' lk
> then natural acc'
> else white >> return acc'
A line of input to our assembler consists of horizontal whitespace, an
optional number, more horizontal whitespace, and a newline.
> inputLine = do
> white
> look >>= \ch -> if isDigit' ch
> then do number
> white
> else return ()
> match '\n'
All our input consists of input lines for as long as there is more.
> input = look >>= maybe (return ()) (\_ -> inputLine >> input)
> main = interact $ show . runState input
Stefan
More information about the Haskell-Cafe
mailing list