[Haskell-cafe] how to get started: a text application

Graham Klyne GK at ninebynine.org
Wed Jun 23 11:45:32 EDT 2004


At 16:53 22/06/04 +0300, Max Ischenko wrote:

>Hi all,
>
>I'm going to try to implement a version of Markdown tool[1] in Haskell. 
>The application is rather simple: take a text file with some (simple) 
>mark-up embedded in it and turn it into another text file, this time with 
>XHTML markup.

Cool project!

>I need some guidelines on how to get started. I'll have to struggle with 
>both the language itself (as I am a newcomer) plus with libraries that I 
>may need.

Ah, how to get started?  I find myself wrestling with this each time I 
start a new Haskell project, or phase of a project.

>If I to do this in some convenient language, like Python, I'd use a lexer 
>to parse the input, a state machine to build some internal representation 
>and some serializer to write this into XHTML.

These choices are all available in Haskell.  The way you pit them together 
may be a little different (see below).

It's bee a while since I peered at MarkDown, and I forget how the language 
looks, except that I recall that, by design, it's kept very simple and has 
very little "superfluous" syntactic structure.

I think the first choice is whether to go for a separately identifiable 
lexing phase, rather than working directly from the raw text.  Either might 
work, I think.  The HaXml XML parser has a separate lexer, but it turns out 
that it's not always easy to get the tokenization right without having 
contextual information (e.g. from the syntax analyzer).  (XML is rather 
messy in that way.)

In Haskell, it's often reasonably efficient to construct a program as a 
composition of "filters", rather like a Unix command pipeline;  lazy 
evaluation often means that data can "stream" through the filters and never 
exist in its entirety in an intermediate form.  This immediately allows the 
program structure to be resolved into a number of smaller, independent 
pieces; e.g.

    tokenize    :: String -> [Token]
    parse       :: [Token] -> DocModel
    createXHTML :: DocModel -> Document  -- (cf. HaXml)

Then HaXml provides function that can generate textual XML.  Thus the 
overall conversion function might look like:

    markdownToXHTML :: String -> String
    markdownToXHTML = show . document . createXHTML . parse . tokenize

(where "document" is from the HaXml module Text.XML.HaXml.Pretty).

For parsing of any complexity, I recommend Parsec:  it has the advantage of 
being very well documented, and it helps to show how monads can be used to 
handle state information.

The outline sketched above has at least one weakness, it doesn't provide 
any way to handle errors.  This could be overcome by using Either as an 
error monad (see Control.Monad and Control.Monad.Error in the standard 
hierarchical libraries), and then using >>= in place of function 
composition (noting the reversal of component order):

    -- Get definition of (Either String) as an instance of Monad:
    import Control.Monad()
    import Control.Monad.Error()

    tokenize    :: String -> Either String [Token]
    parse       :: [Token] -> Either String DocModel
    createXHTML :: DocModel -> Either String Document
    markdownToXHTML :: String -> Either String String
    markdownToXHTML s =
        tokenize s >>= parse >>= createXHTML >>= (Right . show . document)

Or, using the do-notation:

    markdownToXHTML s = do
        { ts <- tokenize s
        ; dm <- parse ts
        ; xd <- createXHTML
        ; return $ show (document xd)
        }

>Which approaches could you suggest for that kind of application in 
>Haskell? I'd especially glad to hear about some "idiomatic" or "native" 
>ways to solve the problem.

>I've made a quick search and found tools like Parsec, HaXML, Happy and 
>WASH/HTML. Not sure which ones I'll need.

I recommend Parsec (if it fits your needs, of course).

It's not clear to me whether you want to generate text directly from the 
document data, or via something like HaXml.  HaXml does have a pretty 
printing function (though I have noticed some minor problems with it, 
though probably nothing you'd encounter with this application.)

...

Just to check out my use of >>= and do-notation, I constructed a trivial 
complete program using both.

[[
-- spike-composition.hs
--
-- composition of functions; using Either as an error monad

-- Get definition of (Either String) as an instance of Monad:
import Control.Monad()
import Control.Monad.Error()

data Token = T { unT::Char }

data DocModel = D String

instance Show DocModel where
     show (D s) = s

tokenize    :: String -> Either String [Token]
tokenize    = Right . map T

parse       :: [Token] -> Either String DocModel
parse       =  Right . D . reverse . map unT

docToString1 s =
     tokenize s >>= parse >>= (Right . ("docToString1: "++) . show)

docToString2 s = do
     { ts <- tokenize s
     ; dm <- parse ts
     ; return $ (("docToString2: "++) . show) dm
     }

t1 = docToString1 "abc" -- == Right "docToString1: cba"
t2 = docToString2 "def" -- == Right "docToString2: fed"
]]

#g
--

>[1] - http://daringfireball.net/projects/markdown/
>
>TIA,
>max.
>_______________________________________________
>Haskell-Cafe mailing list
>Haskell-Cafe at haskell.org
>http://www.haskell.org/mailman/listinfo/haskell-cafe

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact



More information about the Haskell-Cafe mailing list