[Haskell-cafe] ANN: cheapskate 0.1, markdown parser

Tue Jan 7 18:18:29 UTC 2014

+++ Justin Bailey [Jan 06 14 13:07 ]:
> Looks like an excellent library! How did you manage to maintain linear
> growth in running time? Seems like quite an achievement.

I tried very hard to avoid any backtracking.

Parsing occurs in two phases.  In the first phase, we process each line
of the input in turn (this could even be done with streaming input;
we never need to "rewind").  Each line transforms a "container stack,"
which container stack represents the nested containers, e.g.  list items
or blockquotes, that are open at that point in the document.  Depending
on the contents of the line, we might add new containers to the stack
and/or close some of the open containers, adding them as children to the
elements below them.  The end of the line, after the bits that indicate
container structure, will generally be added as a leaf element (text
line or blank line) to the top container.

I had to make a couple small changes in markdown syntax to make this
line-by-line strategy feasible.  If an opening code fence isn't matched
by a closing code fence, the whole remainder of the document gets parsed
as code.  (Otherwise we'd have to backtrack an indefinite number of
lines.)  There is a also a change in how raw HTML works (documented in
the README).  A raw HTML block begins with an opening block HTML tag,
and ends, not with a matching closing tag, but with the next blank line.
An advantage of this system is that we can easily include markdown
content in HTML tags (just surround it with blank lines), yet we can
still have literal HTML blocks (just make sure they don't contain
blanks).  So I think this is superior to Gruber's original specification
for raw HTML blocks, not just from a parsing point of view but from an
expressive point of view.

After we've read all the input, we close the container stack. We now
have a Document element that represents the container structure and
the text lines under each container.  We also have a reference
map with the link references.  You can inspect this structure using
'cheapskate --debug'; here's an example:

  (Document
    ListItem {markerColumn = 1, padding = 3, listType = Numbered PeriodFollowing 1}
      "Foo"
      ListItem {markerColumn = 3, padding = 2, listType = Bullet '-'}
        "bar"
        BlankLine ""
        "baz"
    BlankLine ""
    ListItem {markerColumn = 1, padding = 3, listType = Numbered PeriodFollowing 2}
      "Bim",fromList [])

In the second phase, we walk this container structure and produce a
proper AST.  This involves assembling list items into lists and
assembling lists of text lines into paragraphs, parsing their contents
as markdown inline elements, using the reference map to resolve
reference links.

Inline parsing uses parser combinators.  I avoid backtracking by using
fallbacks.  For example, if we start parsing an emphasized section beginning
with '*' and don't find the closing '*', we don't backtrack; instead, we just
emit a '*' and the content we've parsed so far.

I'd love to make the library faster.  It is the fastest pure Haskell implementation
I'm aware of, but C libraries like discount are still quite a bit faster.
(Note:  often they skimp on proper unicode support, which accounts for some of
this.)  Profiling does not show me any particular bottleneck, but I'd welcome
suggestions.

John

> On Mon, Jan 6, 2014 at 11:15 AM, John MacFarlane <jgm at berkeley.edu> wrote:
> > I've released a new markdown library on Hackage:
> > http://hackage.haskell.org/package/cheapskate
> >
> > This library is designed to be used in web applications.  It is small,
> > accurate, and fast, in pure Haskell with few dependencies.  All output
> > is sanitized through a whitelist by default.  It is designed to have
> > performance that varies linearly with the input size, even with garbage
> > input.  To illustrate:
> >
> > % head -c 100000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
> > cheapskate > /dev/null  0.15s user 0.01s system 82% cpu 0.199 total
> > % head -c 1000000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
> > cheapskate > /dev/null  1.53s user 0.03s system 88% cpu 1.770 total
> > % head -c 10000000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
> > cheapskate > /dev/null  15.50s user 0.20s system 89% cpu 17.632 total
> >
> > This is a test that many markdown parsers fail (including my own pandoc
> > and the markdown package on Hackage).
> >
> > Performance is about seven times faster than pandoc (with five times
> > less memory used), and about 25% faster than the markdown package on Hackage.
> >
> > Generic functions are provided that allow transformations of the AST
> > prior to rendering (e.g., promotion of headers, insertion of syntax
> > highlighting, or the conversion of specially marked code blocks into
> > diagrams).
> >
> > Several markdown extensions are supported, and sensible decisions have
> > been made about several aspects of markdown syntax that are left vague
> > by John Gruber's specification.  For details, see the README
> > at https://github.com/jgm/cheapskate.
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe