[Haskell-cafe] Generating PDF from HTML with Pandoc

Geraldus heraldhoi at gmail.com
Tue Jun 9 07:13:54 UTC 2020


Hi dear Cafe!

I'm trying to accomplish what should be a trivial task: generating a PDF
from an HTML template using Pandoc.

So far I've tried the `wkhtmltopdf` and `pdflatex` creators, both with no
luck.

First I want to say a few words about the `pdflatex` and `xelatex`
creators, for anyone who struggles with the same task in the future, since
it's quite hard to find code examples on the web.

Initially I wasn't able to render the document with the `pdflatex`
creator.  I would like to mention that `pdflatex` requires a lot of LaTeX
packages to be installed, especially font packages.  I also spent several
hours getting rendering to work at all because I hadn't specified a
template in `WriterOptions`.  With this setup `pdflatex` was not able to
handle Cyrillic Unicode characters, and I finally figured out that I have
to use the `xelatex` creator.  I also found and used the default template:

> pandoc <- readHtml def (toStrict $ renderHtml html)
> tpl' <- getDefaultTemplate "latex"
> makePDF "xelatex" [] writeLaTeX  (def {writerTemplate = Just tpl'}) pandoc

But in this case I got blank space instead of Cyrillic characters in the
resulting PDF, and a bunch of warnings in the console about characters
missing from the default font.  I assume the font itself is specified in
the template.  I've looked into the default template and it's huge.  I
guess I could prepare a simpler template for my own needs, but it would
take a lot of time to get familiar with LaTeX document syntax.
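
For reference, a much smaller template might look roughly like the sketch
below, kept as a Haskell string so that it could be passed in place of
`tpl'` on Pandoc versions where `writerTemplate` takes a plain string.
Everything in it is an assumption rather than something copied from the
default template: the name `minimalTemplate` is made up, DejaVu Serif is
just one example of a font with Cyrillic glyphs (it has to be installed),
and a real document may need more packages than this, for tables or images
for instance.

```haskell
-- Hypothetical minimal Pandoc LaTeX template for xelatex, as a plain String.
-- Assumes the fontspec and hyperref packages and the DejaVu Serif font
-- are installed; none of this is taken from the original post.
minimalTemplate :: String
minimalTemplate = unlines
  [ "\\documentclass{article}"
  , "\\usepackage{fontspec}       % font selection for xelatex"
  , "\\setmainfont{DejaVu Serif}  % assumed font with Cyrillic glyphs"
  , "\\usepackage{hyperref}       % links coming from the HTML input"
  , "% Pandoc's LaTeX output uses \\tightlist in compact lists:"
  , "\\providecommand{\\tightlist}{%"
  , "  \\setlength{\\itemsep}{0pt}\\setlength{\\parskip}{0pt}}"
  , "\\begin{document}"
  , "$body$"
  , "\\end{document}"
  ]
```

`$body$` is the template variable that Pandoc replaces with the rendered
document; the rest is ordinary LaTeX.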

I've also tried `wkhtmltopdf`, which seems to be a lightweight and easy
solution.  It seemed to work well except for encoding issues: the resulting
PDF contains Cyrillic text that is rendered incorrectly.  I tried passing
`["encoding utf-8"]` as the arguments in the `makePDF` call, but this
results in a runtime error:

> --margin-bottom specified in incorrect location

Googling around this issue gave me a clue: when I pass the encoding
argument to `wkhtmltopdf`, it breaks the expected argument order in the
command that Pandoc generates.  This could likely be fixed easily, but
Pandoc has a lot of open issues on GitHub, and a fix also requires some
digging into the `wkhtmltopdf` command line argument syntax.  I've looked
into the Pandoc sources and it seems possible to provide a simple patch,
but I need guidance.  `wkhtmltopdf` distinguishes global args, page args,
cover args, and table of contents args.  `encoding` is a page level
argument, but Pandoc puts the extra args specified in `makePDF` after the
default page arguments (`pdfargs` in the following code sample):

>  let args   = mathArgs ++ concatMap toArgs
>                  [("page-size", getField "papersize" meta')
>                  ,("title", getField "title" meta')
>                  ,("margin-bottom", Just $ fromMaybe "1.2in"
>                             (getField "margin-bottom" meta'))
>                  ,("margin-top", Just $ fromMaybe "1.25in"
>                             (getField "margin-top" meta'))
>                  ,("margin-right", Just $ fromMaybe "1.25in"
>                             (getField "margin-right" meta'))
>                  ,("margin-left", Just $ fromMaybe "1.25in"
>                             (getField "margin-left" meta'))
>                  ,("footer-html", getField "footer-html" meta')
>                  ,("header-html", getField "header-html" meta')
>                  ] ++ pdfargs

This is likely what breaks everything.  The quickest and dirtiest
workaround I see is to check each argument and, if it is a page level
argument, put it on each page object.  Another solution may be to specify
the encoding for the Pandoc document some other way, but I can't figure out
how to do that yet.
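
A rough sketch of the first workaround, using only `base`, could look like
this.  It is an illustration under assumptions, not Pandoc's code: the
names `Opt`, `pageLevelNames`, `renderOpt` and `orderArgs` are made up, the
list of page level option names is incomplete and would have to be checked
against the `wkhtmltopdf` documentation, and whether `wkhtmltopdf` accepts
the resulting order inside the command Pandoc builds has not been verified.

```haskell
import Data.List (partition)

-- | An option name without the leading dashes, plus an optional value,
-- e.g. ("encoding", Just "utf-8").
type Opt = (String, Maybe String)

-- Hypothetical, incomplete list of option names treated as page level.
pageLevelNames :: [String]
pageLevelNames = ["encoding", "footer-html", "header-html"]

isPageLevel :: Opt -> Bool
isPageLevel (name, _) = name `elem` pageLevelNames

-- Turn one option into command line words: the flag, then its value if any.
renderOpt :: Opt -> [String]
renderOpt (name, Nothing)  = ["--" ++ name]
renderOpt (name, Just val) = ["--" ++ name, val]

-- Emit all global options first and all page level options after them,
-- keeping the relative order inside each group.
orderArgs :: [Opt] -> [String]
orderArgs opts = concatMap renderOpt globals ++ concatMap renderOpt pages
  where
    (pages, globals) = partition isPageLevel opts
```

With this, `orderArgs [("encoding", Just "utf-8"), ("margin-bottom", Just
"1.2in")]` gives `["--margin-bottom","1.2in","--encoding","utf-8"]`, so the
page level option no longer comes before a global one regardless of where
the caller put it.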

Maybe someone has already faced a similar task and knows an easier way to
render HTML to PDF with Haskell.  I will be very grateful for any help,
advice or other clues on how to achieve my goal.

Arthur.