[Haskell-cafe] Extract text from PDF file: need testers

Yuras Shumovich shumovichy at gmail.com
Mon Oct 21 00:02:29 UTC 2013


I just uploaded a new version of the pdf-toolbox suite.

Now it supports text extraction, see

The new library, pdf-toolbox-content, contains low-level tools for text
extraction. For example, one can extract glyphs with their exact
positions. It can be used e.g. to implement text selection in a PDF
viewer (see
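
To illustrate how glyph positions could drive text selection, here is a
minimal, self-contained sketch. It does not use the pdf-toolbox-content
API: the Rect and Glyph types and the intersects and selectText functions
are hypothetical names invented for this example, and the reading-order
rule (sort by baseline, then by left edge) is a deliberate simplification.
Coordinates are assumed to follow the PDF convention, with the origin at
the bottom left and y growing upward.

```haskell
module Main where

import Data.List (sortOn)

-- Hypothetical axis-aligned rectangle; NOT a pdf-toolbox type.
data Rect = Rect { left, bottom, right, top :: Double }

-- Hypothetical glyph: the text it represents plus its bounding box.
data Glyph = Glyph { glyphText :: String, glyphBox :: Rect }

-- True when the two rectangles overlap with non-zero area.
intersects :: Rect -> Rect -> Bool
intersects a b =
  left a < right b && left b < right a &&
  bottom a < top b && bottom b < top a

-- Collect the text of every glyph touching the selection rectangle.
-- Glyphs are ordered top line first, then left to right. Glyphs on the
-- same line are assumed to share exactly the same bottom coordinate.
selectText :: Rect -> [Glyph] -> String
selectText sel =
  concatMap glyphText . sortOn readingOrder . filter (intersects sel . glyphBox)
  where
    readingOrder g = (negate (bottom (glyphBox g)), left (glyphBox g))

-- Made-up sample data: "Hi!" on one line and "x" on the line below,
-- deliberately listed out of order.
sampleGlyphs :: [Glyph]
sampleGlyphs =
  [ Glyph "i" (Rect 5 10 8 20)
  , Glyph "x" (Rect 0 0 5 8)
  , Glyph "!" (Rect 8 10 10 20)
  , Glyph "H" (Rect 0 10 5 20)
  ]

main :: IO ()
main = putStrLn (selectText (Rect 0 9 8 21) sampleGlyphs)  -- prints Hi
```

A real viewer would also have to group glyphs into lines with some
tolerance and decide where to insert spaces, which this sketch ignores.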

Is anybody interested in that functionality? I tested it on all the PDF
files in my ~/Downloads, but there are a number of corner cases that are
not handled because I have never seen them in the wild. So, if you are
interested, please try it out and report any issues. The easiest way is
to install pdf-toolbox-viewer (not on Hackage, see
https://github.com/Yuras/pdf-toolbox/tree/master/viewer ; it depends on
gtk2hs) and run it with the path to a PDF file as an argument. Or you can
just use the pageExtractText function directly:

import System.IO
import Pdf.Toolbox.Document

main =
  -- open the file in binary mode; the handle is closed afterwards
  withBinaryFile "input.pdf" ReadMode $ \handle ->
    runPdfWithHandle handle knownFilters $ do
      pdf <- document
      catalog <- documentCatalog pdf
      -- the root of the page tree
      rootNode <- catalogPageNode catalog
      -- number of pages under the root node
      count <- pageNodeNKids rootNode
      liftIO $ print count
      -- the first page of the document (pages are numbered from 0)
      page <- pageNodePageByNum rootNode 0
      txt <- pageExtractText page
      liftIO $ print txt

A few screenshots (please let me know if you can't access them):
 - render via ImageMagick:
 - render extracted text with correct positions:
 - combined image:

On Hackage:
 - pdf-toolbox-document:
 - pdf-toolbox-core: http://hackage.haskell.org/package/pdf-toolbox-core
 - pdf-toolbox-content:

