[Haskell-cafe] Extract text from PDF file: need testers

Mon Oct 21 00:02:29 UTC 2013

Hello,

I just uploaded a new version of the pdf-toolbox suite.

Now it supports text extraction, see
http://hackage.haskell.org/package/pdf-toolbox-document-0.0.2.0/docs/Pdf-Toolbox-Document-Page.html#v:pageExtractText

A new library, pdf-toolbox-content, contains low-level tools for text
extraction. For example, one can extract glyphs with their exact
positions. This can be used e.g. to implement text selection in a PDF
viewer (see the screenshots).
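
To illustrate how glyph positions could drive text selection, here is a
minimal self-contained sketch. The Rect and Glyph types and the
selectText function are hypothetical and invented for this example; they
are not the pdf-toolbox-content API. A real viewer would get glyphs and
their boxes from the library instead.

```haskell
module Main where

-- Hypothetical types for illustration only; NOT the pdf-toolbox-content API.

-- | Axis-aligned rectangle in page coordinates.
data Rect = Rect
  { rLeft   :: Double
  , rBottom :: Double
  , rRight  :: Double
  , rTop    :: Double
  } deriving (Show, Eq)

-- | A glyph together with its bounding box on the page.
data Glyph = Glyph
  { glyphChar :: Char
  , glyphBox  :: Rect
  } deriving (Show, Eq)

-- | True when the two rectangles overlap (touching edges do not count).
overlaps :: Rect -> Rect -> Bool
overlaps a b =
  rLeft a < rRight b && rLeft b < rRight a &&
  rBottom a < rTop b && rBottom b < rTop a

-- | Characters of all glyphs whose box overlaps the selection rectangle,
-- in the order the glyphs were given.
selectText :: Rect -> [Glyph] -> String
selectText sel glyphs =
  [glyphChar g | g <- glyphs, overlaps sel (glyphBox g)]

main :: IO ()
main = do
  -- Lay out "Haskell" as 10-unit-wide glyphs on a single line.
  let glyphAt c x = Glyph c (Rect x 0 (x + 10) 12)
      line = zipWith glyphAt "Haskell" [0, 10 ..]
  -- The selection spans x = 15 .. 45, so it touches "a", "s", "k", "e".
  putStrLn (selectText (Rect 15 2 45 8) line)  -- prints: aske
```

The sketch keeps glyphs in input order; a real implementation would
probably also need to sort or group glyphs into lines.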

Is anybody interested in this functionality? I tested it on all the PDF
files in my ~/Downloads, but there are a number of corner cases that are
not handled because I have never seen them in the wild. So, if you are
interested, please try it out and report any issues. The easiest way is
to install pdf-toolbox-viewer (not on Hackage, see
https://github.com/Yuras/pdf-toolbox/tree/master/viewer , it depends on
gtk2hs) and run it with the path to a PDF file as an argument. Or you can
just use the pageExtractText function directly:

import System.IO
import Pdf.Toolbox.Document

main =
  withBinaryFile "input.pdf" ReadMode $ \handle ->
    runPdfWithHandle handle knownFilters $ do
      pdf <- document
      catalog <- documentCatalog pdf
      rootNode <- catalogPageNode catalog
      count <- pageNodeNKids rootNode
      liftIO $ print count
      -- the first page of the document (page numbers start at 0)
      page <- pageNodePageByNum rootNode 0
      txt <- pageExtractText page
      liftIO $ print txt
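
To try the extraction on every page rather than only the first one, the
example above could be extended roughly as follows. This is only a
sketch: it uses the same functions as the example, and it assumes that
pageNodeNKids returns the total number of pages below the root node and
that pages are numbered from 0. The only addition is forM_ from
Control.Monad.

```haskell
import Control.Monad (forM_)
import System.IO
import Pdf.Toolbox.Document

main =
  withBinaryFile "input.pdf" ReadMode $ \handle ->
    runPdfWithHandle handle knownFilters $ do
      pdf <- document
      catalog <- documentCatalog pdf
      rootNode <- catalogPageNode catalog
      -- assumption: this is the total number of pages in the document
      count <- pageNodeNKids rootNode
      -- assumption: pages are numbered 0 .. count - 1
      forM_ [0 .. count - 1] $ \n -> do
        page <- pageNodePageByNum rootNode n
        txt <- pageExtractText page
        liftIO $ print txt
```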

A few screenshots (please let me know if you can't access them):
 - render via ImageMagick:
https://docs.google.com/file/d/0B0K_fl2fc1ZgcnVtZXhFTUx5ekE/edit?usp=sharing
 - render extracted text with correct positions:
https://docs.google.com/file/d/0B0K_fl2fc1ZgZE52X0hMcVNnaG8/edit?usp=sharing
 - combined image:
https://docs.google.com/file/d/0B0K_fl2fc1ZgaUE5Qkt6S19VQlE/edit?usp=sharing

On Hackage:
 - pdf-toolbox-document:
http://hackage.haskell.org/package/pdf-toolbox-document
 - pdf-toolbox-core: http://hackage.haskell.org/package/pdf-toolbox-core
 - pdf-toolbox-content:
http://hackage.haskell.org/package/pdf-toolbox-content

Thanks,
Yuras