[Haskell-cafe] NLP libraries and tools?

Thu Jul 7 00:45:06 CEST 2011

On Wed, Jul 06, 2011 at 03:14:07PM -0700, Rogan Creswick wrote:
> Have you used that particular combination yet? I'd like to know the
> details of how you hooked everything together if that's something you
> can share.  (We're working on a similar Frankenstein at the moment.)

These Frankensteins, as you so dearly call them, are always very task-specific.
Here's a setup I've used:

- Take some sort of corpus you want to work with, and annotate it with, say,
  Java tools. This will probably require you to massage the input corpus into
  something your tools can read, and then call the tools to process it.
- Let your Java stuff write everything to disk in a format that you can easily
  read in with Haskell. If your annotations are not interleaving, you're lucky,
  because you can probably just use a word-per-line format with columns for
  markup. That's trivial to read in with Haskell. More complicated stuff should
  probably be handled XML-fashion. I like HXT for reading in XML, but it's
  slow, as are its competitors. (It's been a while since I've used it, though;
  maybe it supports Text or ByteStrings by now.)
- Advanced mode: instead of dumping to files, use named pipes or TCP sockets to
  transfer data. Good luck.
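
For the word-per-line case, a minimal sketch of the Haskell side could look
like the following. The layout is an assumption, not something fixed by the
setup above: tab-separated columns, the token in the first column, any markup
in the remaining columns, and a blank line between sentences. The names Token,
parseCorpus and readCorpus are made up for illustration.

```haskell
import           Data.Maybe   (mapMaybe)
import qualified Data.Text    as T
import qualified Data.Text.IO as TIO

-- | One annotated token: the surface form plus the markup columns after it.
data Token = Token
  { tokForm  :: T.Text
  , tokAnnos :: [T.Text]
  } deriving (Show, Eq)

type Sentence = [Token]

-- | Parse one line. Assumes tab-separated columns (hypothetical layout).
-- Lines with an empty first column are dropped.
parseToken :: T.Text -> Maybe Token
parseToken line = case T.split (== '\t') line of
  (form : annos) | not (T.null form) -> Just (Token form annos)
  _                                  -> Nothing

-- | Split the input into sentences at blank lines and parse each line.
parseCorpus :: T.Text -> [Sentence]
parseCorpus = filter (not . null) . map (mapMaybe parseToken)
            . splitOnBlank . T.lines
  where
    isBlank = T.null . T.strip
    splitOnBlank ls = case break isBlank ls of
      (sent, [])       -> [sent]
      (sent, _ : rest) -> sent : splitOnBlank rest

-- | Read a whole file strictly. Only sensible for corpora that fit in memory.
readCorpus :: FilePath -> IO [Sentence]
readCorpus path = parseCorpus <$> TIO.readFile path
```

Loaded in GHCi, `readCorpus "corpus.tsv"` (file name hypothetical) would give
a list of sentences, each a list of tokens with their annotation columns.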

Shell scripting comes in *very* handy here for automating this process.

Now, everything I've done so far is only *research*, no finished product that
the end user wants to poke on their desktop and have it work interactively. For
that, it might be useful to have some sort of standing server architecture: you
have multiple annotation servers (one that runs in Java, one that runs in
Haskell) and have them communicate the data.

At this point, the benefits might be outweighed by the drawbacks. My love for
Haskell only goes so far.

One hint, if you ever find yourself reading in quantitative linguistic data with
Haskell: forget lazy IO. Forget strict IO too, unless your documents are never
bigger than a few hundred megs. If you're not keeping the whole document in
memory but are keeping some of it around, never keep it in ByteStrings; use
Text or SmallString instead (ByteStrings will invariably leak space in this
scenario). Learn how to use Iteratees and use them judiciously.
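
To make the "keep only some stuff in memory" idea concrete, here is a rough
sketch that counts first-column tokens while reading one line at a time. It is
a plain handle loop, not an iteratee, and it only stands in for the general
shape of incremental processing. The tab-separated layout and the names
countLine, countHandle and countFile are assumptions for illustration. Copying
each key with T.copy is my guess at one common cause of retention (a small
slice keeping its whole source buffer alive), not something measured here.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Text       as T
import qualified Data.Text.IO    as TIO
import           System.IO       (Handle, IOMode (ReadMode), hIsEOF, withFile)

type Counts = M.Map T.Text Int

-- | Count the first column of one line (assumed tab-separated).
-- T.copy gives the key its own storage, so the map does not hold on to
-- the buffer of the line it was sliced from.
countLine :: Counts -> T.Text -> Counts
countLine acc line = case T.split (== '\t') line of
  (form : _) | not (T.null form) -> M.insertWith (+) (T.copy form) 1 acc
  _                              -> acc

-- | Consume a handle line by line, forcing the accumulator at each step
-- so that no chain of unevaluated updates builds up.
countHandle :: Handle -> IO Counts
countHandle h = go M.empty
  where
    go acc = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- TIO.hGetLine h
          let acc' = countLine acc line
          acc' `seq` go acc'

-- | Count tokens in a file without ever holding the whole file in memory.
countFile :: FilePath -> IO Counts
countFile path = withFile path ReadMode countHandle
```

Only the frequency map survives from one line to the next, so memory use
should follow the vocabulary size rather than the corpus size.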

Keep in touch on the Haskell NLP list :-)
Regards,
Aleks