[Haskell-cafe] NLP libraries and tools?

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Thu Jul 7 00:03:00 CEST 2011


On Wed, Jul 06, 2011 at 11:04:30PM +0400, Dmitri O.Kondratiev wrote:
> On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton <wren at freegeek.org> wrote:
> 
> > On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > > Hi,
> > > Continuing my search for Haskell NLP tools and libs, I wonder if the
> > > following Haskell libraries exist (googling them does not help):
> > > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > > meaningful sentences.
> >
> > Depending on how you mean, this is either fairly trivial (for English) or
> > an ill-defined problem. For things like determining whether the "."
> > character is intended as a full stop vs part of an abbreviation; that's
> > trivial.
> >
> > But for general sentence breaking, how do you intend to deal with
> > quotations? What about when news articles quote someone uttering a few
> > sentences before the end-quote marker? So far as I'm aware, there's no
> > satisfactory definition of what the solution should be in all reasonable
> > cases. A "sentence" isn't really very well-defined in practice.
> >
> 
> I am looking for a Haskell implementation of a sentence tokenizer such as
> the one described by Tibor Kiss and Jan Strunk in “Unsupervised Multilingual
> Sentence Boundary Detection”, which is implemented in NLTK:
> 
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
> 
> 
> > > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to
> > > each token.
> >
> > There are numerous approaches to this problem; do you care about the
> > solution, or will any one of them suffice?
> >
> > I've been working over the last year+ on an optimized HMM-based POS
> > tagger/supertagger with online tagging and anytime n-best tagging. I'm
> > planning to release it this summer (i.e., by the end of August), though
> > there are a few things I'd like to polish up before doing so. In
> > particular, I want to make the package less monolithic. When I release it
> > I'll make announcements here and on the nlp@ list.
> 
> 
> I am looking for some already working POS tagging framework that can be
> customized for different pidgin languages.
> 
> 
> > > 3) Chunking. Analyze each tagged token within a sentence and assemble
> > > compound tokens that express logical concepts. Define a custom grammar.
> > >
> > > 4) Extraction. Analyze each chunk and further tag the chunks as named
> > > entities, such as people, organizations, locations, etc.
> > >
> > > Any ideas where to look for similar Haskell libraries?
> >
> > I don't know of any work in these areas in Haskell (though I'd love to
> > hear about it). You should try asking on the nlp@ list where the other
> > linguists and NLPers are more likely to see it.
> >
> >
> I will, though nlp at projects.haskell.org looks very quiet...

Quiet, yes, but, hey, we all start out… never mind, humans start out loud.

Well anyhow, it's quiet, but it's gotta start somewhere. I wouldn't hold my
breath for a full-scale Haskell-native solution to your problem just yet though.

Here's what I'm doing: I usually use external programs to do the heavy lifting
wherever there's no Haskell program for the job. Then I use Haskell (where
applicable) to do the logic, and shell scripts to glue everything together.

So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags,
named entities, and whatnot, then spit out some annotated format, read it in
with Haskell, and do the logic/magic there.
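
To make the "read it in with Haskell" step concrete, here is a minimal sketch
of what the Haskell side might look like. It assumes (this is not something
the external tools guarantee) that the tagger has been set up to write one
sentence per line, with each token printed as word_TAG and tokens separated
by whitespace. The names Tagged, parseToken, parseSentence, parseDocument and
wordsWithTag are made up for illustration; a real pipeline would need a parser
for whatever annotated format you actually emit.

```haskell
import Data.Char (isSpace)

-- | A token paired with the tag the external tool assigned to it.
-- (Hypothetical type, for illustration only.)
data Tagged = Tagged { word :: String, tag :: String }
  deriving (Eq, Show)

-- | Split "word_TAG" at the LAST underscore, so that words which
-- themselves contain underscores survive intact. Tokens with no
-- separator, an empty word, or an empty tag yield Nothing.
parseToken :: String -> Maybe Tagged
parseToken s =
  case break (== '_') (reverse s) of
    (revTag, '_' : revWord)
      | not (null revTag), not (null revWord) ->
          Just (Tagged (reverse revWord) (reverse revTag))
    _ -> Nothing

-- | Parse one line (assumed to be one sentence). The whole sentence is
-- rejected if any single token is malformed.
parseSentence :: String -> Maybe [Tagged]
parseSentence = traverse parseToken . words

-- | Parse a whole file's contents, skipping blank lines. Each sentence
-- is parsed independently, so one bad line does not spoil the rest.
parseDocument :: String -> [Maybe [Tagged]]
parseDocument = map parseSentence . filter (not . all isSpace) . lines

-- | A stand-in for the "logic" step: collect all words with a given tag.
wordsWithTag :: String -> [Tagged] -> [String]
wordsWithTag t ts = [word x | x <- ts, tag x == t]
```

The shell-script glue would then just pipe the tagger's output into a small
program that calls parseDocument on its standard input, for example via
`interact` or `getContents`.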

Complicated, yes. But it gets me around having to code too much in Java. That's
a gain if I've ever seen one.

Regards,
Aleks