[Haskell-cafe] NLP libraries and tools?

wren ng thornton wren at freegeek.org
Thu Jul 7 04:22:41 CEST 2011


On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>> Hi,
>>> Continuing my search of Haskell NLP tools and libs, I wonder if the
>>> following Haskell libraries exist (googling them does not help):
>>> 1) End of Sentence (EOS) Detection. Break text into a collection of
>>> meaningful sentences.
>>
>> Depending on what you mean, this is either fairly trivial (for English)
>> or an ill-defined problem. For things like determining whether the "."
>> character is intended as a full stop or as part of an abbreviation,
>> that's trivial.
>
> I disagree. It's not exactly trivial in the sense that it is solved. It is
> trivial in the sense that, usually, one would use a list of known
> abbreviations and just compare. This, however, just says that the most
> common approach is trivial, not that the problem is.

Perhaps. I recall David Yarowsky suggesting it was considered solved (for
English, as I qualified earlier).

The solution I use is just to look at a window around the point and run a
standard feature-based machine learning algorithm over it[1]. Memorizing
known abbreviations is actually quite fragile, for reasons you mention.
This approach will give you accuracy in the high 90s (percent), though I
forget the exact numbers.


[1] With obvious features like whether the following word is capitalized,
whether the preceding word is capitalized, length of the preceding word,
whether there's another period after the next word,...
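
A minimal sketch of how the features in [1] might be computed, in Haskell.
This is an illustration only, not the code described above: the names
(Features, candidates, and so on) are hypothetical, tokenization is naive
whitespace splitting, and the classifier that would consume the features
is left out.

```haskell
import Data.Char (isUpper)

-- | Features for one candidate "." (all field names are hypothetical).
data Features = Features
  { nextCapitalized :: Bool  -- following word starts with an upper-case letter
  , prevCapitalized :: Bool  -- preceding word starts with an upper-case letter
  , prevLength      :: Int   -- length of the preceding word, without the "."
  , periodAfterNext :: Bool  -- the following word is itself followed by "."
  } deriving (Show, Eq)

startsUpper :: String -> Bool
startsUpper (c:_) = isUpper c
startsUpper []    = False

endsWithPeriod :: String -> Bool
endsWithPeriod w = not (null w) && last w == '.'

-- | One feature record per whitespace-separated token that ends in ".".
-- Assumption: the "window" is just the token itself and the token after it.
candidates :: String -> [(String, Features)]
candidates txt = go (words txt)
  where
    go (w:rest)
      | endsWithPeriod w = (w, feats w rest) : go rest
      | otherwise        = go rest
    go [] = []

    feats w rest = Features
      { nextCapitalized = case rest of { (n:_) -> startsUpper n; [] -> False }
      , prevCapitalized = startsUpper w
      , prevLength      = length (init w)  -- safe: w is non-empty here
      , periodAfterNext = case rest of { (n:_) -> endsWithPeriod n; [] -> False }
      }
```

Each record would then be handed to whatever learner is in use, with a
label saying whether that "." really ends a sentence.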


>> But for general sentence breaking, how do you intend to deal with
>> quotations? What about when news articles quote someone uttering a few
>> sentences before the end-quote marker? So far as I'm aware, there's no
>> satisfactory definition of what the solution should be in all reasonable
>> cases. A "sentence" isn't really very well-defined in practice.
>
> As long as you have one routine and stick to it, you don't need a formal
> definition every linguist will agree on. Computational Linguists (and their
> tools), more often than not, just need a dependable solution, not a
> correct one.

But the problem is that what constitutes an appropriate solution for
computational needs is still very ill-defined. For example, the treatment
of quotations will depend on the grammar theory used in the tagger,
parser, translator, etc. The quality of output is often quite sensitive
to whether EOS markers are meaningfully[2] distributed. Thus, what
constitutes a "dependable" solution often varies depending on the task in
question.[3]

Also, a lot of the tools in this area assume there's some sort of
punctuation marking the end of sentences, even if it's unreliable as an
EOS indicator. That works well enough for languages with European-like
orthographic traditions, but it falls apart quite rapidly when moving to
Southeast Asian languages (e.g., Burmese, Thai, ...). And languages like
Japanese or Arabic can have "sentences" that go on forever, but are best
handled by chunking them into clauses.


[2] In a statistical sense, relative to the structure of the model.

[3] Personally, I think the idea of having a single EOS type is the bulk
of the problem. If we allowed for different kinds of EOS in grammars then
the upstream tools could handle sentence fragments better, which would
make it easier to make fragment breaking reliable.
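
To make the idea in [3] concrete, here is one way several kinds of EOS
could be represented in Haskell. This is a sketch under assumptions: the
post does not enumerate any boundary kinds, so the four constructors and
all other names below are hypothetical.

```haskell
-- | Hypothetical boundary kinds; chosen only for illustration.
data Boundary
  = SentenceEnd   -- ordinary end of a full sentence
  | QuoteEnd      -- end of quoted material inside a larger sentence
  | ClauseBreak   -- clause-level break inside a very long sentence
  | FragmentEnd   -- end of a fragment that is not a full sentence
  deriving (Show, Eq)

-- | A chunk of tokens together with the kind of boundary that closed it.
data Segment = Segment { tokens :: [String], closedBy :: Boundary }
  deriving (Show, Eq)

-- | Collapse to the single-EOS view: only SentenceEnd closes a sentence,
-- every other boundary is merged into the sentence that follows it.
-- Trailing material with no SentenceEnd is kept as a final chunk.
fullSentences :: [Segment] -> [[String]]
fullSentences = go []
  where
    go acc (Segment ts b : rest)
      | b == SentenceEnd = (acc ++ ts) : go [] rest
      | otherwise        = go (acc ++ ts) rest
    go acc [] = [acc | not (null acc)]
```

An upstream tool that understands fragments could consume the Segment
list directly, while one that expects a single EOS type could call
fullSentences first.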


>> I've been working over the last year+ on an optimized HMM-based POS
>> tagger/supertagger with online tagging and anytime n-best tagging. I'm
>> planning to release it this summer (i.e., by the end of August), though
>> there are a few things I'd like to polish up before doing so. In
>> particular, I want to make the package less monolithic. When I release it
>> I'll make announcements here and on the nlp@ list.
>
> I'm very interested in your progress! Keep us posted :-)

Will do :)

-- 
Live well,
~wren




