[Haskell-cafe] NLP libraries and tools?

Rogan Creswick creswick at gmail.com
Sat Jul 2 00:03:36 CEST 2011


On Fri, Jul 1, 2011 at 2:52 PM, Dmitri O.Kondratiev <dokondr at gmail.com> wrote:
> Is there any Haskell word tokenizer other than 'toktok' that compiles and works?
> I need something like:
> http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctTokenizer-class.html
>

I don't think this exists out of the box, but since it appears to be a
basic regex tokenizer, you could use Data.List.Split to create one.
One of the regex libraries may be able to do this more simply.

If you go the Data.List.Split route, I suspect you'll want to create a
Splitter based on the whenElt Splitter:

http://hackage.haskell.org/packages/archive/split/0.1.1/doc/html/Data-List-Split.html#v:whenElt

which takes a function from an element to a Bool.  You can implement
that predicate however you wish, possibly with a regular expression,
as long as it is pure.
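
As a rough sketch of the Data.List.Split route (untested against
split 0.1.1, and only an approximation of what WordPunctTokenizer
does): first split on whitespace and throw the whitespace away, then
split each remaining chunk wherever a non-word character occurs,
keeping runs of punctuation as tokens of their own.  The names
wordPunctTokenize, isWordChar and splitPunct are made up for this
example, and treating "alphanumeric or underscore" as a word character
is my assumption about what the regex \w should mean here.

```haskell
module WordPunct (wordPunctTokenize) where

import Data.Char (isAlphaNum, isSpace)
import Data.List.Split (condense, dropBlanks, dropDelims, split, whenElt)

-- Assumption: a "word" character is alphanumeric or an underscore,
-- roughly what \w matches in a regex.
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- Hypothetical tokenizer: runs of word characters and runs of
-- punctuation become separate tokens; whitespace is discarded.
wordPunctTokenize :: String -> [String]
wordPunctTokenize = concatMap splitPunct . splitOnSpace
  where
    -- Whitespace is a delimiter that is dropped from the output.
    splitOnSpace = split (dropDelims . dropBlanks $ whenElt isSpace)
    -- Non-word characters are delimiters that are kept; condense
    -- merges adjacent ones into a single token such as "!!".
    splitPunct = split (dropBlanks . condense $ whenElt (not . isWordChar))
```

With this definition, wordPunctTokenize "Hello, world!!  It's 9am."
should give ["Hello",",","world","!!","It","'","s","9am","."].  The
predicate passed to whenElt is the only place the token classes are
defined, so that is where you would swap in something else.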

If you want something like a maxent tokenizer, then you're currently
out of luck :( (as far as I know).

--Rogan


