[Haskell-cafe] Stripping text of xml tags and special symbols

Pieter Laeremans pieter at laeremans.org
Tue Aug 5 17:21:43 EDT 2008


Hi,
I've got a lot of files which I need to process in order to make them
indexable by sphinx.
The files contain the data of a website with a custom Perl-based CMS.
Unfortunately they sometimes contain xml/html tags like <i>.

And since most of the texts are in Dutch and some are in French they also
contain a lot of special characters like ë, é, ...

I'm trying to replace the custom Perl-based CMS with a Haskell one.  And
I would like to add search capability. Since someone wrote sphinx
bindings a few weeks ago I thought I would try that.

But transforming the files into something that sphinx accepts seems a
challenge.  Most special character problems seem to go away when I use
encodeString (Codec.Binary.UTF8.String) on the indexable data.

But the sphinx indexer complains that the xml isn't valid.  When I look at
the errors this seems due to some documents containing html that is not
well-formed.
I would like to use a programmatic solution to this problem.
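
For illustration, a minimal sketch of one programmatic approach: drop
everything between a '<' and the next '>' and keep the surrounding text.
The name stripTags is hypothetical and not taken from any library, and
the function uses only the Prelude. It assumes every '<' starts a tag, so
a literal '<' in the text (as in "1 < 2") would swallow what follows it.

```haskell
-- Hypothetical helper, not part of any library.
-- Removes anything that looks like a tag: a '<' up to and including
-- the next '>'. An unclosed '<' discards the rest of the input.
stripTags :: String -> String
stripTags [] = []
stripTags ('<' : rest) = stripTags (drop 1 (dropWhile (/= '>') rest))
stripTags (c : rest) = c : stripTags rest
```

Because it never tries to match opening and closing tags, this copes with
html that is not well-formed. For anything beyond simple cleanup, a
tolerant HTML parser such as the tagsoup package would probably be a
sturdier choice.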

And is there some Haskell function which converts special tokens like & ->
&amp; and é -> &eacute; ?
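
As a sketch of what such a conversion could look like, the hypothetical
function below (escapeXml is an illustrative name, not a library call)
escapes the characters that are special in XML and turns every non-ASCII
character into a numeric character reference. Numeric references are used
here on the assumption that the output is fed to an XML parser: XML
predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so a named HTML
entity such as &eacute; would need a DTD to be valid. The function expects
a String of Unicode characters, so it should be applied before any UTF-8
encoding step.

```haskell
import Data.Char (ord)

-- Hypothetical helper, not part of any library.
-- Escapes XML-special characters and replaces every character above
-- the ASCII range with a decimal numeric character reference.
escapeXml :: String -> String
escapeXml = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '"' = "&quot;"
    escapeChar c
      | ord c > 127 = "&#" ++ show (ord c) ++ ";"
      | otherwise   = [c]
```

With this definition, escapeXml "café & co" yields "caf&#233; &amp; co".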

thanks in advance,

Pieter



-- 
Pieter Laeremans <pieter at laeremans.org>

"The future is here. It's just not evenly distributed yet." W. Gibson