[Haskell-cafe] Lightweight Unicode normalization library

Harendra Kumar harendra.kumar at gmail.com
Fri Mar 25 13:31:15 UTC 2016


I looked at prose and ran some tests. It builds and works well
functionally (the normalization tests pass), but its normalization
performance turns out to be poor: 171 times slower than text-icu in my
test. I believe it can be improved with some changes to the data
structures, though whether the performance matters depends on your use case.

Here are the results of a quick normalization benchmarking test that I did
using text-icu, unicode-transforms (bindings to the utf8proc C library) and
prose:

text-icu           =   1 sec  (224 MB/s on the test machine)
unicode-transforms =   6 sec  (40 MB/s)
prose              = 171 sec  (1.3 MB/s)
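
A minimal sketch of how the throughput and slowdown figures relate to the
timings, assuming the test input was about 224 MB (inferred from 224 MB/s
at roughly 1 sec; the actual input size is not stated). The names inputMB,
throughput and slowdown are illustrative only, and the code uses nothing
beyond base:

```haskell
module Main where

import Text.Printf (printf)

-- Assumed input size in MB, inferred from 224 MB/s at roughly 1 sec.
inputMB :: Double
inputMB = 224

-- Wall-clock seconds per library, copied from the results above.
timings :: [(String, Double)]
timings =
  [ ("text-icu", 1)
  , ("unicode-transforms", 6)
  , ("prose", 171)
  ]

-- Throughput in MB/s for a run that took the given number of seconds.
throughput :: Double -> Double
throughput secs = inputMB / secs

-- Ratio of a run's time to the baseline run's time.
slowdown :: Double -> Double -> Double
slowdown baselineSecs secs = secs / baselineSecs

main :: IO ()
main = mapM_ report timings
  where
    -- text-icu is the baseline; fall back to 1 sec if the list is empty.
    baselineSecs = case timings of
      ((_, s) : _) -> s
      []           -> 1
    report :: (String, Double) -> IO ()
    report (name, secs) =
      printf "%-20s %6.1f MB/s  %4.0fx the text-icu time\n"
        name (throughput secs) (slowdown baselineSecs secs)
```

Under that assumption the 6 sec run works out to roughly 37 MB/s, so the
40 MB/s quoted above is best read as a rounded figure.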

It looks like icu is the gold standard in performance. Even GNU
libunistring's performance seems to be very similar to utf8proc.

-harendra

On 25 March 2016 at 15:57, Harendra Kumar <harendra.kumar at gmail.com> wrote:

> Ah, I created a package for unicode normalization already since I got no
> responses to my mail:
>
> https://github.com/harendra-kumar/unicode-transforms
>
> I will take a look at prose as well since it is native Haskell. It does
> not seem to be on Hackage yet.
>
> -harendra
>
>
> On 25 March 2016 at 05:08, Rob Leslie <rob at mars.org> wrote:
>
>> I don’t have a good answer, but I thought I’d mention this project which
>> looks interesting and I’m considering using myself:
>>
>>     https://github.com/llelf/prose
>>
>> --
>> Rob Leslie
>> rob at mars.org
>>
>>
>> On Mar 17, 2016, at 12:59 AM, Harendra Kumar <harendra.kumar at gmail.com>
>> wrote:
>>
>> I looked around and found only one package, text-icu, which provides
>> Unicode normalization operations and a lot more. But text-icu depends on
>> the ICU library being installed on the system. We would prefer to avoid
>> a dependency on the ICU library.
>>
>> Is there a lightweight alternative which does not depend on icu? It could
>> be a pure Haskell package or bindings to a lightweight C library where the
>> library is small and shipped with the package itself.
>>
>> I wonder whether GHC itself needs Unicode normalization operations
>> anywhere in its code. If so, how does it handle that?
>>
>> I found a lightweight C library (https://github.com/JuliaLang/utf8proc)
>> for normalization and case folding used by the Julia lang project. If there
>> is no other option I am considering creating bindings to this library.
>>
>> Any pointers, thoughts?
>>
>> Thanks,
>> Harendra