[Haskell-cafe] [ANN] unicode-transforms-0.2.0 pure Haskell unicode normalization

Tue Oct 25 17:34:14 UTC 2016

I did not fully compare the two implementations; I just focussed on getting
as much performance out of the Haskell implementation as possible. I can
point to two things that might have allowed it to be faster:

1) I squeezed as much implementation efficiency out of the Haskell code as
I could, so I did not lose there. The code could have been much simpler
without all the optimizations.

2) My implementation may be better in terms of the algorithms and data
structures used. Unicode normalization is complicated, and implementations
can differ in many ways that gain or lose performance.

Beating the utf8proc implementation was easy. The best (highly optimized)
normalization implementation is the ICU C++ implementation, and my target
was to get close to that. I got pretty close to it (using the LLVM backend)
in most benchmarks and even beat it clearly in one benchmark. I have filed a
couple of enhancement requests against GHC; hopefully they will allow it to
be completely on par in all benchmarks, though the difference may not
matter other than to prove that it can be as good.

-harendra

On 25 October 2016 at 22:36, William Yager <will.yager at gmail.com> wrote:

> Interesting! What would you say allowed you to get better decompose
> performance than the C library?
>
> Will
>
> On Tue, Oct 25, 2016 at 11:59 AM, Harendra Kumar <harendra.kumar at gmail.com> wrote:
>
>> Hi,
>>
>> I released unicode-transforms some time back as bindings to a C library
>> (utf8proc). Since then I have rewritten it completely in Haskell. The
>> Haskell data structures are automatically generated from the Unicode
>> database, so it can be kept up to date with the standard, unlike the C
>> implementation, which was stuck at Unicode 5. The implementation comes
>> with a test suite providing 100% code coverage.
>>
>> After a number of algorithmic and implementation efficiency
>> optimizations, I was able to get several times better decompose performance
>> compared to the C implementation. I have not yet got a chance to fully
>> optimize the compose operations but they are still as fast as utf8proc.
>>
>> I would like to thank Antonio Nikishaev for the Unicode character
>> database parsing code, which I borrowed from the prose library.
>>
>> https://github.com/harendra-kumar/unicode-transforms
>> https://hackage.haskell.org/package/unicode-transforms
>>
>> -harendra
>>