[Haskell-cafe] Re: Strings and utf-8

Reinier Lamers reinier.lamers at phil.uu.nl
Thu Nov 29 11:07:35 EST 2007


Thomas Hartman wrote:

>
> A translation of
>
> http://www.ahinea.com/en/tech/perl-unicode-struggle.html
>
> from perl to haskell would be a very useful piece of documentation, I 
> think. 

Perl represents both Unicode text and binary data with the same
(dynamically typed) scalar. Haskell - at least in theory - has two
different types for them, namely [Char] (i.e. String) for characters and
[Word8] or ByteString for sequences of bytes. I think the Haskell
approach is better, because the programmer in most cases knows whether
he wants to treat his data as characters or as bytes. Perl does it the
Perlish "We guess at what the coder means" way, which leads to a lot of
frustration when Perl guesses wrong.
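
To make the [Char] versus [Word8] distinction concrete, here is a minimal
sketch using only the base library. The names encodeChar and encodeString
are made up for this illustration, and the encoder is hand-rolled to keep
the example self-contained; real code would more likely use a library
encoder. It does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single character as its UTF-8 byte sequence.
-- Illustrative only: surrogate code points are not rejected.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Top bits of the code point, placed in the lead byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- Six payload bits wrapped in a 10xxxxxx continuation byte.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Characters in, bytes out: the type says which view you are working with.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

main :: IO ()
main = do
  let s = "na\239ve"               -- "naive" with U+00EF in the middle
  print (length s)                 -- 5 characters
  print (length (encodeString s))  -- 6 bytes
  print (encodeString s)           -- [110,97,195,175,118,101]
```

The point is only that the two lengths differ and that the type checker
will not let you mix the two views by accident.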

The problems of the Haskeller trying to use Unicode, I think, will be 
different from those of the Perl hacker trying to use Unicode: the 
Haskeller will have to search for third-party modules to do what he 
wants, and finding those modules is the problem. The Perl hacker has all 
the Unicode support built in, but has to fight Perl occasionally to keep 
it from doing byte operations on his Unicode data.

I had a colleague here go all but insane last week trying to use 'split'
on a Unicode string in Perl on Windows. split would break the string in
the middle of a multi-byte UTF-8 character, which made UTF-8 processing
fail later on.
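
This is not a reproduction of the Perl problem, just a rough Haskell
sketch of the same failure mode: splitting at a byte offset can cut a
multi-byte sequence in half, while splitting a [Char] cannot. The helper
names (expected, isContinuation, wellFormed) are invented for this
example, and wellFormed is only a structural check; it does not reject
overlong forms or surrogates.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- How many continuation bytes a lead byte announces,
-- or Nothing if the byte cannot start a sequence.
expected :: Word8 -> Maybe Int
expected b
  | b < 0x80           = Just 0
  | b .&. 0xE0 == 0xC0 = Just 1
  | b .&. 0xF0 == 0xE0 = Just 2
  | b .&. 0xF8 == 0xF0 = Just 3
  | otherwise          = Nothing

-- Continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Structural UTF-8 check: every lead byte is followed by exactly
-- the number of continuation bytes it announces.
wellFormed :: [Word8] -> Bool
wellFormed []     = True
wellFormed (b:bs) = case expected b of
  Nothing -> False
  Just n  ->
    let (cs, rest) = splitAt n bs
    in length cs == n && all isContinuation cs && wellFormed rest

main :: IO ()
main = do
  -- UTF-8 bytes for "a", U+00E9, "b": the middle character takes two bytes.
  let bytes  = [0x61, 0xC3, 0xA9, 0x62] :: [Word8]
      (l, r) = splitAt 2 bytes          -- cuts between 0xC3 and 0xA9
  print (wellFormed bytes)              -- True
  print (wellFormed l, wellFormed r)    -- (False,False)
  -- The same split on characters never lands inside a character.
  let (cl, cr) = splitAt 2 "a\233b"
  print (length cl, length cr)          -- (2,1)
```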

Reinier

