[Haskell-i18n] unicode notation \uhhhh implementation

Sven Moritz Hallberg pesco@gmx.de
16 Aug 2002 12:11:50 +0200

On Fri, 2002-08-16 at 11:01, Simon Marlow wrote:
> > > I wasn't aware of that paragraph in the report until recently, and
> > > as far as I know none of the current Haskell implementations
> > > implement the '\uhhhh' escape sequences.
> > 
> > HBC implemented Unicode years ago.
> > 
> >  http://www.math.chalmers.se/~augustss/hbc/lexemes.html
> No, HBC doesn't implement the paragraph of the report that we're
> talking about.  HBC allows the '\uhhhh' escape sequence in character
> and string literals, but not in identifiers and other parts of the
> source.
>
> Also, it's not clear to me why you need the '\uhhhh' escape sequence
> in character and string literals at all, since it appears to mean the
> same thing as '\xhhhh' (the report isn't clear that '\xhhhh' means a
> "unicode code point", but that seems to be the only reasonable
> interpretation).

You're most probably right; this looks like a misinterpretation on HBC's
side to me, too.

> > One reason to use this approach would be if there already existed a
> > preprocessor to do the job - does anyone know of one? 
> > Can't be more than a few lines of Perl.  It's quite short in Haskell too:
> > 
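> >   import Data.Char (chr, isHexDigit)
> >   import Numeric (readHex)
> >
> >   convert :: String -> String
> >   convert ('\\':'u':c1:c2:c3:c4:cs)
> >     | isHexDigit c1 && isHexDigit c2 && isHexDigit c3 && isHexDigit c4
> >     -- readHex returns a list of (value, rest) pairs; take the value
> >     = chr (fst (head (readHex [c1,c2,c3,c4]))) : convert cs
> >     | otherwise                              -- not clear if this is
> >     = error "Malformed unicode sequence"     -- allowed by the spec
> >   convert (c:cs) = c : convert cs
> >   convert [] = []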
> I meant a preprocessor to take source code in some random encoding and
> convert it into ASCII with '\uhhhh' escape sequences.  If there was
> such a thing, then we could all use it and save re-implementing N
> different encodings in each compiler.

There is GNU recode which knows virtually all kinds of codecs. Its
homepage says something about extensibility, so it might be fairly easy
to add our own \uhhhh (or whatever we settle on) escaped ASCII to the
list of codecs. It could then convert back and forth between this and
any other encoding it is aware of. My version reports 281 supported
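
For the direction Simon means (text to escaped ASCII), the escaping
step itself is small once the input has been decoded into a String of
Unicode characters. A rough sketch only: the name escapeNonAscii is
made up here, it assumes every code point fits in four hex digits, and
it does nothing special about backslashes already in the source.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Hypothetical helper: replace every non-ASCII character by \uhhhh.
-- ASCII characters (including existing backslashes) pass through as-is.
escapeNonAscii :: String -> String
escapeNonAscii = concatMap esc
  where
    esc c
      | n < 0x80    = [c]                          -- plain ASCII
      | n <= 0xFFFF = "\\u" ++ pad (showHex n "")  -- four hex digits
      | otherwise   = error "code point does not fit in \\uhhhh"
      where n = ord c
    -- left-pad the hex string with zeros to exactly four digits
    pad s = replicate (4 - length s) '0' ++ s
```

Decoding the various input encodings is the part that would be left to
a tool like recode; this only covers the final escaping.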

Sven Moritz