[C2hs] CHS lexer goes into infinite loop on chars > 255

Sun Dec 13 23:03:57 EST 2009

Duncan Coutts:
> Found another bug that surfaces when we compile c2hs with ghc-6.12.
> 
> By default text files are now read in the locale encoding rather than
> just ASCII. This means we can (and do) get characters over 255. The
> behaviour is that c2hs goes into an infinite loop and consumes all the
> memory on your machine (in particular this happens with some files in
> gtk2hs).
> 
> Unfortunately the 255 assumption is pretty strongly wired into the c2hs
> lexer. From Lexer.hs:
> 
> -- * Unicode poses a problem as the character domain becomes too big
> -- for using arrays to represent transition tables and even sparse
> -- structures will pose a significant overhead when character ranges
> -- are naively represented. So, it might be time for finite maps again.
> 
> The short term solution is to set the text mode to be ASCII. In the
> longer term we might want to replace the .chs lexer and parser, like we
> did already for the C parser.

Yes, that makes sense.  At the time, Unicode support in GHC was still far away.

Manuel
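
For reference, a minimal sketch of the short-term workaround discussed
above, assuming it is done by forcing a single-byte encoding on the input
handle. The helper name readFileLatin1 is hypothetical and not taken from
the c2hs sources; hSetEncoding and latin1 are the System.IO functions
available from ghc-6.12 on. With latin1 each input byte becomes exactly
one Char, so no character above 255 can reach the lexer.

```haskell
import System.IO

-- Hypothetical helper (not from c2hs): read a whole file so that every
-- Char is in the range 0..255, one Char per input byte.
readFileLatin1 :: FilePath -> IO String
readFileLatin1 path = do
  h <- openFile path ReadMode
  -- Override the locale encoding that ghc-6.12 applies by default.
  hSetEncoding h latin1
  s <- hGetContents h
  -- hGetContents is lazy; force the whole string before closing.
  length s `seq` hClose h
  return s
```

Note that this only keeps the lexer's 255 assumption valid; multi-byte
characters in a .chs file would show up as several Latin-1 characters
rather than being decoded.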