[Haskell'-private] pragmas and annotations (RE: the record system)

Tue Feb 28 10:10:46 EST 2006

"Simon Marlow" <simonmar at microsoft.com> wrote:

> How does ENCODING work for a UTF-16 file, for example?
> We don't know the file is UTF-16 until we read the ENCODING pragma,
> and we can't read the ENCODING pragma because it's in UTF-16. 

Use the same type of heuristic as XML uses (for instance).

  * If the first three bytes of the file are "{-#", then keep reading in
    ASCII/Latin-1/whatever until you discover an ENCODING decl (or not).

  * If the first six bytes of the file are one of the two possible
    UTF-16 representations of "{-#", then assume UTF-16 with that
    byte order until you find the ENCODING decl.  (A missing decl in
    this case would be an error.)

  * If the first twelve bytes of the file are a UCS-4 representation of
    "{-#" then ... you get the picture.

  * For UTF-16 and UCS-4 variations, you must also permit the file to
    begin with an optional byte-order mark (two or four bytes).

  * Otherwise, there is no ENCODING pragma, so assume the implementation
    default of {ASCII, Latin-1, UTF-8, ...}.
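
A minimal sketch of the first-bytes check described above, assuming the
bytestring package.  The names Guess, opener, widen, candidates and
sniff are made up for illustration and are not part of any proposal.
It only classifies the start of the file; scanning on for the ENCODING
decl itself is left out.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Encoding family guessed from the first bytes of a source file.
--   (Hypothetical type, for illustration only.)
data Guess
  = AsciiCompatible    -- "{-#" seen as single bytes; read on for ENCODING
  | Utf16BE | Utf16LE  -- "{-#" in 2-byte units, optional BOM in front
  | Ucs4BE  | Ucs4LE   -- "{-#" in 4-byte units, optional BOM in front
  | NoPragma           -- fall back to the implementation default
  deriving (Eq, Show)

-- | The pragma opener "{-#" as ASCII code points.
opener :: [Word8]
opener = [0x7B, 0x2D, 0x23]

-- | Encode the opener in fixed-width units of n bytes each,
--   zero-padded on the left (big-endian) or right (little-endian).
widen :: Int -> Bool -> B.ByteString
widen n bigEndian = B.pack (concatMap unit opener)
  where
    pad = replicate (n - 1) 0
    unit c
      | bigEndian = pad ++ [c]
      | otherwise = c : pad

-- | Every prefix the heuristic recognises.  Each wide form is listed
--   twice: once preceded by its byte-order mark and once without.
candidates :: [(B.ByteString, Guess)]
candidates =
  [ (B.append bom (widen n be), g)
  | (n, be, bomBytes, g) <-
      [ (4, True,  [0x00, 0x00, 0xFE, 0xFF], Ucs4BE)
      , (4, False, [0xFF, 0xFE, 0x00, 0x00], Ucs4LE)
      , (2, True,  [0xFE, 0xFF],             Utf16BE)
      , (2, False, [0xFF, 0xFE],             Utf16LE)
      ]
  , bom <- [B.pack bomBytes, B.empty]
  ] ++ [(B.pack opener, AsciiCompatible)]

-- | Classify a file by its leading bytes.
sniff :: B.ByteString -> Guess
sniff bytes =
  case [g | (prefix, g) <- candidates, prefix `B.isPrefixOf` bytes] of
    (g : _) -> g
    []      -> NoPragma
```

Loaded into GHCi, sniff applied to the result of B.readFile on a source
file gives the family to decode with while looking for the ENCODING
decl.  None of the candidate prefixes is a prefix of another, so the
order of the list does not affect the result.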

I know this heuristic is pretty horrible, but it seems to work in
practice for the XML people.  The ENCODING decl is most needed for
those encodings that have ASCII as a subset - one could argue that the
heuristic already tells you the UTF-16 and UCS-4 variations without
needing a pragma.  (But then, how would you guarantee that the first
three characters in the file are "{-#"?)

Regards,
    Malcolm