[Haskell-cafe] What's the status with unicode characters on haddock ?

Thomas ten Cate ttencate at gmail.com
Fri Jul 10 09:06:26 EDT 2009


I ran a little experiment of my own, using a GHC HEAD build from about
a week ago. Here's a hex dump of my test source, so that we can see
that it really is UTF-8.

$ od -xc Test.hs
0000000 6f6d 7564 656c 4d20 6961 206e 6877 7265
          m   o   d   u   l   e       M   a   i   n       w   h   e   r
0000020 0a65 2d0a 202d 207c 7250 6e69 7374 7420
          e  \n  \n   -   -       |       P   r   i   n   t   s       t
0000040 6568 7420 7865 2074 4822 6c65 6f6c 7720
          h   e       t   e   x   t       "   H   e   l   l   o       w
0000060 726f 646c 2e22 2d0a 202d 6548 6572 7327
          o   r   l   d   "   .  \n   -   -       H   e   r   e   '   s
0000100 6120 6520 7275 206f 6973 6e67 202c 82e2
              a       e   u   r   o       s   i   g   n   ,     342 202
0000120 20ac 5528 322b 4130 2943 202c 6e61 2064
        254       (   U   +   2   0   A   C   )   ,       a   n   d
0000140 6e61 6520 656c 656d 746e 6f2d 2066 6973
          a   n       e   l   e   m   e   n   t   -   o   f       s   i
0000160 6e67 203a 88e2 208a 5528 322b 3032 2941
          g   n   :     342 210 212       (   U   +   2   2   0   A   )
0000200 0a2e 616d 6e69 3a20 203a 4f49 2820 0a29
          .  \n   m   a   i   n       :   :       I   O       (   )  \n
0000220 616d 6e69 3d20 7020 7475 7453 4c72 206e
          m   a   i   n       =       p   u   t   S   t   r   L   n
0000240 4822 6c65 6f6c 7720 726f 646c 0a22
          "   H   e   l   l   o       w   o   r   l   d   "  \n
0000256
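
As a cross-check of the dump above, the two multi-byte sequences can be decoded by hand. Note that od -x prints 16-bit words in host byte order, so on a little-endian machine the word 82e2 stands for the bytes e2 82. The following is only a minimal Python sketch: the helper name words_to_bytes is made up for illustration, and the word values are copied from the dump.

```python
def words_to_bytes(words):
    """Turn od -x style 16-bit hex words (little-endian host) back into bytes."""
    out = bytearray()
    for w in words:
        value = int(w, 16)
        out.append(value & 0xFF)         # low byte comes first in the file
        out.append((value >> 8) & 0xFF)  # high byte comes second
    return bytes(out)

# Words copied from the Test.hs dump; the fourth byte of each pair is the
# space that follows the character, so only the first three bytes are kept.
euro = words_to_bytes(["82e2", "20ac"])[:3]
elem = words_to_bytes(["88e2", "208a"])[:3]

for raw in (euro, elem):
    ch = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    print(raw.hex(), "-> U+%04X" % ord(ch))
# prints:
# e282ac -> U+20AC
# e2888a -> U+220A
```

Both sequences decode to the code points named in the comment text, which is consistent with the source file being UTF-8.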

Then I invoked
$ haddock -h Test.hs

The generated Main.html contains this tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
Firefox picks this up: in the View menu, Character Encoding is
set to UTF-8.

Yet I see little blocks instead of the characters from my source file! Why?
$ od -xc Main.html
...
0003220 6120 6520 7275 206f 6973 6e67 202c 2004
              a       e   u   r   o       s   i   g   n   ,     004
...
0003260 6520 656c 656d 746e 6f2d 2066 6973 6e67
              e   l   e   m   e   n   t   -   o   f       s   i   g   n
0003300 203a 2004 5528 322b 3032 2941 0a2e 2f3c
          :     004       (   U   +   2   2   0   A   )   .  \n   <   /

It seems that Haddock replaced each of the two three-byte characters
with a single 0x04 (ASCII end-of-transmission) byte! Apparently you've
hit a bug in Haddock. Since GHC expects Haskell source files to be
UTF-8, and the HTML file Haddock produces is declared as UTF-8, this
is clearly incorrect behaviour.
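
One way to spot this kind of damage without reading a hex dump is to scan the output for stray control bytes. The sketch below is only an illustration, not part of Haddock: the function name find_control_bytes is hypothetical, and the test fragment is reconstructed from the Main.html dump above. It also shows why a plain UTF-8 validity check would not catch the problem, since 0x04 is itself a valid one-byte UTF-8 sequence.

```python
def find_control_bytes(data):
    """Return (offset, byte) pairs for C0 control bytes other than tab, LF and CR."""
    allowed = {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]

# Fragment reconstructed from the Main.html dump (hypothetical test input).
fragment = b"sign: \x04 (U+220A).\n"

# The fragment still decodes as UTF-8, so a decoder alone will not flag it.
fragment.decode("utf-8")

for offset, byte in find_control_bytes(fragment):
    print("offset %d: 0x%02x" % (offset, byte))
# prints:
# offset 6: 0x04
```

To run the same check on the real file, one could pass open("Main.html", "rb").read() to find_control_bytes.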

Thomas

