[Haskell-cafe] Unicode pretty-printing

Peter Gromov gromopetr at gmail.com
Sun Aug 29 06:24:22 EDT 2010


Thanks for getting back to me. I was imprecise: by "UTF8 characters" I
meant Unicode characters. My source files are UTF-8-encoded, and GHC
reads them fine; it only has problems outputting them in a readable
way. At this point I'm not talking about any I/O besides plain console
output.

Avoiding Show is not much of an option, since I'm using HUnit, which
uses Show and prints the test results via the standard output
functions. I've tried wrapping my strings in a type whose Show
instance doesn't escape anything, but the standard output functions
don't accept the result, and HUnit doesn't know anything about
System.IO.UTF8:

----
import System.IO.UTF8            -- from the utf8-string package
import qualified System.IO
import Test.HUnit

-- Wrapper whose Show instance returns the string unescaped.
newtype UString = UString String

instance Show UString where
  show (UString s) = s
instance Eq UString where
  (==) (UString s1) (UString s2) = s1 == s2

-- Deliberately failing test, so that HUnit has to print both strings.
test1 = TestCase (assertEqual "fail" (UString "абв") (UString "где"))

main =
  System.IO.hSetBinaryMode System.IO.stdout True >>
  System.IO.UTF8.putStrLn "это тест"
---------
Prelude> :load utest.hs
[1 of 1] Compiling Main             ( utest.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
это тест
*Main> runTestTT test1
### Failure:
fail
expected: *** Exception: <stderr>: hPutChar: invalid argument (Illegal
byte sequence)
---------
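
One thing the transcript suggests: the exception names <stderr>, not
stdout, and main only touches stdout. As far as I can tell, runTestTT
writes its report to stderr, which still uses the locale encoding. A
minimal sketch of a possible workaround, assuming GHC 6.12 or later
(where System.IO exports hSetEncoding and utf8) and a terminal that
really displays UTF-8. The names useUtf8Console and stderrEncoding are
made up for this sketch; they are not part of HUnit or utf8-string.

```haskell
import System.IO (hGetEncoding, hSetEncoding, stderr, stdout, utf8)

-- Hypothetical helper: make both console handles encode Chars as
-- UTF-8, regardless of what the current locale says.
useUtf8Console :: IO ()
useUtf8Console = do
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8  -- the handle named in the exception

-- Hypothetical diagnostic: the name of the encoding stderr currently
-- uses, or "binary" if the handle is in binary mode.
stderrEncoding :: IO String
stderrEncoding = fmap (maybe "binary" show) (hGetEncoding stderr)
```

In GHCi this would be run once before the tests, e.g. useUtf8Console
followed by runTestTT test1. With the handles set up this way, the
plain Prelude putStrLn should be used instead of the hSetBinaryMode /
System.IO.UTF8.putStrLn pair, since combining the two would presumably
encode the text twice.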

I've also tried replacing UString X in the test with Data.Text.pack X,
and even, in desperation, with Data.Text.Encoding.encodeUtf8
(Data.Text.pack X), but no dice: this time, instead of a crash, I get
the good old escapes.
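
The escapes come from show itself rather than from the output
functions, and as far as I can tell the Show instances of Text and
ByteString go through the same String-style escaping, which would
explain why switching types changes nothing. A small sketch using only
the Prelude; the names sample and escaped are just for illustration.

```haskell
-- The first string from the failing test.
sample :: String
sample = "абв"

-- show wraps the string in double quotes and replaces every
-- non-ASCII character with a backslash escape of its code point.
-- This escaped form is what a Show-based test report prints.
escaped :: String
escaped = show sample
```

Evaluating putStrLn escaped in GHCi prints only ASCII, so it works
under any console encoding; putStrLn sample needs a handle whose
encoding can represent Cyrillic.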

On 29 August 2010 00:09, Yitzchak Gale <gale at sefer.org> wrote:
> Peter Gromov wrote:
>> Unfortunately, Haskell escapes UTF8 characters.
>
> What do you mean by "UTF8 characters"?
>
> Each element of the Char type represents a single Unicode
> character, not encoded in UTF-8 or any other encoding.
>
> When you read a text file using the traditional IO functions,
> recent versions of GHC will use the encoding of the
> "current locale" (whatever that means on your system)
> to decode the input into Unicode, unless you specify
> otherwise. The same is true for writing to the console or
> to a file.
>
> As Don pointed out, you may be interested in
> using the newer Data.Text instead, especially when
> encodings matter to you. It will usually be faster than
> traditional IO, and it is designed to be the new standard
> for representing text in Haskell.
>
> A third option would be to read the data as raw binary
> bytes, without any decoding, using Data.ByteString.
> Then it is totally up to you to do any decoding or
> encoding.
>
> In any case, the standard Show instances will not be
> able to do a very good job of displaying non-ASCII
> characters; Show cannot make very many assumptions
> about your data or your environment. As Don suggested,
> you may want to define your own type class similar to
> Show that does what you want.
>
> Regards,
> Yitz
>
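
Yitz's last suggestion, a Show-like class of one's own, could look
roughly like the sketch below. The class name Pretty and its method
pretty are made up for illustration, and UString is the same wrapper
as in the test file above. Note that HUnit's assertEqual insists on
Show, so a class like this would only help together with a custom
assertion built on assertBool or assertFailure.

```haskell
-- Hypothetical Show-like class that renders values for humans and
-- never escapes non-ASCII characters.
class Pretty a where
  pretty :: a -> String

-- Same wrapper as in the test file.
newtype UString = UString String

instance Pretty UString where
  pretty (UString s) = s      -- the characters as they are

instance Pretty Int where
  pretty = show               -- digits are plain ASCII anyway

instance (Pretty a, Pretty b) => Pretty (a, b) where
  pretty (x, y) = "(" ++ pretty x ++ ", " ++ pretty y ++ ")"
```

For example, pretty (UString "абв", 3 :: Int) builds the string from
the unescaped characters. Printing it still depends on the encoding of
the handle it is written to.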

