[Haskell-cafe] Data.Text UTF-8 question

Gregory Collins greg at gregorycollins.net
Fri Aug 31 09:27:22 CEST 2012


On Fri, Aug 31, 2012 at 7:59 AM, jeff p <mutjida at gmail.com> wrote:

> Hello,
>
> I have a sample file (attached) which I cannot read into Text:
>
>     Prelude Control.Applicative> Data.Text.IO.readFile "foo"
>     *** Exception: utf8.txt: hGetContents: invalid argument (invalid byte sequence)
>
>     Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$> Data.ByteString.Char8.readFile "foo"
>     "*** Exception: Cannot decode byte '\x6e': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
>
> So it seems that foo doesn't contain valid UTF-8. However,
> System.IO.UTF8 has no problem reading the data:
>
>     Prelude Control.Applicative> System.IO.UTF8.readFile "foo"
>     "3591,,,dihigma99h,1905,5,25,CUBA,,Matanzas,1971,5,20,CUBA,,Cienfuegos,Martin,Dihigo,,Mart\65533n Magdaleno Dihigo (Llanos),,190,74,R,R,,,,dihigma99,dihigma99,dihim001,dihigma99,dihigma99\r\n"
>
> Shouldn't these all have the same behavior?
>

\65533 is the Unicode replacement character, U+FFFD. Its presence means that
the source text is not valid UTF-8: the decoder in System.IO.UTF8 silently
replaces the bad bytes, while the other two throw an exception. If you want
the same behaviour from the Text decoder, use
Data.Text.Encoding.decodeUtf8With, which takes an error handler (for
example, lenientDecode from Data.Text.Encoding.Error) and lets you replicate
this. It's more likely, however, that your input is in some other encoding,
such as ISO-8859-1. Use the text-icu package (
http://hackage.haskell.org/package/text-icu) to decode it.
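
As a rough sketch of the two options, assuming a reasonably recent text
package: the first function decodes leniently as UTF-8, the second treats the
bytes as ISO-8859-1 using decodeLatin1 from text itself (text-icu would be the
route for other encodings). The function names and the sample bytes are made
up for illustration; the sample only mimics the pattern the error message
hints at, a lone 0xED byte followed by 'n'.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- | Decode as UTF-8, substituting U+FFFD for input that is not valid
-- UTF-8 instead of throwing an exception.
-- (Hypothetical helper name, not part of the text package.)
decodeLenient :: B.ByteString -> T.Text
decodeLenient = TE.decodeUtf8With TEE.lenientDecode

-- | Decode as ISO-8859-1 (Latin-1): every byte maps directly to the
-- code point with the same number, so this can never fail.
-- (Hypothetical helper name, not part of the text package.)
decodeAsLatin1 :: B.ByteString -> T.Text
decodeAsLatin1 = TE.decodeLatin1

main :: IO ()
main = do
  -- Illustrative bytes: "Mart", then 0xED, then "n".
  -- 0xED is i-acute in Latin-1 but an incomplete sequence in UTF-8.
  let sample = B.pack [0x4d, 0x61, 0x72, 0x74, 0xed, 0x6e]
  print (decodeLenient sample)   -- contains the replacement character
  print (decodeAsLatin1 sample)  -- contains U+00ED
```

The lenient version hides the problem rather than fixing it, so if the file
really is Latin-1, decoding it as such is the one that keeps the accented
characters.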

G
-- 
Gregory Collins <greg at gregorycollins.net>