[GHC] #15553: GHC.IO.Encoding not flushing partially converted input
GHC
ghc-devs at haskell.org
Tue Aug 21 16:11:13 UTC 2018
#15553: GHC.IO.Encoding not flushing partially converted input
-------------------------------------+-------------------------------------
Reporter: msakai | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Core Libraries | Version: 8.4.3
Keywords: | Operating System: Linux
Architecture: Unknown/Multiple | Type of failure: Incorrect result at runtime
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
Conversion by `GHC.IO.Encoding` produces incomplete output for some
encodings because it does not flush ''partially converted input'' at the
end of the string.
[https://manpages.debian.org/stretch/manpages-dev/iconv.3 iconv(3)]
provides an API for this flushing:
> In each series of calls to iconv(), the last should be one with inbuf or
> *inbuf equal to NULL, in order to flush out any partially converted input.
However, `GHC.IO.Encoding` does not perform this flush, which can leave the
conversion result incomplete.
I found two cases in which it actually produces incomplete output; there
may be more.
= Case 1: EUC-JISX0213
For example, the following code is expected to output the two bytes
`0xa4 0xb1`, but it outputs nothing.
{{{#!hs
import System.IO

main :: IO ()
main = do
  enc <- mkTextEncoding "EUC-JISX0213"
  withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3051"
}}}
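To see which bytes actually reach the file, the output can be read back in binary mode. The following is only a sketch of such a check: the helper name `writtenBytes` and its path argument are made up for illustration, and the result for a given encoding depends on the GHC version and on the iconv implementation in use.

```haskell
import Data.Char (ord)
import System.IO

-- Hypothetical helper (not part of base): write a string through the named
-- encoding, then read the file back byte by byte.
writtenBytes :: FilePath -> String -> String -> IO [Int]
writtenBytes path encName str = do
  enc <- mkTextEncoding encName
  withFile path WriteMode $ \h -> hSetEncoding h enc >> hPutStr h str
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    -- Force the lazy contents before withBinaryFile closes the handle.
    length s `seq` return (map ord s)

main :: IO ()
main = writtenBytes "test.txt" "EUC-JISX0213" "\x3051" >>= print
```

An empty list here would correspond to the missing output described above.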
The problem happens because of the following mapping between Unicode and
EUC-JISX0213.
||Unicode||EUC-JISX0213||
||U+3051 U+309A||0xa4 0xfa||
||U+3051||0xa4 0xb1||
After seeing the code point U+3051, the converter cannot determine which
of the two byte sequences to output until it sees the next character or
''the end of the string''. But `GHC.IO.Encoding` does not call the
''flushing'' API mentioned above, so the converter never learns that the
string has ended.
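The effect can be illustrated with a toy model. This is not GHC's or iconv's implementation; it is a minimal sketch that models only the two table rows above, and the names `step`, `flush` and `encode` are invented for the example. The encoder holds back U+3051 until it knows what follows, so the held-back bytes only appear if a flush step runs at the end of the input.

```haskell
import Data.Word (Word8)

-- State: True when U+3051 has been read but not yet emitted.
-- Feed one character; return the bytes emitted and the new state.
step :: Bool -> Char -> ([Word8], Bool)
step True  '\x309A' = ([0xa4, 0xfa], False)          -- combined sequence
step True  c        = let (bs, p) = step False c     -- emit pending, retry c
                      in ([0xa4, 0xb1] ++ bs, p)
step False '\x3051' = ([], True)                     -- hold back, emit nothing
step False _        = ([0x3f], False)                -- placeholder '?' for unmodelled input

-- End of input: emit whatever is still pending.
flush :: Bool -> [Word8]
flush True  = [0xa4, 0xb1]
flush False = []

-- The first argument says whether the flush step is performed at the end.
encode :: Bool -> String -> [Word8]
encode doFlush = go False
  where
    go p []     = if doFlush then flush p else []
    go p (c:cs) = let (bs, p') = step p c in bs ++ go p' cs

main :: IO ()
main = do
  print (encode False "\x3051")        -- []         (no flush: output lost)
  print (encode True  "\x3051")        -- [164,177]  (0xa4 0xb1)
  print (encode True  "\x3051\x309A")  -- [164,250]  (0xa4 0xfa)
```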
= Case 2: ISO-2022-JP
Similarly, the following code is expected to output the byte sequence
`0x1b 0x24 0x42` `0x24 0x22` `0x1b 0x28 0x42`, but the last three bytes
`0x1b 0x28 0x42` are not produced.
{{{#!hs
import System.IO

main :: IO ()
main = do
  enc <- mkTextEncoding "ISO-2022-JP"
  withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3042"
}}}
ISO-2022-JP is a stateful encoding, and
[https://www.ietf.org/rfc/rfc1468.txt RFC 1468] requires that the state be
reset to the initial state at the end of the string. The missing three bytes
`0x1b 0x28 0x42` are the escape sequence for that purpose. But again
`GHC.IO.Encoding` does not call the ''flushing'' API mentioned above, so the
converter cannot recognize the end of the string and cannot reset the state.
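The same point can be sketched for a stateful encoding. Again, this is a toy model under stated assumptions rather than the real converter: it handles only ASCII and U+3042, and the names `Charset`, `escapeTo`, `classify`, `step`, `flush` and `encode` are invented for the example. The encoder tracks the currently selected character set, and only the flush step emits the escape sequence that returns to ASCII.

```haskell
import Data.Word (Word8)

-- The character set currently selected in the output stream.
data Charset = Ascii | Jisx0208 deriving (Eq, Show)

escapeTo :: Charset -> [Word8]
escapeTo Ascii    = [0x1b, 0x28, 0x42]   -- ESC ( B
escapeTo Jisx0208 = [0x1b, 0x24, 0x42]   -- ESC $ B

-- Only U+3042 is modelled as a JIS X 0208 character; everything else is
-- treated as ASCII (assumes such characters are below U+0080).
classify :: Char -> (Charset, [Word8])
classify '\x3042' = (Jisx0208, [0x24, 0x22])
classify c        = (Ascii, [fromIntegral (fromEnum c)])

-- Feed one character, switching character set first if necessary.
step :: Charset -> Char -> ([Word8], Charset)
step cur c =
  let (want, bytes) = classify c
      esc = if want == cur then [] else escapeTo want
  in (esc ++ bytes, want)

-- End of input: return to the initial (ASCII) state.
flush :: Charset -> [Word8]
flush Ascii = []
flush _     = escapeTo Ascii

-- The first argument says whether the flush step is performed at the end.
encode :: Bool -> String -> [Word8]
encode doFlush = go Ascii
  where
    go cur []     = if doFlush then flush cur else []
    go cur (c:cs) = let (bs, cur') = step cur c in bs ++ go cur' cs

main :: IO ()
main = do
  print (encode False "\x3042")  -- [27,36,66,36,34]           (reset missing)
  print (encode True  "\x3042")  -- [27,36,66,36,34,27,40,66]  (reset emitted)
```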
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15553>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler