[GHC] #15553: GHC.IO.Encoding not flushing partially converted input

Tue Aug 21 16:11:13 UTC 2018

#15553: GHC.IO.Encoding not flushing partially converted input
-------------------------------------+-------------------------------------
           Reporter:  msakai         |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:  8.6.1
          Component:  Core           |           Version:  8.4.3
  Libraries                          |
           Keywords:                 |  Operating System:  Linux
       Architecture:                 |   Type of failure:  Incorrect result
  Unknown/Multiple                   |  at runtime
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Conversion by `GHC.IO.Encoding` produces incomplete output for some
 encodings because it does not flush ''partially converted input'' at the
 end of the string.

 [https://manpages.debian.org/stretch/manpages-dev/iconv.3 iconv(3)]
 provides an API for this flushing:

 > In each series of calls to iconv(), the last should be one with inbuf or
 *inbuf equal to NULL, in order to flush out any partially converted input.

 But `GHC.IO.Encoding` does not perform this flushing properly, which can
 lead to an incomplete conversion result.
 I found two cases in which it actually produces incomplete output, but
 there might be more.
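
 To make the flushing call concrete, here is a minimal sketch that calls
 iconv(3) directly through the FFI, bypassing `GHC.IO.Encoding`. It assumes
 a Linux system whose C library exports `iconv_open`, `iconv` and
 `iconv_close` as plain symbols (as glibc does) and supports EUC-JISX0213.
 The names `c_iconv`, `written` and `utf8Ke` are made up for this example
 and are not part of `base`.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Word (Word8)
import Foreign (Ptr, allocaBytes, castPtr, minusPtr, nullPtr, peek,
                peekArray, plusPtr, with, withArrayLen)
import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CChar, CInt (..), CSize (..))

-- iconv_t is treated as an opaque pointer-sized handle (assumption).
type IConv = Ptr ()

foreign import ccall unsafe "iconv_open"
  c_iconv_open :: CString -> CString -> IO IConv

foreign import ccall unsafe "iconv"
  c_iconv :: IConv -> Ptr (Ptr CChar) -> Ptr CSize
          -> Ptr (Ptr CChar) -> Ptr CSize -> IO CSize

foreign import ccall unsafe "iconv_close"
  c_iconv_close :: IConv -> IO CInt

-- | U+3051 encoded as UTF-8; used as the input of the conversion.
utf8Ke :: [Word8]
utf8Ke = [0xe3, 0x81, 0x91]

-- | Number of bytes written so far into the buffer starting at 'start'.
written :: Ptr CChar -> Ptr (Ptr CChar) -> IO Int
written start cursor = (`minusPtr` start) <$> peek cursor

main :: IO ()
main =
  withCString "EUC-JISX0213" $ \to ->
  withCString "UTF-8" $ \from -> do
    cd <- c_iconv_open to from
    -- iconv_open returns (iconv_t)-1 on failure.
    if cd == nullPtr `plusPtr` (-1)
      then putStrLn "iconv_open failed (encoding not supported?)"
      else do
        withArrayLen utf8Ke $ \n inp ->
          allocaBytes outCap $ \out ->
          with (castPtr inp :: Ptr CChar) $ \inCur ->
          with (fromIntegral n :: CSize) $ \inLeft ->
          with out $ \outCur ->
          with (fromIntegral outCap :: CSize) $ \outLeft -> do
            -- Ordinary conversion call: consumes the input bytes.
            _ <- c_iconv cd inCur inLeft outCur outLeft
            before <- written out outCur
            -- Flushing call: inbuf is NULL, outbuf is still supplied.
            _ <- c_iconv cd nullPtr nullPtr outCur outLeft
            after <- written out outCur
            bytes <- peekArray after (castPtr out :: Ptr Word8)
            putStrLn ("bytes written before flush: " ++ show before)
            putStrLn ("bytes written after flush:  " ++ show bytes)
        _ <- c_iconv_close cd
        return ()
  where
    outCap = 16 :: Int
```

 Return values of `iconv` are ignored here for brevity; a real caller
 would check them and `errno`. Comparing the two printed lines shows
 whether the converter held back output until the flushing call.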

 = Case 1: EUC-JISX0213

 For example, the following code is expected to output the two bytes
 `0xa4 0xb1`, but it outputs nothing.

 {{{#!hs
 import System.IO

 main :: IO ()
 main = do
   enc <- mkTextEncoding "EUC-JISX0213"
   withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3051"
 }}}

 The problem happens because of the following mapping between Unicode and
 EUC-JISX0213.

 ||Unicode||EUC-JISX0213||
 ||U+3051 U+309A||0xa4 0xfa||
 ||U+3051||0xa4 0xb1||

 After seeing the code point U+3051, the converter cannot determine which
 of the two byte sequences to output until it sees the next character or
 ''the end of the string''. But `GHC.IO.Encoding` does not call the
 above-mentioned ''flushing'' API, so the converter never learns that the
 string has ended.
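
 The lookahead problem can be illustrated with a small pure model. This is
 a toy sketch, not the iconv implementation: it knows only the two table
 rows above, maps every other character to `?`, and all names (`State`,
 `step`, `flush`, `encode`) are invented for the example.

```haskell
import Data.Word (Word8)

-- | Converter state: either nothing is pending, or U+3051 has been seen
-- and is being held back in case U+309A follows.
data State = Idle | HeldKe
  deriving (Eq, Show)

-- | Feed one character; returns the new state and the bytes that are final.
step :: State -> Char -> (State, [Word8])
step HeldKe '\x309A' = (Idle, [0xa4, 0xfa])          -- combined sequence
step HeldKe c        = let (s, out) = step Idle c    -- emit the held char,
                       in (s, [0xa4, 0xb1] ++ out)   -- then handle c
step Idle   '\x3051' = (HeldKe, [])                  -- hold back, emit nothing
step Idle   _        = (Idle, [0x3f])                -- toy fallback: '?'

-- | End-of-input flush: emit whatever is still held back.
flush :: State -> [Word8]
flush HeldKe = [0xa4, 0xb1]
flush Idle   = []

-- | Encode a string, with or without the final flush.
encode :: Bool -> String -> [Word8]
encode doFlush = go Idle
  where
    go s []       = if doFlush then flush s else []
    go s (c : cs) = let (s', out) = step s c in out ++ go s' cs

main :: IO ()
main = do
  print (encode False "\x3051")        -- [] (the behaviour reported here)
  print (encode True  "\x3051")        -- [164,177], i.e. 0xa4 0xb1
  print (encode True  "\x3051\x309A")  -- [164,250], i.e. 0xa4 0xfa
```

 Without the flush, the model loses the held character in exactly the way
 described above.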

 = Case 2: ISO-2022-JP

 Similarly, the following code is expected to output the byte sequence
 `0x1b 0x24 0x42` `0x24 0x22` `0x1b 0x28 0x42`, but the last three bytes
 `0x1b 0x28 0x42` are not produced.

 {{{#!hs
 import System.IO

 main :: IO ()
 main = do
   enc <- mkTextEncoding "ISO-2022-JP"
   withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3042"
 }}}

 ISO-2022-JP is a stateful encoding, and
 [https://www.ietf.org/rfc/rfc1468.txt RFC 1468] requires that the state
 be reset to the initial state at the end of the string. The missing three
 bytes `0x1b 0x28 0x42` are the escape sequence for that purpose. But
 again, `GHC.IO.Encoding` does not call the above-mentioned ''flushing''
 API, so the converter cannot recognize the end of the string and cannot
 reset the state.
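
 The state reset can be sketched with a similar toy model. It is an
 illustration under strong assumptions, not the real converter: it
 supports only ASCII and the single character U+3042, drops everything
 else, and the names `Charset`, `classify`, `escapeTo`, `step`, `flush`
 and `encode` are hypothetical.

```haskell
import Data.Word (Word8)

-- | The character set currently selected in the output stream.
data Charset = Ascii | JisX0208
  deriving (Eq, Show)

-- | Character set and bytes for the few characters this toy model knows.
classify :: Char -> Maybe (Charset, [Word8])
classify '\x3042' = Just (JisX0208, [0x24, 0x22])
classify c
  | c < '\x80'    = Just (Ascii, [fromIntegral (fromEnum c)])
  | otherwise     = Nothing

-- | Escape sequence that switches the stream to the given character set.
escapeTo :: Charset -> [Word8]
escapeTo Ascii    = [0x1b, 0x28, 0x42]  -- ESC ( B
escapeTo JisX0208 = [0x1b, 0x24, 0x42]  -- ESC $ B

-- | Feed one character, switching character sets when needed.
step :: Charset -> Char -> (Charset, [Word8])
step cur c = case classify c of
  Just (cs, bytes)
    | cs == cur -> (cur, bytes)
    | otherwise -> (cs, escapeTo cs ++ bytes)
  Nothing       -> (cur, [])            -- toy model: drop unsupported input

-- | End-of-input flush: return to the initial (ASCII) state if necessary.
flush :: Charset -> [Word8]
flush Ascii = []
flush _     = escapeTo Ascii

-- | Encode a string, with or without the final flush.
encode :: Bool -> String -> [Word8]
encode doFlush = go Ascii
  where
    go s []       = if doFlush then flush s else []
    go s (c : cs) = let (s', out) = step s c in out ++ go s' cs

main :: IO ()
main = do
  print (encode False "\x3042")  -- [27,36,66,36,34]
  print (encode True  "\x3042")  -- [27,36,66,36,34,27,40,66]
```

 The first line corresponds to the truncated output reported in this
 case; the second ends with the `0x1b 0x28 0x42` reset sequence.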

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15553>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler