implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Tue Apr 25 17:46:11 EDT 2006

FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation in
the HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.:

http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXml/Unicode.hs

(Please ignore me if I miss your point.)

#g
--

Bulat Ziganshin wrote:
> Hello all
> 
> this letter describes why i think that using a hand-made (de)coder to
> support UTF-8 encoded files is better than using iconv. for readers
> who don't know it, iconv is a widespread C library that performs
> buffer-to-buffer conversion between text encodings (utf-8, utf-16,
> latin-1, ucs-2, ucs-4 and more). the hand-made (en)coder implemented
> by me is just a "converter", i.e. a higher-order function, between the
> getByte/putByte and getChar/putChar operations. so it can be used in
> any monad and for any purpose, not only for text I/O
> 
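
To make the "converter as a higher-order function" idea concrete, here is a minimal sketch of what such a UTF-8 (de)coder could look like. It is not the code from "Data\CharEncoding.hs"; the names getCharUtf8, putCharUtf8, ByteReader and nextByte are hypothetical, and the decoder assumes well-formed input (no validation of continuation bytes, overlong forms or surrogates).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical sketch: decode one UTF-8 character from any monadic
-- byte source. Assumes well-formed input; nothing is validated.
getCharUtf8 :: Monad m => m Word8 -> m Char
getCharUtf8 getByte = do
  b0 <- getByte
  let lead = fromIntegral b0 :: Int
  if lead < 0x80 then return (chr lead)            -- 1-byte (ASCII)
  else if lead < 0xE0 then more 1 (lead .&. 0x1F)  -- 2-byte sequence
  else if lead < 0xF0 then more 2 (lead .&. 0x0F)  -- 3-byte sequence
  else more 3 (lead .&. 0x07)                      -- 4-byte sequence
  where
    -- Fold n continuation bytes (6 payload bits each) into acc.
    more n acc
      | n <= (0 :: Int) = return (chr acc)
      | otherwise = do
          b <- getByte
          more (n - 1) ((acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F))

-- Hypothetical sketch: encode one character through any monadic byte sink.
putCharUtf8 :: Monad m => (Word8 -> m ()) -> Char -> m ()
putCharUtf8 putByte c
  | n < 0x80    = putByte (fromIntegral n)
  | n < 0x800   = mapM_ putByte [lead 0xC0 6, cont 0]
  | n < 0x10000 = mapM_ putByte [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = mapM_ putByte [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    cont s     = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- A tiny hand-rolled state monad over a list of bytes, only to show
-- that the decoder also runs outside IO and ST.
newtype ByteReader a = ByteReader { runByteReader :: [Word8] -> (a, [Word8]) }

instance Functor ByteReader where
  fmap f (ByteReader g) = ByteReader $ \s -> let (a, s') = g s in (f a, s')

instance Applicative ByteReader where
  pure a = ByteReader $ \s -> (a, s)
  ByteReader f <*> ByteReader g = ByteReader $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad ByteReader where
  ByteReader g >>= k = ByteReader $ \s ->
    let (a, s') = g s in runByteReader (k a) s'

-- Take the next byte from the list; fails on empty input.
nextByte :: ByteReader Word8
nextByte = ByteReader $ \s -> case s of
  (b:bs) -> (b, bs)
  []     -> error "nextByte: end of input"
```

For example, runByteReader (getCharUtf8 nextByte) [0xD0, 0x96] decodes U+0416 purely, and the same getCharUtf8 could be handed an IO action that reads from a Handle. The encoder can likewise be run with the pair monad from base as a byte collector: fst (putCharUtf8 (\b -> ([b], ())) c).
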
> one can find an example of a library that uses iconv in the "System\IO\Text.hs"
> module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of
> a hand-made encoder in the module "Data\CharEncoding.hs",
> with its usage in "System\Stream\Transformer\CharEncoding.hs",
> from http://freearc.narod.ru/Streams.tar.gz
> 
> i crossposted this letter to Marcin and Simon because you have
> discussed this question with me, and to Einar because he once asked
> me about one specific feature in this area.
> 
> 
> why iconv is better:
> 
> 1) it's lightning fast, adding virtually zero speed overhead
> 2) it's robust
> 3) it contains already implemented and debugged algorithms for all
> the encodings we are likely to encounter
> 4) it has highly developed error processing facilities
> (i mean signalling errors in input data and/or masking them)
> 
> why hand-made conversion is better:
> 
> 1) i don't know whether iconv will be available on every Hugs and GHC
> installation
> 
> 2) Einar once asked me about changing the encoding on the
> fly, which is needed for some HTML processing. it is also possible that
> some program will need to intersperse text I/O with
> buffer/array/byte/bits I/O. these are the sort of things that are
> absolutely impossible with iconv
> 
> 3) my library supports Streams that work in ANY monad (not only IO, ST
> and their derivatives). it's impossible to implement iconv conversion
> for such stream types
> 
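
Point 2 can be illustrated with a small sketch, under the assumption that a decoder is just a function from a byte-reading action to a char-reading action. The record layout below (a tag byte, a length byte, then that many characters) and the names Decoder, latin1, utf16beBmp, byteSource and readTagged are all hypothetical, invented for this example; utf16beBmp handles BMP code units only and ignores surrogate pairs.

```haskell
import Control.Monad (replicateM)
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- A decoder turns "read one byte" into "read one char".
type Decoder m = m Word8 -> m Char

-- Latin-1: each byte is one code point.
latin1 :: Monad m => Decoder m
latin1 getByte = do
  b <- getByte
  return (chr (fromIntegral b))

-- UTF-16 big-endian, BMP only (surrogate pairs are not combined).
utf16beBmp :: Monad m => Decoder m
utf16beBmp getByte = do
  hi <- getByte
  lo <- getByte
  return (chr (fromIntegral hi * 256 + fromIntegral lo))

-- Hypothetical record format: tag byte (1 = UTF-16BE, otherwise Latin-1),
-- then a length byte, then that many characters. Raw byte reads and
-- character reads share one byte source, and the encoding is chosen
-- per record.
readTagged :: Monad m => m Word8 -> m String
readTagged getByte = do
  tag <- getByte                      -- plain byte I/O
  len <- getByte                      -- plain byte I/O
  let decode = if tag == 1 then utf16beBmp else latin1
  replicateM (fromIntegral len) (decode getByte)   -- text I/O

-- An in-memory byte source for the demo; a real program would read
-- from a file or socket instead.
byteSource :: [Word8] -> IO (IO Word8)
byteSource bytes = do
  ref <- newIORef bytes
  return $ do
    remaining <- readIORef ref
    case remaining of
      (b:bs) -> writeIORef ref bs >> return b
      []     -> ioError (userError "byteSource: end of input")
```

Calling readTagged twice on the same source reads a Latin-1 record and then a UTF-16BE record without any buffer hand-over between encodings, because no decoder holds converted data of its own.
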
> as you can see, while the last arguments concern very specific
> situations, these situations absolutely can't be handled by iconv, so
> we need to implement hand-made conversions anyway. on the other hand,
> iconv's strong points are not of fundamental importance - the speed of
> hand-made routines will be sufficient, several MB/s; all the needed
> encodings can be implemented and debugged sooner or later; only the
> handling of errors in input data is a weak point of the current design
> itself
> 
> moreover, there are implementation issues that make me more enthusiastic
> about the hand-made solution. it is simply already implemented and really
> works. the implementation of CharEncoding for streams is in the module
> "System\Stream\Transformer\CharEncoding.hs", which is trivial.
> the implementation of the different encoders in "Data\CharEncoding.hs"
> is slightly more complex, but these routines are also used in
> "instance Binary String", i.e. to serialize strings. also, i think
> that the "Data\CharEncoding.hs" module should be a part of the standard
> Haskell library, so the implementation of the CharEncoding stream
> transformer is almost "free"
> 
> on the other hand, the implementation of text encoding in the "new I/O"
> library is about 1000 lines long. while i don't need to copy them all,
> using iconv will anyway be much more complex than using hand-made routines.
> this includes the complexity of interacting with iconv itself and the
> complexity of implementing various I/O operations over a buffer that
> contains 4-byte characters. i have already implemented 3 buffering
> transformers, and adding one more buffering scheme is the last thing i
> want to do. on the contrary, i'm now searching for ways to avoid
> repeating code by joining them all into one. it's very boring to have
> 3 or 4 similar things and replicate every change to them all
> 
> at the same time, the library design is open and it's entirely
> possible to have two alternative char encoding transformers. anyone
> can develop additional transformers even without interaction with me -
> in this case, they just need to implement the vGetChar/bPutChar operations
> via the vGetBuf/vPutBuf ones. i just propose to leave things as
> they are, and move on to implementing an iconv-based transformer only
> when we are actually bothered by the restrictions of the current approach
>   
> 

-- 
Graham Klyne
For email:
http://www.ninebynine.org/#Contact