[Haskell-i18n] SourceForge Project Active

Glynn Clements glynn.clements@virgin.net
Thu, 5 Sep 2002 16:37:14 +0100


Ketil Z. Malde wrote:

> > 3. The basic decoder interface shouldn't attempt to recover from
> > errors. Rather, it should return the list of complete characters, the
> > list of remaining octets, and the final state. Any error recovery
> > should be an optional add-on.
> 
> Could error handling be passed as a parameter to the encoder, perhaps?
> E.g. if I'm not really interested in debugging the code, just
> extracting what's possible, I could pass an error handler that tries
> to skip errors and keep going, without having to pollute my higher
> level code with it?

I would prefer not to see the base decoders cluttered with error
recovery functionality.

IMHO, a better alternative would be to provide functions to generate
fault-tolerant decoders from an existing decoder, e.g. by repeatedly
calling the underlying decoder until all octets have been consumed,
handling errors in either a predefined or user-defined manner.
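
As a rough illustration of that wrapper approach, here is a minimal
sketch. All of the names (Result, Decoder, Recovery, tolerant,
replaceOne, asciiStrict) are hypothetical and chosen for this example
only; they are not part of any agreed interface. The strict decoder
returns the complete characters, the remaining octets and the final
state, and the wrapper calls it repeatedly, handing each failure to a
caller-supplied policy.

```haskell
import Data.Word (Word8)

-- Hypothetical result of one run of a strict decoder: the characters
-- decoded so far, the octets it did not consume, and the final state.
data Result s = Result
  { decoded    :: [Char]
  , remaining  :: [Word8]
  , finalState :: s
  }

-- Hypothetical strict decoder: stops at the first octet it cannot decode.
type Decoder s = s -> [Word8] -> Result s

-- Recovery policy: given the state and the octets at the point of
-- failure, return substitute characters, the octets to resume from,
-- and the state to resume in. It must consume at least one octet,
-- otherwise the wrapper below does not terminate.
type Recovery s = s -> [Word8] -> ([Char], [Word8], s)

-- Build a fault-tolerant decoder from a strict one by calling the
-- strict decoder repeatedly until all octets have been consumed.
tolerant :: Recovery s -> Decoder s -> s -> [Word8] -> [Char]
tolerant recover decode = go
  where
    go st input =
      case decode st input of
        Result cs [] _     -> cs
        Result cs rest st' ->
          let (subst, rest', st'') = recover st' rest
          in  cs ++ subst ++ go st'' rest'

-- One possible predefined policy: drop a single octet and emit U+FFFD.
replaceOne :: Recovery s
replaceOne st (_ : rest) = ("\xFFFD", rest, st)
replaceOne st []         = ([], [], st)

-- Toy strict decoder for 7-bit ASCII, used only to exercise the wrapper.
asciiStrict :: Decoder ()
asciiStrict st input =
  let (ok, rest) = span (< 0x80) input
  in  Result (map (toEnum . fromIntegral) ok) rest st
```

With these definitions, "tolerant replaceOne asciiStrict ()" is a
tolerant ASCII decoder, while asciiStrict itself stays free of any
recovery logic.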

Individual encodings could also provide custom fault-tolerant
decoders; in some cases, it may be desirable to have a choice of
several alternatives.

The nature of the problem differs substantially between different
types of encoding.

ISO-8859-* is trivial; one octet corresponds to one character. The
only possible error is an undefined codepoint (e.g. 0x80-0x9F); there
are no synchronisation issues.
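
A minimal sketch of such a decoder for ISO-8859-1 follows. The name
latin1Strict is hypothetical, and treating 0x80-0x9F as undefined
simply follows the example given above. The decoder stops at the
first undefined octet and returns the unconsumed input; because every
octet is a character boundary, a caller can resume at the very next
octet.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Decode octets as ISO-8859-1 up to the first octet in 0x80-0x9F.
-- Returns the decoded characters and the octets not consumed
-- (starting with the offending octet, if there was one).
latin1Strict :: [Word8] -> ([Char], [Word8])
latin1Strict input =
  let (ok, rest) = break undefinedOctet input
  in  (map (chr . fromIntegral) ok, rest)
  where
    -- Assumption for this sketch: 0x80-0x9F are the undefined positions.
    undefinedOctet o = o >= 0x80 && o <= 0x9F
```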

UTF-8 is almost as simple; character boundaries are unambiguous, even
for invalid streams. However, there exists some variation between
existing decoders. Over-long sequences (e.g. using a two-byte sequence
to represent a 7-bit character) are technically invalid, but many
decoders allow this; some applications even encourage it (e.g. using
0xC0,0x80 to represent an "embedded" NUL).
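
The difference between the two behaviours can be sketched for the
two-octet case as follows. The names decode2 and decode2Strict are
hypothetical; this is only an illustration of the over-long check,
not a complete UTF-8 decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- Decode a two-octet UTF-8 sequence (110xxxxx 10xxxxxx) to a code
-- point, without checking that two octets were actually necessary.
decode2 :: Word8 -> Word8 -> Maybe Int
decode2 b0 b1
  | b0 .&. 0xE0 == 0xC0 && b1 .&. 0xC0 == 0x80 =
      Just (((fromIntegral b0 .&. 0x1F) `shiftL` 6)
            .|. (fromIntegral b1 .&. 0x3F))
  | otherwise = Nothing

-- As above, but reject over-long forms: any code point below 0x80
-- fits in a single octet, so a two-octet encoding of it is invalid.
decode2Strict :: Word8 -> Word8 -> Maybe Int
decode2Strict b0 b1 =
  case decode2 b0 b1 of
    Just cp | cp >= 0x80 -> Just cp
    _                    -> Nothing
```

Here decode2 0xC0 0x80 yields Just 0 (the "embedded" NUL), whereas
decode2Strict 0xC0 0x80 yields Nothing.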

Other encodings may be more problematic, and error recovery may
involve some guesswork. This can be helped by knowledge of the
relative likelihood of certain classes of error and/or the nature of
the text (e.g. language).

My main concern is that less knowledgeable users don't end up being
"steered" into using dubious semantics (e.g. fault tolerance,
especially when involving somewhat arbitrary heuristics) by way of
using the "default" interface.

Of particular concern are the potential security implications. The
most obvious[1] example is the use of invalid or ambiguous encoded
forms to circumvent access controls or input validation.

[1] Obvious to regular readers of BugTraq, at least; this specific
issue has been identified as a security problem in a wide range of
products.
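
To make the risk concrete, here is a hedged sketch of how a lenient
decoder can defeat a check made on the raw octets. Both lenientDecode
and looksSafe are hypothetical names invented for this example; the
decoder handles only one- and two-octet sequences.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical lenient decoder: handles one- and two-octet sequences
-- only, accepts over-long two-octet forms, and silently drops any
-- other octet.
lenientDecode :: [Word8] -> String
lenientDecode (b0 : b1 : rest)
  | b0 .&. 0xE0 == 0xC0 && b1 .&. 0xC0 == 0x80 =
      chr (((fromIntegral b0 .&. 0x1F) `shiftL` 6)
           .|. (fromIntegral b1 .&. 0x3F))
        : lenientDecode rest
lenientDecode (b : rest)
  | b < 0x80  = chr (fromIntegral b) : lenientDecode rest
  | otherwise = lenientDecode rest
lenientDecode [] = []

-- Hypothetical validation applied to the raw octets before decoding:
-- reject any input containing '/' (0x2F).
looksSafe :: [Word8] -> Bool
looksSafe = notElem 0x2F
```

The octets 0x2E,0x2E,0xC0,0xAF,0x78 contain no 0x2F, so looksSafe
accepts them, yet lenientDecode turns them into "../x". A strict
decoder would have rejected 0xC0,0xAF as an over-long form.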

Basically, my view is that if the user is required to explicitly
choose fault tolerance, there's more chance that they will consider
some of the issues involved than if some form of fault tolerance is
"bundled".

-- 
Glynn Clements <glynn.clements@virgin.net>