patch applied (cabal): Make UTF-8 decoding errors in .cabal files non-fatal

Thu Mar 27 13:55:33 EDT 2008

In message <20080327172026.GB10636 at soi.city.ac.uk> cabal-devel at haskell.org writes:
> On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
> > Can't we just reject them with the error message and ask people to fix the
> > latin-1 sequences and re-upload using proper UTF-8?
> 
> The problem is that there are packages there now with .cabal files
> assuming Latin-1.  Stopping more of them from getting in is fine, but
> we need to display the ones that are there correctly.

Parsing them is essential; displaying them correctly is a bonus.

> Hmm, after considering a few schemes it's probably simplest to introduce
> strict enforcement on upload and retroactively patch the existing Latin-1
> packages to UTF.  Naughty, but a one-off.

I'm quite happy for those to be fixed. The main point is that parsing the files
must not fail; the content of the affected fields would (or at least should)
then contain the Unicode replacement char (U+FFFD).
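
To make the intended behaviour concrete, here is a minimal sketch of a decoder
that never fails and substitutes U+FFFD for anything it cannot decode. It is
not the Cabal or utf8-string code; the names decodeLenient and replacementChar
are made up for illustration, and it assumes the input is available as a list
of bytes.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- U+FFFD REPLACEMENT CHARACTER, emitted in place of undecodable input.
replacementChar :: Char
replacementChar = '\xFFFD'

-- Hypothetical lenient decoder: always returns a String, never an error.
decodeLenient :: [Word8] -> String
decodeLenient [] = []
decodeLenient (b:bs)
  | b < 0x80              = chr (fromIntegral b) : decodeLenient bs
  | b >= 0xC0 && b < 0xE0 = go 1 0x80    (fromIntegral b .&. 0x1F) bs
  | b >= 0xE0 && b < 0xF0 = go 2 0x800   (fromIntegral b .&. 0x0F) bs
  | b >= 0xF0 && b < 0xF8 = go 3 0x10000 (fromIntegral b .&. 0x07) bs
  | otherwise             = replacementChar : decodeLenient bs
  where
    -- go n lo cp rest: n continuation bytes still expected, lo is the
    -- smallest code point allowed for this sequence length.
    go :: Int -> Int -> Int -> [Word8] -> String
    go 0 lo cp rest
      | valid lo cp = chr cp : decodeLenient rest
      | otherwise   = replacementChar : decodeLenient rest
    go n lo cp (c:rest)
      | c .&. 0xC0 == 0x80 =
          go (n - 1) lo ((cp `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) rest
    -- Missing or malformed continuation byte: emit one replacement char
    -- and resume decoding at the offending byte.
    go _ _ _ rest = replacementChar : decodeLenient rest

    -- Reject over-long forms, surrogates and values beyond U+10FFFF.
    valid :: Int -> Int -> Bool
    valid lo cp =
      cp >= lo && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
```

With this sketch, the Latin-1 bytes for "Jörg" (0x4A 0xF6 0x72 0x67) decode to
'J', U+FFFD, 'r', 'g', so the field still parses, while the proper UTF-8
bytes (0x4A 0xC3 0xB6 0x72 0x67) decode to "Jörg".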

> > You suggested previously that we should add a warning for the cases where an
> > isolated latin-1 char in someone's name turns out to be valid UTF-8 (but
> > encoding for an unexpected char). I think that's a good idea. Obviously that'd
> > want to be a non-fatal warning. Hmm, I now can't find the note where you made
> > that suggestion. Can you give more details on how that check would work exactly?
> 
> The common case is ASCII char, non-ASCII char, ASCII char.  That's not a
> valid UTF-8 sequence, but fromUTF is erroneously accepting it.  It needs
> to tighten up to keep these errors out.

Hmm. I'll replace the UTF-8 decoder with the one from the utf8-string package
(which is also BSD licensed).

> Incidentally, a UTF decoder is also supposed to reject non-minimal
> encodings, e.g. a 3-byte encoding for a Char that can be encoded in
> 2 bytes.  That's to force canonical encodings for security.

I believe the utf8-string version does that correctly. It detects over-long
encodings specifically and makes them an invalid char. As I understand it, that's
so that it generates a single replacement char for a non-minimal encoding rather
than several.
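
As a rough illustration of the over-long check (again a sketch under my own
assumptions, not the utf8-string source; minCodePoint, isOverlong and
decodeTwoByte are hypothetical names), the idea is to compare the decoded code
point against the smallest value that genuinely needs that many bytes, and to
turn the whole sequence into one replacement char if it falls short:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Smallest code point that genuinely needs an n-byte UTF-8 encoding.
minCodePoint :: Int -> Int
minCodePoint 2 = 0x80
minCodePoint 3 = 0x800
minCodePoint 4 = 0x10000
minCodePoint _ = 0

-- True when a code point was encoded in more bytes than necessary.
isOverlong :: Int -> Int -> Bool
isOverlong nBytes cp = cp < minCodePoint nBytes

-- Decode one two-byte sequence (lead 110xxxxx, continuation 10xxxxxx).
-- Assumes the caller has already checked those bit patterns.
decodeTwoByte :: Word8 -> Word8 -> Char
decodeTwoByte lead cont
  | isOverlong 2 cp = '\xFFFD'  -- one replacement char for the whole sequence
  | otherwise       = chr cp
  where
    cp = (fromIntegral (lead .&. 0x1F) `shiftL` 6) .|. fromIntegral (cont .&. 0x3F)
```

For example, 0xC0 0xAF is an over-long two-byte form of '/' (U+002F), so
decodeTwoByte 0xC0 0xAF gives a single U+FFFD, whereas decodeTwoByte 0xC3 0xB6
gives 'ö' (U+00F6).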

Duncan