patch applied (cabal): Make UTF-8 decoding errors in .cabal files non-fatal

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Thu Mar 27 12:39:17 EDT 2008


In message <20080327160333.GA10636 at soi.city.ac.uk> cabal-devel at haskell.org writes:
> On Thu, Mar 27, 2008 at 08:45:29AM -0700, Duncan Coutts wrote:
> > Wed Mar 26 20:17:40 PDT 2008  Duncan Coutts <duncan at haskell.org>
> >   * Make UTF-8 decoding errors in .cabal files non-fatal
> >   Previously we checked for invalid UTF-8 in the first phase of the
> >   parser, which splits the file up into nested sections and fields.
> >   This meant the whole parser fell over if there was invalid UTF-8
> >   anywhere in the file. Sadly there are already packages on hackage
> >   with invalid UTF-8 so we would fail when parsing the hackage index.
> >   The solution is to move the check into the parsing of the individual
> >   fields and make it a warning, not an error. We most typically get
> >   invalid UTF-8 in free text fields like author name, copyright,
> >   description etc so this should work out ok usually.
> >   We now get pretty decent error messages, like:
> >     Warning: hsx.cabal:5: Invalid UTF-8 text in the 'author' field.
> >   The warning type is now structured so that hackage will be able to
> >   distinguish general non-fatal warnings from UTF-8 decoding problems
> >   which really should be fatal errors for package uploads. 
> 
> These invalid UTF-8 strings are usually valid Latin-1 in people's names,
> which the web interface needs to show.

Can't we just reject them with the error message and ask people to fix the
Latin-1 sequences and re-upload using proper UTF-8?

Is the web interface sending UTF-8 now? I don't know if we've done an end-to-end
test yet. If we have then we should close ticket #145:
http://hackage.haskell.org/trac/hackage/ticket/145

> So would it be possible to give the warning, but either to treat bytes
> that comprise an encoding error as Latin-1 Chars, or to reparse a string
> (or file) with UTF errors as a Latin-1 string?  In almost all cases, the
> problematic sequence is a single non-ASCII byte surrounded by ASCII bytes.

I really don't think we should continue to allow mixed/undefined encodings. We
should be strict about enforcing UTF-8, but of course we should provide helpful
error messages to make it easy for people to make the corrections.
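
As a rough illustration of the kind of strict check and helpful message I
mean, here is a minimal sketch. It is in Python purely for brevity; the
function name and the extra byte/offset detail are hypothetical and are not
what the Cabal parser actually does. It validates one field's raw bytes
strictly and reports the file, line and field in the style of the warning
quoted above:

```python
def check_field_utf8(filename, line_no, field_name, raw_value):
    """Return None if raw_value (bytes) is well-formed UTF-8,
    otherwise a human-readable message locating the problem.

    Hypothetical helper for illustration only.
    """
    try:
        # bytes.decode is strict by default: any malformed sequence raises.
        raw_value.decode("utf-8")
    except UnicodeDecodeError as err:
        bad_byte = raw_value[err.start]  # first offending byte
        return (
            f"{filename}:{line_no}: Invalid UTF-8 text in the "
            f"'{field_name}' field (byte 0x{bad_byte:02x} "
            f"at offset {err.start})."
        )
    return None


# A Latin-1 encoded o-umlaut (0xF6) is not valid UTF-8 on its own:
print(check_field_utf8("hsx.cabal", 5, "author", b"J\xf6rg"))
# The same name encoded as UTF-8 passes:
print(check_field_utf8("hsx.cabal", 5, "author", "J\u00f6rg".encode("utf-8")))
```

Pointing at the offending byte should make it easy for the uploader to find
the character that needs re-encoding.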

So I think hackage should reject them with a suitable error message. I can send
a patch. On that topic in fact, I think all parser warnings should be fatal
errors as far as hackage is concerned. I'll send a separate patch for that. The
more permissive hackage is, the more legacy of weird corner cases we accumulate.
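
To make the two upload policies concrete, here is a minimal sketch in Python
(chosen only for brevity). The names ParseWarning, Utf8Warning and
fatal_for_upload are hypothetical; this is not the actual Cabal warning type
or hackage code. It shows how a structured warning type lets the upload side
reject only UTF-8 problems, or every parser warning:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseWarning:
    """A general, normally non-fatal parser warning (hypothetical type)."""
    message: str


@dataclass(frozen=True)
class Utf8Warning(ParseWarning):
    """A UTF-8 decoding problem, tagged with the line it occurred on."""
    line_no: int


def fatal_for_upload(warnings, all_fatal=False):
    """Return the warnings that should block a package upload.

    all_fatal=False: only UTF-8 decoding problems are fatal.
    all_fatal=True:  every parser warning is fatal.
    """
    if all_fatal:
        return list(warnings)
    return [w for w in warnings if isinstance(w, Utf8Warning)]


warnings = [
    ParseWarning("Unknown field 'foo'"),
    Utf8Warning("Invalid UTF-8 text in the 'author' field.", 5),
]
print(fatal_for_upload(warnings))                  # UTF-8 problems only
print(fatal_for_upload(warnings, all_fatal=True))  # everything
```

The command line tools would keep printing all of these as warnings; only
the upload check would consult a function like this.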

You suggested previously that we should add a warning for the cases where an
isolated Latin-1 char in someone's name turns out to be valid UTF-8 (but
encoding an unexpected char). I think that's a good idea. Obviously that'd
want to be a non-fatal warning. Hmm, I now can't find the note where you made
that suggestion. Can you give more details on how that check would work exactly?

Duncan


