[Python-ideas] Py3k invalid unicode idea

Stephen J. Turnbull stephen at xemacs.org
Fri Oct 10 04:04:56 CEST 2008


Terry Reedy writes:

 > Would it make any sense to have a Filename subclass

Sure ... but as glyph has been explaining, that should really be
generalized to a representation of filesystem paths, and that is an
unsolved problem at the present time.

 > or a BadFilename subclass or more generally a PUAcode subclass for
 > any unicode generated by the core that uses the PUA?

IMO, this doesn't work, because either they act like strings when you
access them naively, and you end up with corrupt Unicode loose in the
wider system, or they throw exceptions if they aren't first placated
with appropriate rituals -- but those exceptions and rituals are what
we wanted to avoid handling in the first place!

As I see it, this is not a technical problem!  It's a social problem.
It's not that we have no good ways to handle Unicode exceptions: we
have several.  It's not that we have no perfect and universally
acceptable way to handle them: as usual, that's way too much to ask.
The problem that we face is that there are several good ways to handle
the decoding exceptions, and different users/applications will
*strongly* prefer different ones.
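
As a rough illustration of "several good ways", here is a sketch using
the error handlers that Python's bytes.decode() already accepts.  The
sample filename and the choice of these three handlers are my
assumptions for the example, not a list of the proposals made in this
thread.

```python
# Assumed sample: a Latin-1 encoded name, which is invalid as UTF-8.
raw = b"caf\xe9.txt"

# "strict": refuse to guess and raise an exception.
try:
    raw.decode("utf-8", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised"

# "replace": substitute U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", "replace")

# "ignore": silently drop the undecodable byte.
ignored = raw.decode("utf-8", "ignore")
```

Each result is defensible for some application and wrong for another:
the strict handler loses the name, "replace" keeps a visible marker but
not the original byte, and "ignore" yields a plausible-looking name
that does not exist on disk.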

In particular, if we provide one and make it the default, almost all
programmers will do the easy thing, so that most code will not be
prepared for applications that do want a different handler.  Code that
does expect to receive uncorrupted Unicode will have to do extra
checking, etc.

I think that the best thing to do would be to improve the exception
handling in codecs and library functions like os.listdir() -- IMO the
problem that one exception can cost you an entire listing is a bug in
os.listdir().
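
A minimal sketch of what I mean by not losing the whole listing,
assuming the raw names are available as bytes.  The helper
decode_names() is hypothetical, made up for this example; it is not an
existing os function and not a proposed signature.

```python
def decode_names(raw_names, encoding="utf-8"):
    """Decode each raw filename separately (hypothetical helper).

    A name that fails to decode is reported alongside its exception
    instead of aborting the whole listing.
    """
    good = []  # names that decoded cleanly
    bad = []   # (raw bytes, exception) pairs for names that did not
    for raw in raw_names:
        try:
            good.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            bad.append((raw, exc))
    return good, bad


# Assumed sample data: one Latin-1 name among valid UTF-8 names.
good, bad = decode_names([b"report.txt", b"caf\xe9.txt", b"notes.txt"])
```

The caller still has to decide what to do with the entries in bad, but
that decision is now made per name, and the two good names are not
lost along with the bad one.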



