[Python-Dev] Python-3.0, unicode, and os.environ
Stephen J. Turnbull
stephen at xemacs.org
Mon Dec 8 09:57:19 CET 2008
Glenn Linderman writes:
> On approximately 12/7/2008 8:13 PM, came the following characters from
> I have no problem with having strict validation available. But
> doesn't validation take significantly longer than decoding?
I think you're thinking of XML, where validation can take significant
resources over and above syntax checking. For Unicode, not unless
you're seriously CPU-bound. Unicode validation is a matter of a few
range checks and a couple of flags to handle things like lone
surrogates.
In the case of "excess length" in UTF-8, you can actually often do it
in *zero* time if you use a table to analyze the leading byte (eg,
0xC0 and 0xC1 are invalid UTF-8 leading bytes because they would
necessarily decode to U+0000 to U+007F, ie, the ASCII range), because
you have to make a check for 0xFE and 0xFF anyway, which can't be
UTF-8 leading bytes. (I'm not sure this generalizes to longer UTF-8
sequences, but it would reject the use of 0xC0 0xAF to sneak in a "/"
in zero time!)
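
A rough sketch of the table idea, for illustration only. This is not
how CPython's decoder is written, and build_lead_table, LEAD_TABLE and
lead_byte_ok are made-up names. The table maps each byte to the
sequence length it announces as a leading byte, with 0 meaning "can
never lead". Since 0xFE and 0xFF need a 0 entry anyway, giving 0xC0
and 0xC1 the same entry adds no extra test at decode time.

```python
def build_lead_table():
    """Hypothetical helper: sequence length announced by each lead byte."""
    table = [0] * 256                # 0 means "never a valid leading byte"
    for b in range(0x00, 0x80):
        table[b] = 1                 # ASCII, one-byte sequences
    # 0x80..0xBF are continuation bytes: left at 0.
    # 0xC0 and 0xC1 could only start overlong forms of U+0000..U+007F:
    # left at 0.
    for b in range(0xC2, 0xE0):
        table[b] = 2
    for b in range(0xE0, 0xF0):
        table[b] = 3
    for b in range(0xF0, 0xF5):
        table[b] = 4
    # 0xF5..0xFF, including 0xFE and 0xFF: left at 0.
    return table


LEAD_TABLE = build_lead_table()


def lead_byte_ok(byte):
    """One table lookup covers both the 0xFE/0xFF and 0xC0/0xC1 cases."""
    return LEAD_TABLE[byte] != 0


def strict_decode_fails(data):
    """Check what Python's own strict UTF-8 codec does with the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

With this table, lead_byte_ok(0xC0) is false, so 0xC0 0xAF is rejected
at the first byte. The table alone does not catch overlong three- and
four-byte forms (for example 0xE0 followed by 0x80..0x9F); those need
a range check on the second byte.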
> So I think it should be logically decoupled... do validation
> when/where it is needed for security reasons,
Security is an important application, but the real issue is that
naively decoded text is a bomb with a sensitive impact fuse. Pass it
around long enough, and it will blow up eventually.
The whole point of the fairly complex rules about Unicode formats and
the *requirement* that broken coding be a fatal error *in a
conforming Unicode process* is intended to ensure that Unicode
exceptions only ever occur on input (or memory corruption and the
like, which is actually a form of I/O, of course). That's where
efficiency comes from.
I think Python 3 should aspire to (eventually) be a conforming process
by default, with lax behavior an option.
> and allow internal [de]coding to be faster.
"Internal decoding" is (or should be) an oxymoron. Why would your
software be passing around text in any format other than internal? So
decoding will happen (a) on I/O, which is itself almost certainly
slower than making a few checks for Unicode hygiene, or (b) on receipt
of data from other software whose sanitation you shouldn't trust
more than you trust the Internet.
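
To make "decode once, at the boundary" concrete, here is a minimal
sketch. The function names are invented for the example, and UTF-8 is
an assumed input encoding. Bytes are decoded strictly where they enter
the program, and everything downstream handles only str.

```python
def decode_at_boundary(raw, encoding="utf-8"):
    """Hypothetical helper: strict decoding where bytes enter the program.

    Malformed input raises UnicodeDecodeError here, at input time,
    rather than somewhere deep inside the application later on.
    """
    return raw.decode(encoding, errors="strict")


def count_path_components(text):
    """Internal code: takes str only, never re-decodes or re-validates."""
    return len([part for part in text.split("/") if part])


def try_decode(raw):
    """Return the decoded text, or None if the input was rejected."""
    try:
        return decode_at_boundary(raw)
    except UnicodeDecodeError:
        return None
```

For example, count_path_components(decode_at_boundary(b"/usr/local/bin"))
gives 3, while try_decode(b"\xc0\xaf") gives None because the overlong
"/" never gets past the input boundary.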
Encoding isn't a problem, AFAICS.
> You didn't address the issue that if the decoding to a canonical
> form is done first, many of the insecurities just go away, so why
> throw errors?
Because as long as you're decoding anyway, it costs no more to do it
right, except in rare cases. Why do you think Python should aspire to
"quick and dirty" in a context where dirty is known to be unhealthy,
and there is no known need for speed? Why impose "doing it right" on
the application programmer when there's a well-defined spec for that
that we could implement in the standard library?
It's the errors themselves that people are objecting to. See Guido's
posts for concisely stated arguments for a "don't ask, don't tell"
policy toward Unicode breakage. I agree that Python should implement
that policy as an option, but I think that the user should have to
request it either with a runtime option or (in the case of user == app
programmer) by deliberately specifying a lax codec. The default
*Unicode* codecs should definitely aspire to full Unicode conformance
within their sphere of responsibility.
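
A small sketch of "strict by default, lax only on request", written
with the errors= argument that bytes.decode accepts in present-day
Python. decode_env_value and its lax flag are hypothetical, not a
standard-library API, and UTF-8 is assumed as the encoding.

```python
def decode_env_value(raw, lax=False):
    """Hypothetical helper: strict unless the caller opts into lax.

    Strict mode raises UnicodeDecodeError on malformed input. Lax mode
    substitutes U+FFFD for each malformed sequence, so the caller never
    sees an error but also loses the original bytes.
    """
    handler = "replace" if lax else "strict"
    return raw.decode("utf-8", errors=handler)


def strict_rejects(raw):
    """True if the default (strict) behavior refuses the input."""
    try:
        decode_env_value(raw)
    except UnicodeDecodeError:
        return True
    return False
```

Here strict_rejects(b"caf\xe9") is true (a Latin-1 byte in what should
be UTF-8), while decode_env_value(b"caf\xe9", lax=True) returns
"caf\ufffd". The lax result is something the caller had to ask for
explicitly.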
A character outside the repertoire that the app can handle is not
a "Unicode exception", unless the reason the app can't handle it is
that the Unicode handler blew up.