[Python-Dev] Python-3.0, unicode, and os.environ

Mon Dec 8 10:21:32 CET 2008

Glenn Linderman writes:

 > "significantly" seems to be the only word at question; it seems that 
 > there are a fair number of validation checks that could be performed; 
 > the numeric part of UTF-8 decoding is just a sequence of shifts, masks, 
 > and ORs, so can be coded pretty tightly in C or assembly language.
 > 
 > Anything extra would be slower; how much slower is hard to predict prior 
 > to the implementation.

Not much, see my previous response.
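
To make the "shifts, masks, and ORs" point concrete, here is a minimal
sketch in Python rather than C.  The function name decode_one and its
validate flag are made up for illustration; this is not how CPython's
codec is written.  It only shows that the extra checks amount to a few
comparisons per sequence.

```python
def decode_one(data, i=0, validate=True):
    """Decode one UTF-8 sequence at data[i]; return (code_point, next_index).

    Hypothetical helper for illustration, not CPython's implementation.
    """
    b0 = data[i]
    if b0 < 0x80:                      # single-byte (ASCII) case
        return b0, i + 1
    if 0xC0 <= b0 < 0xE0:              # lead byte of a 2-byte sequence
        n, cp, minimum = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:            # lead byte of a 3-byte sequence
        n, cp, minimum = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:            # lead byte of a 4-byte sequence
        n, cp, minimum = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if i + n >= len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + 1 + n]:
        # Validation: each continuation byte must look like 10xxxxxx.
        if validate and (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)    # the numeric part: shift, mask, OR
    if validate:
        if cp < minimum:
            raise ValueError("overlong encoding")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded surrogate")
        if cp > 0x10FFFF:
            raise ValueError("beyond the Unicode range")
    return cp, i + 1 + n
```

With validate=False the overlong sequence b"\xc0\xaf" decodes quietly
to U+002F ("/"); with validate=True it is rejected.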

 > This also seems to be supported by Stephen's comment "That's a lot
 > to ask, as it turns out."

Not what I meant.  Inefficiency is not an objection to checking for
validity at the level a codec can handle.  The objection is that "we
don't want *any* exceptions thrown that we didn't explicitly ask for",
and adding validation certainly will violate that.
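
For illustration, a small sketch of the two behaviours in question,
using only bytes.decode and its standard errors argument.  The sample
bytes and the two wrapper names are made up for the example.

```python
raw = b"caf\xe9"  # Latin-1 bytes; not a valid UTF-8 sequence


def decode_default(data):
    # errors defaults to "strict": invalid input raises UnicodeDecodeError,
    # i.e. an exception the caller did not explicitly ask for.
    return data.decode("utf-8")


def decode_replacing(data):
    # The caller explicitly opts out of exceptions; invalid bytes become
    # U+FFFD REPLACEMENT CHARACTER instead.
    return data.decode("utf-8", errors="replace")


try:
    decode_default(raw)
except UnicodeDecodeError as exc:
    print("strict decode raised:", exc.reason)

print(repr(decode_replacing(raw)))
```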

 > So I don't understand how this is responsive to the "decoding removes 
 > many insecurities" issue?

Because if the codec doesn't validate, you have to recheck every time
data crosses from Python into your code.  To the extent that Python
codecs promise validation and keep that promise, internal code *never*
has to make those checks.
That is a significant savings in programmer effort, because auditing a
large body of code for *any* I/O from Python is going to be costly.

 > So when you examine a library for potential use, you have documentation 
 > or code to help you set your expectations about what it does, and 
 > whether or not it may have vulnerabilities, and whether or not those 
 > vulnerabilities are likely or unlikely, whether you can reduce the 
 > likelihood or prevent the vulnerabilities by wrapping the API, etc.  And 
 > so you choose to use the library, or not.

Python is precisely such a component that people will choose to use,
or not, based on whether they can expect that when Python hands them a
Unicode object freshly input from the outside world, it won't contain
lone surrogates, or invalid UTF-8 sequences that got through a
3rd-party spam filter, or whatever.
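
As a rough sketch of the check a consumer would otherwise have to write
for itself: the helper names below are hypothetical, and the code
assumes a recent CPython, where a str is a sequence of code points, so
any surrogate code point found in it is a lone one.

```python
def has_lone_surrogate(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def require_clean(text):
    """Hypothetical boundary check: reject strings carrying surrogates."""
    if has_lone_surrogate(text):
        raise ValueError("string contains a lone surrogate")
    return text


# b"\xed\xa0\x80" is the UTF-8-style encoding of U+D800.  The
# "surrogatepass" error handler lets it through the decoder, which is
# exactly the kind of input the check above is meant to catch.
smuggled = b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass")
print(has_lone_surrogate(smuggled))
```

If the codec already refuses such input by default, none of the code
behind it needs to call anything like require_clean.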

 > This whole discussion about libraries seems somewhat irrelevant to the 
 > question at hand,

No, it's the *only* point that matters.  IMO, speed is not relevant
here.  The question is whether throwing a Unicode exception on invalid
encoding by default generally does more good than harm.  Guido seems
to think "not!", which gives me pause.<wink>  I still disagree, though.