Re: [Python-Dev] Python-3.0, unicode, and os.environ

On Sun, Dec 7, 2008 at 2:07 AM, Hagen Fürstenau hfuerstenau@gmx.net wrote:
If the Unicode APIs only have correct Unicode, sure. If not, you'll get errors translating to UTF-8 (and the byte APIs are supposed to pass bad names through unaltered). Kinda ironic, no?
As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows.
Python is broken and needs to be fixed.
http://bugs.python.org/issue3672 http://bugs.python.org/issue3297

As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows.
Python is broken and needs to be fixed.
http://bugs.python.org/issue3672 http://bugs.python.org/issue3297
But the question of whether Python should care about lone surrogates or not is at best tangential to the issue at hand. If you have lone surrogates in the Unicode API (and didn't raise an exception on the way getting there), then the sensible thing is to encode them into lone UTF-8 surrogates. Even if you wanted to prevent lone surrogates, encoding to UTF-8 for the binary API would not be the place to enforce it.
- Hagen
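
(A minimal sketch of the point at issue, using the semantics of later CPython releases, where the stricter behaviour eventually won: the plain UTF-8 codec refuses a lone surrogate, and passing one through has to be requested by name.)

    # Lone surrogates under a strict UTF-8 codec (later CPython semantics);
    # at the time of this thread the codec let them through silently.
    s = "\ud800"                    # a lone surrogate: not a valid scalar value

    try:
        s.encode("utf-8")           # strict codec refuses it
    except UnicodeEncodeError as e:
        print("strict utf-8 refused it:", e.reason)

    raw = s.encode("utf-8", errors="surrogatepass")    # explicit opt-in
    print(raw)                                         # b'\xed\xa0\x80'
    print(raw.decode("utf-8", errors="surrogatepass") == s)   # True, round-trips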

On Sun, Dec 7, 2008 at 2:35 AM, Hagen Fürstenau hfuerstenau@gmx.net wrote:
As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows.
Python is broken and needs to be fixed.
http://bugs.python.org/issue3672 http://bugs.python.org/issue3297
But the question of whether Python should care about lone surrogates or not is at best tangential to the issue at hand. If you have lone surrogates in the Unicode API (and didn't raise an exception on the way getting there), then the sensible thing is to encode them into lone UTF-8 surrogates. Even if you wanted to prevent lone surrogates, encoding to UTF-8 for the binary API would not be the place to enforce it.
No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8.

On Sun, Dec 7, 2008 at 11:35, Adam Olsen rhamph@gmail.com wrote:
http://bugs.python.org/issue3672 http://bugs.python.org/issue3297
No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8.
I suspect it is a common and convenient but (according to what you say) misconceived expectation that using UTF-8 to encode any Unicode string will not raise an exception. This behavior is not something which should be discarded lightly.
I see little reason that this couldn't be a new codec or error handler that allowed people to choose between correct pure UTF-8 behavior or the technically incorrect but very practical behavior it currently has.
[My apologies, Adam, for sending this only to you the first time]
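
(A sketch of the "choose it by name" idea in terms of error handlers as they exist in later CPython; the surrogateescape handler added by PEP 383 grew out of exactly this kind of OS-data problem and is used here purely as an illustration of the choice Michael describes.)

    data = b"abc\xff"                      # not valid UTF-8

    # strict: the conforming behaviour -- reject
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        print("strict decode rejected the bad byte")

    # surrogateescape: smuggle the bad byte through as a lone surrogate,
    # recoverable on re-encoding with the same handler
    text = data.decode("utf-8", errors="surrogateescape")
    print(text)                                                   # 'abc\udcff'
    print(text.encode("utf-8", errors="surrogateescape") == data) # True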

On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman murman@gmail.com wrote:
On Sun, Dec 7, 2008 at 11:35, Adam Olsen rhamph@gmail.com wrote:
http://bugs.python.org/issue3672 http://bugs.python.org/issue3297
No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8.
I suspect it is a common and convenient but (according to what you say) misconceived expectation that using UTF-8 to encode any Unicode string will not raise an exception. This behavior is not something which should be discarded lightly.
It is *not* a valid Unicode string in the first place. Therein lies the problem.
I see little reason that this couldn't be a new codec or error handler that allowed people to choose between correct pure UTF-8 behavior or the technically incorrect but very practical behavior it currently has.
Note that many of the restrictions were added for security reasons. You might receive a UTF-8 encoded file name from a malicious user, check if it contains something dangerous (like "../../../../../etc/password"), then decode it. If your decoder isn't compliant (ie doesn't check for overly long sequences) then a b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
However, in this context we only need to allow lone surrogates. CESU-8 comes to mind. (It is a perverse world we live in.)
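
(A small illustration of the overlong-sequence attack, assuming a validating UTF-8 codec such as the one current CPython ships: a compliant decoder refuses C0 AF rather than quietly producing "/".)

    overlong_slash = b"\xc0\xaf"    # over-long two-byte encoding of U+002F '/'

    try:
        overlong_slash.decode("utf-8")
    except UnicodeDecodeError as e:
        print("compliant decoder rejects it:", e.reason)

    # A lax decoder that accepted the sequence would hand back '/', defeating
    # any byte-level check for "../" performed before decoding.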

On approximately 12/7/2008 10:56 AM, came the following characters from the keyboard of Adam Olsen:
You might receive a UTF-8 encoded file name from a malicious user, check if it contains something dangerous (like "../../../../../etc/password"), then decode it. If your decoder isn't compliant (ie doesn't check for overly long sequences) then a b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
You might indeed.
But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form, specifying what sorts of Unicode strictness (such as overlong sequences) to check for during the decode process, and once the string is in canonical form, _then_ do checks for various attacks, such as the ../ sequence you mention?
And with that order of operation, even if you don't reject overlong sequences, you have canonized them, and can recognize the resulting characters as good or bad.
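
(A rough sketch of the decode-first ordering described above, assuming the decode step is strict; is_safe_name is a hypothetical helper, not anything from the standard library.)

    import posixpath

    def is_safe_name(raw: bytes) -> bool:
        try:
            name = raw.decode("utf-8")        # 1. decode to a canonical form
        except UnicodeDecodeError:
            return False                      #    malformed input: reject outright
        # 2. only now apply the policy check, on the canonical text
        return ".." not in posixpath.normpath(name).split("/")

    print(is_safe_name(b"docs/readme.txt"))   # True
    print(is_safe_name(b"../../etc/passwd"))  # False: traversal caught after decoding
    print(is_safe_name(b"..\xc0\xaf.."))      # False: overlong bytes never reach the check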

Glenn Linderman writes:
But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form,
Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.
If you want the convenience and risk, I believe you should ask for it by name (I suggest a name like "own_me" for the relaxed decoding flag<wink>). Failing that, it would be nice to have a global flag to change the default.

On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:
But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form,
Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.
I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster.
I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the "vUTF-8" decoder for strict validation, and the "fUTF-8" decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, "UTF-8" already exists... as "fUTF-8", so for compatibility, I guess it shouldn't change... but it could be deprecated.
You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?
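
(One way to ask for laxness by name without a second codec is a registered error handler; codecs.register_error below is real standard-library machinery, while the "lax" handler and its replace-with-U+FFFD policy are invented for illustration and roughly mimic the built-in "replace" handler.)

    import codecs

    def lax(exc):
        # substitute U+FFFD for each undecodable byte instead of raising
        if isinstance(exc, UnicodeDecodeError):
            return ("\ufffd" * (exc.end - exc.start), exc.end)
        raise exc

    codecs.register_error("lax", lax)

    print(b"abc\xc0\xafdef".decode("utf-8", errors="lax"))
    # -> 'abc\ufffd\ufffddef': the bad bytes are marked, nothing is raised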

On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman v+python@g.nevcal.com wrote:
On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:
But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form,
Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.
I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster.
I'd like to see benchmarks of such a claim.
I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the "vUTF-8" decoder for strict validation, and the "fUTF-8" decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, "UTF-8" already exists... as "fUTF-8", so for compatibility, I guess it shouldn't change... but it could be deprecated.
You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?
Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to Python. It would be nice if the library validated too, but a major advantage of UTF-8 is that older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody downstream introduces problems with a non-validating decoder.

On approximately 12/7/2008 9:11 PM, came the following characters from the keyboard of Adam Olsen:
On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman v+python@g.nevcal.com wrote:
On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:
But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form,
Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.
I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster.
I'd like to see benchmarks of such a claim.
"significantly" seems to be the only word at question; it seems that there are a fair number of validation checks that could be performed; the numeric part of UTF-8 decoding is just a sequence of shifts, masks, and ORs, so can be coded pretty tightly in C or assembly language.
Anything extra would be slower; how much slower is hard to predict prior to the implementation. My "significantly" was just the expectation that the larger code with more conditional branches required for validation is less likely to stay in cache, takes longer to load into cache, and takes longer to execute. This also seems to be supported by Stephen's comment "That's a lot to ask, as it turns out."
Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C; I wonder if I could find that code. Can you supply a validated decoder? Then we could run some benchmarks, eh?
I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the "vUTF-8" decoder for strict validation, and the "fUTF-8" decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, "UTF-8" already exists... as "fUTF-8", so for compatibility, I guess it shouldn't change... but it could be deprecated.
You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?
Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to Python. It would be nice if the library validated too, but a major advantage of UTF-8 is that older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody downstream introduces problems with a non-validating decoder.
So I don't understand how this is responsive to the "decoding removes many insecurities" issue?
Yes, you might use libraries. Either they have insecurities, or not. Either they validate, or not. Either they decode, or not. They may be immune to certain attacks, because of their structure and code, or not.
So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does, and whether or not it may have vulnerabilities, and whether or not those vulnerabilities are likely or unlikely, whether you can reduce the likelihood or prevent the vulnerabilities by wrapping the API, etc. And so you choose to use the library, or not.
This whole discussion about libraries seems somewhat irrelevant to the question at hand, although it is certainly true that understanding how a library handles Unicode is an important issue for the potential user of a library.
So how does a non-validating decoder introduce problems? I can see that it might not solve all problems, but how does it introduce problems? Wouldn't the problems be introduced by something else, and the use of a non-validating decoder may not catch the problem... but not be the cause of the problem?
And then, if you would like to address the original issue, that would be fine too.
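
(A rough sketch of the benchmark Glenn proposes above. Current CPython ships only a validating UTF-8 decoder, so latin-1, which accepts every byte unchecked, stands in for a non-validating decode; the comparison is indicative at best.)

    import timeit

    data = ("päivää hyvää " * 1000).encode("utf-8")   # arbitrary multi-byte text

    validating = timeit.timeit(lambda: data.decode("utf-8"), number=10_000)
    unchecked = timeit.timeit(lambda: data.decode("latin-1"), number=10_000)

    print(f"utf-8 (validating):  {validating:.3f}s")
    print(f"latin-1 (no checks): {unchecked:.3f}s")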

On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman v+python@g.nevcal.com wrote:
On approximately 12/7/2008 9:11 PM, came the following characters from the keyboard of Adam Olsen:
On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman v+python@g.nevcal.com wrote:
Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C; I wonder if I could find that code. Can you supply a validated decoder? Then we could run some benchmarks, eh?
There is no point for me, as the behaviour of a real UTF-8 codec is clear. It is you who needs to justify a second non-standard UTF-8-ish codec. See below.
You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?
Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to Python. It would be nice if the library validated too, but a major advantage of UTF-8 is that older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody downstream introduces problems with a non-validating decoder.
So I don't understand how this is responsive to the "decoding removes many insecurities" issue?
Yes, you might use libraries. Either they have insecurities, or not. Either they validate, or not. Either they decode, or not. They may be immune to certain attacks, because of their structure and code, or not.
So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does, and whether or not it may have vulnerabilities, and whether or not those vulnerabilities are likely or unlikely, whether you can reduce the likelihood or prevent the vulnerabilities by wrapping the API, etc. And so you choose to use the library, or not.
This whole discussion about libraries seems somewhat irrelevant to the question at hand, although it is certainly true that understanding how a library handles Unicode is an important issue for the potential user of a library.
So how does a non-validating decoder introduce problems? I can see that it might not solve all problems, but how does it introduce problems? Wouldn't the problems be introduced by something else, and the use of a non-validating decoder may not catch the problem... but not be the cause of the problem?
And then, if you would like to address the original issue, that would be fine too.
Your non-validating decoder is translating an invalid sequence into a valid one, thus you are introducing the problem. A completely naive environment (8-bit clean ASCII) would leave it as an invalid sequence throughout.
This is not a theoretical problem. See http://tools.ietf.org/html/rfc3629#section-10. We MUST reject invalid sequences, or else we are not using UTF-8. There is no wiggle room, no debate.
(The absoluteness is why the standard behaviour doesn't need a benchmark. You are essentially arguing that, when logging in as root over the internet, it's a lot faster if you use telnet rather than ssh. One is simply not an option.)

Glenn Linderman writes:
"significantly" seems to be the only word at question; it seems that there are a fair number of validation checks that could be performed; the numeric part of UTF-8 decoding is just a sequence of shifts, masks, and ORs, so can be coded pretty tightly in C or assembly language.
Anything extra would be slower; how much slower is hard to predict prior to the implementation.
Not much, see my previous response.
This also seems to be supported by Stephen's comment "That's a lot to ask, as it turns out."
Not what I meant. Inefficiency is not an objection to checking for validity at the level a codec can handle. The objection is that "we don't want *any* exceptions thrown that we didn't explicitly ask for", and adding validation certainly will violate that.
So I don't understand how this is responsive to the "decoding removes many insecurities" issue?
Because you have to recheck every time the data crosses from Python into your code. To the extent that Python codecs promise validation and keep that promise, internal code *never* has to make those checks. That is a significant savings in programmer effort, because auditing a large body of code for *any* I/O from Python is going to be costly.
So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does, and whether or not it may have vulnerabilities, and whether or not those vulnerabilities are likely or unlikely, whether you can reduce the likelihood or prevent the vulnerabilities by wrapping the API, etc. And so you choose to use the library, or not.
Python is precisely such a component that people will choose to use, or not, based on whether they can expect that when Python hands them a Unicode object freshly input from the outside world, it won't contain lone surrogates, or invalid UTF-8 characters that got through a 3rd-party spam filter, or whatever.
This whole discussion about libraries seems somewhat irrelevant to the question at hand,
No, it's the *only* point that matters. IMO, speed is not relevant here. The question is whether throwing a Unicode exception on invalid encoding by default generally does more good than harm. Guido seems to think "not!", which gives me pause.<wink> I still disagree, though.

Glenn Linderman writes:
On approximately 12/7/2008 8:13 PM, came the following characters from
I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding?
I think you're thinking of XML, where validation can take significant resources over and above syntax checking. For Unicode, not unless you're seriously CPU-bound. Unicode validation is a matter of a few range checks and a couple of flags to handle things like lone surrogates.
In the case of "excess length" in UTF-8, you can actually often do it in *zero* time if you use a table to analyze the leading byte (eg, 0xC0 and 0xC1 are invalid UTF-8 leading bytes because they would necessarily decode to U+0000 to U+007F, ie, the ASCII range), because you have to make a check for 0xFE and 0xFF anyway, which can't be UTF-8 leading bytes. (I'm not sure this generalizes to longer UTF-8 sequences, but it would reject the use of 0xC0 0xAF to sneak in a "/" in zero time!)
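
(The lead-byte classification Stephen describes, spelled out in Python for readability; a real decoder would index a 256-entry C array with the byte value, which is why the check costs essentially nothing.)

    def lead_byte_class(b: int) -> str:
        if b <= 0x7F:
            return "ascii"            # single-byte sequence
        if b <= 0xBF:
            return "continuation"     # never valid as a lead byte
        if b <= 0xC1:
            return "invalid"          # 0xC0/0xC1 could only start over-long
                                      # encodings of U+0000..U+007F
        if b <= 0xDF:
            return "lead-2"
        if b <= 0xEF:
            return "lead-3"
        if b <= 0xF4:
            return "lead-4"
        return "invalid"              # 0xF5..0xFF, including 0xFE and 0xFF

    print(lead_byte_class(0xC0))      # 'invalid' -- 0xC0 0xAF is dead on arrival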
So I think it should be logically decoupled... do validation when/where it is needed for security reasons,
Security is an important application, but the real issue is that naively decoded text is a bomb with a sensitive impact fuse. Pass it around long enough, and it will blow up eventually.
The whole point of the fairly complex rules about Unicode formats, and the *requirement* that broken coding be a fatal error *in a conforming Unicode process*, is to ensure that Unicode exceptions[1] only ever occur on input (or memory corruption and the like, which is actually a form of I/O, of course). That's where efficiency comes from.
I think Python 3 should aspire to (eventually) be a conforming process by default, with lax behavior an option.
and allow internal [de]coding to be faster.
"Internal decoding" is (or should be) an oxymoron. Why would your software be passing around text in any format other than internal? So decoding will happen (a) on I/O, which is itself almost certainly slower than making a few checks for Unicode hygiene, or (b) on receipt of data from other software that whose sanitation you shouldn't trust more than you trust the Internet.
Encoding isn't a problem, AFAICS.
You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?
Because as long as you're decoding anyway, it costs no more to do it right, except in rare cases. Why do you think Python should aspire to "quick and dirty" in a context where dirty is known to be unhealthy, and there is no known need for speed? Why impose "doing it right" on the application programmer when there's a well-defined spec for it that we could implement in the standard library?
It's the errors themselves that people are objecting to. See Guido's posts for concisely stated arguments for a "don't ask, don't tell" policy toward Unicode breakage. I agree that Python should implement that policy as an option, but I think that the user should have to request it either with a runtime option or (in the case of user == app programmer) by deliberately specifying a lax codec. The default *Unicode* codecs should definitely aspire to full Unicode conformance within their sphere of responsibility.
Footnotes: [1] A character outside the repertoire that the app can handle is not a "Unicode exception", unless the reason the app can't handle it is that the Unicode handler blew up.

On approximately 12/8/2008 12:57 AM, came the following characters from the keyboard of Stephen J. Turnbull:
"Internal decoding" is (or should be) an oxymoron. Why would your software be passing around text in any format other than internal? So decoding will happen (a) on I/O, which is itself almost certainly slower than making a few checks for Unicode hygiene, or (b) on receipt of data from other software that whose sanitation you shouldn't trust more than you trust the Internet.
Encoding isn't a problem, AFAICS.
So I can see validating user supplied data, which always comes in via I/O.
But during manipulation of internal data, including file and database I/O, there is a need for encoding and decoding also. If all the data has already been validated, then there would be no need to revalidate on every conversion.
I hear you when you say that clever coding can make the validation nearly free, and I applaud that: the UTF-8 coder that I wrote predated most of the rules that have been created since, so I didn't attempt to be clever in that regard.
Thanks to you and Adam for your explanations; I see your points, and if it is nearly free, I withdraw most of my negativity on this topic.
participants (5)
- Adam Olsen
- Glenn Linderman
- Hagen Fürstenau
- Michael Urman
- Stephen J. Turnbull