Python 1.5.2 modules need porting to 2.0 because of unicode - comments please
> I doubt that we can fix all Unicode related bugs in the 2.0 stdlib before the final release... let's make this a project for 2.1.

Exactly my feelings. Since we cannot possibly fix all problems, we may need to change the behaviour later.

If we now silently do the wrong thing, silently changing it to the then-right thing in 2.1 may break people's code. So I'm asking that cases where it does not clearly do the right thing produce an exception now; we can later fix it to accept more cases, should the need occur.

In the specific case, dropping support for Unicode output in binary files is the right thing. We don't know what the user expects, so it is better to produce an exception than to silently put incorrect bytes into the stream - that is a bug that we can still fix.

The easiest way with the clearest impact is to drop the buffer interface in unicode objects. Alternatively, not supporting them for "s#" also appears reasonable. Users experiencing the problem in testing will then need to make an explicit decision about how they want to encode the Unicode objects.

If expediting the issue is necessary, I can submit a bug report and propose a patch.

Regards,
Martin
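[Editor's note: the behaviour Martin proposes - rejecting unencoded Unicode on binary streams and forcing an explicit encoding decision - is exactly what later Python versions adopted. A minimal sketch in modern Python:]

```python
import io

# A binary stream accepts only bytes. Passing a Unicode string raises
# TypeError instead of silently writing implementation-defined bytes.
buf = io.BytesIO()
try:
    buf.write("häuser")  # Unicode object, no encoding chosen
    raise AssertionError("expected TypeError")
except TypeError:
    pass

# The caller must make the encoding decision explicit:
buf.write("häuser".encode("utf-8"))
assert buf.getvalue() == "häuser".encode("utf-8")
```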
> > I doubt that we can fix all Unicode related bugs in the 2.0 stdlib before the final release... let's make this a project for 2.1.
>
> Exactly my feelings. Since we cannot possibly fix all problems, we may need to change the behaviour later.
>
> If we now silently do the wrong thing, silently changing it to the then-right thing in 2.1 may break people's code. So I'm asking that cases where it does not clearly do the right thing produce an exception now; we can later fix it to accept more cases, should the need occur.
>
> In the specific case, dropping support for Unicode output in binary files is the right thing. We don't know what the user expects, so it is better to produce an exception than to silently put incorrect bytes into the stream - that is a bug that we can still fix.
>
> The easiest way with the clearest impact is to drop the buffer interface in unicode objects. Alternatively, not supporting them for "s#" also appears reasonable. Users experiencing the problem in testing will then need to make an explicit decision about how they want to encode the Unicode objects.
>
> If expediting the issue is necessary, I can submit a bug report and propose a patch.
Sounds reasonable to me (but I haven't thought of all the issues).

For writing binary Unicode strings, one can use

    f.write(u.encode("utf-16"))     # Adds byte order mark
    f.write(u.encode("utf-16-be"))  # Big-endian
    f.write(u.encode("utf-16-le"))  # Little-endian

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
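[Editor's note: the difference between the three codecs is easy to verify; a small sketch - the exact "utf-16" output depends on the platform's byte order:]

```python
text = u"abc"

bom = text.encode("utf-16")     # byte order mark prefix, platform byte order
be = text.encode("utf-16-be")   # big-endian, no BOM
le = text.encode("utf-16-le")   # little-endian, no BOM

# Two bytes per character for these code points:
assert be == b"\x00a\x00b\x00c"
assert le == b"a\x00b\x00c\x00"

# "utf-16" is one of the explicit forms, prefixed with its BOM:
assert bom in (b"\xff\xfe" + le, b"\xfe\xff" + be)
```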
Guido van Rossum wrote:
> > > I doubt that we can fix all Unicode related bugs in the 2.0 stdlib before the final release... let's make this a project for 2.1.
> >
> > Exactly my feelings. Since we cannot possibly fix all problems, we may need to change the behaviour later.
> >
> > If we now silently do the wrong thing, silently changing it to the then-right thing in 2.1 may break people's code. So I'm asking that cases where it does not clearly do the right thing produce an exception now; we can later fix it to accept more cases, should the need occur.
> >
> > In the specific case, dropping support for Unicode output in binary files is the right thing. We don't know what the user expects, so it is better to produce an exception than to silently put incorrect bytes into the stream - that is a bug that we can still fix.
> >
> > The easiest way with the clearest impact is to drop the buffer interface in unicode objects. Alternatively, not supporting them for "s#" also appears reasonable. Users experiencing the problem in testing will then need to make an explicit decision about how they want to encode the Unicode objects.
> >
> > If expediting the issue is necessary, I can submit a bug report and propose a patch.
>
> Sounds reasonable to me (but I haven't thought of all the issues).
>
> For writing binary Unicode strings, one can use
>
>     f.write(u.encode("utf-16"))     # Adds byte order mark
>     f.write(u.encode("utf-16-be"))  # Big-endian
>     f.write(u.encode("utf-16-le"))  # Little-endian
Right. Possible ways to fix this:

1. Disable Unicode's getreadbuf slot.

   This would effectively make Unicode objects unusable for all APIs which use "s#"... and probably give people a lot of headaches. OTOH, it would probably motivate lots of users to submit patches which make the stdlib Unicode aware (hopefully ;-)

2. Same as 1., but also make "s#" fall back to getcharbuf in case getreadbuf is not defined.

   This would make Unicode objects compatible with "s#", but still prevent writing of binary data: getcharbuf returns the Unicode object encoded using the default encoding, which is ASCII per default.

3. Special-case "s#" in some way to handle Unicode, or to raise an exception pointing explicitly to the problem and its (possible) solution.

I'm not sure which of these paths to take. Perhaps solution 2. is the most feasible compromise between "exceptions everywhere" and "encoding confusion".

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
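[Editor's note: option 2's effect can be sketched in modern Python terms. `s_hash_bytes` below is a hypothetical helper standing in for what an "s#" consumer would see, not a real API: raw buffers pass through, while Unicode is run through the ASCII default encoding and fails loudly for non-ASCII text instead of leaking internal bytes:]

```python
def s_hash_bytes(obj, default_encoding="ascii"):
    """Hypothetical model of "s#" under option 2 (not a real CPython API)."""
    if isinstance(obj, bytes):
        return obj                       # getreadbuf: raw bytes pass through
    return obj.encode(default_encoding)  # getcharbuf: default-encoded text

assert s_hash_bytes(b"\x00\xff") == b"\x00\xff"  # binary data still works
assert s_hash_bytes("abc") == b"abc"             # ASCII text is accepted
try:
    s_hash_bytes("häuser")                       # non-ASCII fails loudly
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```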
participants (3)

- Guido van Rossum
- M.-A. Lemburg
- Martin v. Loewis