Unicode Proposal: Version 0.4

I've uploaded a new version of the proposal which incorporates a lot of what has been discussed on the list. Thanks to everybody who helped so far.

Note that I have extended the list of references for those who want to join in, but are in need of more background information.

The latest version of the proposal is available at:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

    http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

· support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ )

· support for case conversion:

  Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings.

· support for numbers, digits, whitespace, etc.

· support (or no support) for private code point areas

· should Unicode objects support %-formatting ?

  One possibility would be to emulate this via strings and <default encoding>:

    s = '%s %i abcäöü'  # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

· specifying file wrappers:

  Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
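The case-conversion problems listed above can be reproduced in a present-day Python 3 interpreter, where the built-in str type is a Unicode string. This is only a small sketch in modern Python, not the API proposed in this draft, and the helper name case_mapping_report is made up for illustration.

```python
def case_mapping_report(text):
    """Return the upper-, lower- and title-cased forms of text with their lengths."""
    forms = {
        "original": text,
        "upper": text.upper(),
        "lower": text.lower(),
        "title": text.title(),
    }
    return {name: (value, len(value)) for name, value in forms.items()}


# German sharp s (U+00DF) uppercases to two characters, so the length changes.
sharp_s = case_mapping_report("stra\u00dfe")

# U+01C6 has an uppercase form (U+01C4) that differs from its titlecase
# form (U+01C5), i.e. the capital used at the start of a word is not the
# one used when the whole word is capitalised.
digraph = case_mapping_report("\u01c6")

# ascii() keeps the output printable on terminals without Unicode support.
print(ascii(sharp_s["original"]), "->", ascii(sharp_s["upper"]))
print(ascii(digraph["upper"]), "vs", ascii(digraph["title"]))
```

Note that these str methods apply the default Unicode mappings only; the locale dependent deviations mentioned above (e.g. Turkish dotted and dotless i) are not covered by them.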

FYI, I've uploaded a new version of the proposal which incorporates proposals for line breaks, case mapping, character properties and private code points support.

The latest version of the proposal is available at:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

    http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

· should Unicode objects support %-formatting ?

  One possibility would be to emulate this via strings and <default encoding>:

    s = '%s %i abcäöü'  # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

· specifying file wrappers:

  Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here.

--
Marc-Andre Lemburg
Y2000: 48 days left
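For readers who want to try the %-formatting emulation sketched above, here is a rough translation to present-day Python 3 (3.5 or later, where bytes objects support the % operator). It is an illustration under assumptions, not the proposal's API: DEFAULT_ENCODING stands in for <default encoding> and is assumed to be UTF-8, and the function name format_via_default_encoding is invented for this example.

```python
DEFAULT_ENCODING = "utf-8"  # assumption: stands in for <default encoding>


def format_via_default_encoding(latin1_template, args):
    """Emulate Unicode %-formatting by going through encoded byte strings."""
    # Convert the Latin-1 template to a <default encoding> byte string.
    s1 = latin1_template.decode("latin-1").encode(DEFAULT_ENCODING)

    # Text arguments have to be encoded the same way before they can be
    # substituted into a bytes template; numbers are passed through as-is.
    encoded_args = tuple(
        arg.encode(DEFAULT_ENCODING) if isinstance(arg, str) else arg
        for arg in args
    )

    # The '%s' now adds the text argument in <default encoding>.
    s2 = s1 % encoded_args

    # Finally, convert the encoded result back to a Unicode string.
    return s2.decode(DEFAULT_ENCODING)


template = b"%s %i abc\xe4\xf6\xfc"  # a Latin-1 encoded byte string
result = format_via_default_encoding(template, ("\u20ac", 3))
print(ascii(result))
```

The round trip only works if the default encoding can represent every character involved, which is one reason to pick UTF-8 for the sketch.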

FYI, I've uploaded a new version of the proposal which incorporates many things we have discussed lately, e.g. the buffer interface, "s#" vs. "t#", etc.

The latest version of the proposal is available at:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

    http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

· Unicode objects support for %-formatting

· specifying StreamCodecs

--
Marc-Andre Lemburg
Y2000: 45 days left

FYI, I've uploaded a new version of the proposal which includes new codec APIs, a new codec search mechanism and some minor fixes here and there.

The latest version of the proposal is available at:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

    http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

· Unicode objects support for %-formatting

· Design of the internal C API and the Python API for the Unicode character properties database

--
Marc-Andre Lemburg
Y2000: 43 days left
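To give an idea of what a codec search mechanism can look like, here is a sketch using the codecs module of present-day Python 3 (codecs.register and codecs.lookup). The mechanism described in this draft may differ in detail; the encoding name "examplelatin" and the function example_search are made up for illustration.

```python
import codecs


def example_search(name):
    """Search function: map the made-up name 'examplelatin' to Latin-1."""
    if name == "examplelatin":
        # Reuse the CodecInfo of an existing codec instead of writing one.
        return codecs.lookup("latin-1")
    # Returning None tells the registry to try the next search function.
    return None


# Search functions are consulted when a name is not already known.
codecs.register(example_search)

encoded = "caf\xe9".encode("examplelatin")
decoded = encoded.decode("examplelatin")
print(ascii(encoded), ascii(decoded))
```

The point of the design is that new encodings can be plugged in by name without touching the string or Unicode types themselves.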

I haven't been following this discussion closely at all, and have no previous experience with Unicode, so please pardon a couple stupid questions from the peanut gallery:

1. What does U+0061 mean (other than 'a')? That is, what is U?

2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter description. Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2? Seems like I would do something like

       u1 = unicode(s, encoding=e1)
       f = open("somefile", "wb")
       u2 = unicode(u1, encoding=e2)
       f.write(u2)

   Is that how it would be done? Does this question even make sense?

3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)?

Thx,

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...

Skip Montanaro wrote:
> 1. What does U+0061 mean (other than 'a')? That is, what is U?

U+XXXX denotes the Unicode character whose ordinal is the hexadecimal number XXXX. It is basically just another way of saying: I want the Unicode character at position 0xXXXX in the Unicode spec.
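The notation is easy to convert in both directions. The following sketch uses present-day Python 3, where chr() and ord() work on Unicode code points; the helper names to_u_notation and from_u_notation are invented for this example.

```python
def to_u_notation(char):
    """Return the U+XXXX label for a single character."""
    return "U+%04X" % ord(char)


def from_u_notation(label):
    """Return the character named by a U+XXXX label."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    # The part after 'U+' is the ordinal written in hexadecimal.
    return chr(int(label[2:], 16))


print(to_u_notation("a"))
print(from_u_notation("U+0061"))
```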
> 2. [...] Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2?

The unicode() constructor converts all input to Unicode as the basis for other conversions. In the above example, s would be converted to Unicode on the assumption that the bytes in s represent characters encoded using the encoding given in e1. The line with u2 would raise a TypeError, because u1 is not a string.

To convert a Unicode object u1 to another encoding, you would have to call the .encode() method with the intended new encoding. The Unicode object will then take care of converting its internal Unicode data into a string using the given encoding, e.g. you'd write:

    f.write(u1.encode(e2))
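The same decode-then-encode pattern can be tried in present-day Python 3, where bytes.decode() plays the role of unicode(s, e1) and str.encode() the role of u1.encode(e2). This is a sketch in modern Python rather than the API of this draft; the function name transcode is made up, and an in-memory io.BytesIO object stands in for the file.

```python
import io


def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes from source_encoding to target_encoding."""
    # Step 1: interpret the bytes as text in the source encoding (e1).
    text = data.decode(source_encoding)
    # Step 2: encode the text in the target encoding (e2).
    return text.encode(target_encoding)


latin1_bytes = b"caf\xe9"  # 'cafe' with e-acute, encoded as Latin-1

# Stand-in for f = open("somefile", "wb").
f = io.BytesIO()
f.write(transcode(latin1_bytes, "latin-1", "utf-8"))
print(f.getvalue())
```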
> 3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)?
If you don't want your scripts to know about Unicode, nothing will really change.

In case you do use e.g. Latin-1 characters in your scripts for strings, you are asked to include a pragma in the comment lines at the beginning of the script (so that programmers viewing your code using other encodings have a chance to figure out what you've written).

Here's the text from the proposal:

"""
Note that you should provide some hint to the encoding you used to write your programs as a pragma line in one of the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read your source strings too.
"""

Other than that you can continue to use normal strings like you always have.

Hope that clarifies things at least a bit,

--
Marc-Andre Lemburg
Y2000: 43 days left
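As a point of comparison, present-day Python 3 recognises an encoding declaration of the form "# -*- coding: latin-1 -*-" in the first two lines of a source file. This spelling differs from the pragma text quoted above, so the sketch below shows the idea rather than the proposal's syntax. The helper name run_source is made up for illustration.

```python
def run_source(source_bytes):
    """Compile and run Python source given as bytes; return its namespace."""
    namespace = {}
    # compile() honours the coding declaration when it decodes the bytes.
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace


# A tiny "source file" stored as Latin-1 bytes: \xfc is u-umlaut and
# \xdf is sharp s in that encoding.
latin1_source = b"# -*- coding: latin-1 -*-\ngreeting = 'Gr\xfc\xdfe'\n"

namespace = run_source(latin1_source)
print(ascii(namespace["greeting"]))
```

Pure 7-bit ASCII sources need no such line, which matches the point made in the proposal text.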

FYI, I've uploaded a new version of the proposal which includes the encodings package, definition of the 'raw unicode escape' encoding (available via e.g. ur""), Unicode format strings and a new method .breaklines().

The latest version of the proposal is available at:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

    http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

· Stream readers:

  What about .readline(), .readlines() ? These could be implemented using .read() as generic functions instead of requiring their implementation by all codecs. Also see Line Breaks.

· Python interface for the Unicode property database

· What other special Unicode formatting characters should be enhanced to work with Unicode input ? Currently only the following special semantics are defined:

    u"%s %s" % (u"abc", "abc") should return u"abc abc".

Pretty quiet around here lately...

--
Marc-Andre Lemburg
Y2000: 38 days left
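The stream reader point above, building .readline() and .readlines() generically on top of .read(), could look roughly like this. It is a sketch in present-day Python 3, not code from the proposal: the function names are invented, io.StringIO stands in for a decoding stream reader, and only "\n" is treated as a line break, which ignores the Unicode line break rules discussed under Line Breaks.

```python
import io


def generic_readline(stream):
    """Read one line using nothing but stream.read(1)."""
    chars = []
    while True:
        char = stream.read(1)
        if not char:
            break  # end of stream
        chars.append(char)
        if char == "\n":
            break  # simplification: only '\n' ends a line here
    return "".join(chars)


def generic_readlines(stream):
    """Read all remaining lines by calling generic_readline() repeatedly."""
    lines = []
    while True:
        line = generic_readline(stream)
        if not line:
            return lines
        lines.append(line)


reader = io.StringIO("first\nsecond\nlast")
print(generic_readlines(reader))
```

Reading one character at a time is slow; a real implementation would read larger chunks and buffer the remainder, at the cost of more bookkeeping.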

> Pretty quiet around here lately...
My guess is that most positions and opinions have been covered. It is now probably time for less talk, and more code!

Is it time to start an implementation plan? Do we start with /F's Unicode implementation (which /G *smirk* seemed to approve of)? Who does what? When can we start to play with it?

And a key point that seems to have been thrust in our faces at the start and hardly mentioned recently: does the proposal as it stands meet our sponsor's (HP) requirements?

Mark.

Mark Hammond wrote:
> My guess is that most positions and opinions have been covered.

Or that everybody is on holidays... like Guido.
> When can we start to play with it?

This depends on whether HP agrees on the current specs. If they do, there should be code by mid December, I guess.
> [...] does the proposal as it stands meet our sponsor's (HP) requirements?

Haven't heard anything from them yet (this is probably mainly due to Guido being offline).

--
Marc-Andre Lemburg
Y2000: 37 days left

Participants (3): M.-A. Lemburg, Mark Hammond, Skip Montanaro