
Skip Montanaro wrote:
I haven't been following this discussion closely at all, and have no previous experience with Unicode, so please pardon a couple stupid questions from the peanut gallery:
1. What does U+0061 mean (other than 'a')? That is, what is U?
U+XXXX means the Unicode character whose ordinal, written in hex, is XXXX. It is basically just another way of saying "I want the Unicode character at position 0xXXXX in the Unicode spec". U+0061, for example, is position 0x61, which is 'a'.
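To make the notation concrete, here is a minimal sketch in present-day Python 3 (an assumption on my part; the proposal discussed here predates it). The built-ins ord() and chr() map between a character and its ordinal; the helper name u_plus is made up for this example.

```python
def u_plus(ch):
    """Format a single character in U+XXXX notation (hypothetical helper)."""
    # ord() returns the character's ordinal; %04X pads it to at least
    # four upper-case hex digits, as the U+XXXX convention expects.
    return "U+%04X" % ord(ch)


# 'a' sits at position 0x61 in the Unicode spec.
assert u_plus("a") == "U+0061"

# Going the other way: position 0x0061 is the character 'a'.
assert chr(0x0061) == "a"

# The \uXXXX escape in a string literal names the same position.
assert "\u0061" == "a"
```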
2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter description. Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2? Seems like I would do something like
    u1 = unicode(s, encoding=e1)
    f = open("somefile", "wb")
    u2 = unicode(u1, encoding=e2)
    f.write(u2)
Is that how it would be done? Does this question even make sense?
The unicode() constructor converts all input to Unicode as the basis for other conversions. In the above example, s would be converted to Unicode on the assumption that the bytes in s represent characters encoded using the encoding given in e1.
The line with u2 would raise a TypeError, because u1 is not a string. To convert a Unicode object u1 to another encoding, you would have to call the .encode() method with the intended new encoding. The Unicode object will then take care of converting its internal Unicode data into a string using the given encoding, e.g. you'd write:
    f.write(u1.encode(e2))
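For readers who want to try the decode-then-encode round trip, here is a rough sketch in present-day Python 3 (again an assumption, not the API of the proposal): there, bytes.decode(e1) plays the role of unicode(s, e1), and str.encode(e2) produces the bytes to write. The function name transcode and the sample data are made up for illustration.

```python
import os
import tempfile


def transcode(s, e1, e2):
    """Re-encode the byte string s from encoding e1 to encoding e2."""
    u1 = s.decode(e1)      # bytes -> Unicode, interpreting s as e1
    return u1.encode(e2)   # Unicode -> bytes in the target encoding


# "café" as Latin-1 bytes: the e-acute is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")

# Write the re-encoded bytes to a file opened in binary mode,
# mirroring f.write(u1.encode(e2)) from the answer above.
path = os.path.join(tempfile.mkdtemp(), "somefile")
with open(path, "wb") as f:
    f.write(utf8_bytes)
```

The key point is the same as in the reply: the conversion always goes through Unicode, never directly from one byte encoding to another.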
3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)?
If you don't want your scripts to know about Unicode, nothing will really change. If you do use e.g. Latin-1 characters in string literals in your scripts, you are asked to include a pragma in the comment lines at the beginning of the script (so that programmers viewing your code using other encodings have a chance to figure out what you've written).
Here's the text from the proposal:
"""
Note that you should provide some hint to the encoding you used to write your programs as a pragma line in one of the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read your source strings too.
"""
Other than that you can continue to use normal strings like you always have.
Hope that clarifies things at least a bit,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/