Unicode program representation

Neil Hodgson neilh at hare.net.au
Mon Apr 3 03:25:14 CEST 2000


  I've been playing around with the Unicode support in Python 1.6.

  IDLE uses a Unicode enabled text widget so a script like the following:
print "hi Щ"
   works well. There is a good chance this didn't travel over mail well, so
the string is 'h', 'i', ' ' (space), 'Щ' (U+0429 cyrillic capital letter
shcha). The last letter looks like an angular 'W' with a tail.

   This saves as UTF-8, so it looks like IDLE's text widget stores data in
UTF-8 form. In my opinion, this is very much the right thing to do and I am
thinking of adding Unicode support to Scintilla so Pythonwin will also be
able to work like this.

   This means that if you want to use non-roman characters in string
constants then you should use 'ASCII' strings rather than the new u" form.
With:

# Plain ('ASCII') string literal
i = 0
s = "hi Щ"
print s
for x in s:
    print i, x
    i = i + 1

# Unicode string literal using the new u" form
i = 0
su = u"hi Щ"
print su
for x in su:
    print i, x
    i = i + 1

   Printing the 'ASCII' string did what I wanted but the Unicode form did
not.

   This leads to the question of what the u" form is for. The current
answer is that u" takes the ASCII string and makes a Unicode string object
by extending each byte with another zero byte. Because it's a Unicode string
object it behaves appropriately with Unicode aware functions.
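
   The byte-widening behaviour described above can be imitated in a few
lines. This is only a sketch in present-day Python 3, not the Python 1.6
under discussion, and the names raw and widened are made up for
illustration. It assumes the editor saved the literal as UTF-8.

```python
# Python 3 sketch; the post itself is about Python 1.6.
raw = "hi \u0429".encode("utf-8")   # the bytes a UTF-8 editor would save

# Widen each byte to a code point of the same value (high byte zero).
widened = "".join(chr(b) for b in raw)

print(list(raw))                          # five byte values, not four
print(len(widened))                       # still five "characters"
print(widened == raw.decode("latin-1"))   # widening matches a Latin-1 decode
```

   The two bytes that encode U+0429 end up as two separate characters, so
the resulting Unicode string no longer contains the letter that was typed.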

   I think this should be changed so that the literal is interpreted as
UTF-8. The advantage here is that non-roman string literals become a
natural part of the language.
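
   What the proposed interpretation would give can be sketched as
follows, again in present-day Python 3 rather than Python 1.6. The
variable names are illustrative, and the bytes are assumed to be exactly
what a UTF-8 editor saved for the literal.

```python
import unicodedata

raw = b"hi \xd0\xa9"         # UTF-8 bytes for 'h', 'i', ' ', U+0429
text = raw.decode("utf-8")   # treat the literal's bytes as UTF-8

# Each item is now one character, including the Cyrillic letter.
for i, ch in enumerate(text):
    print(i, "U+%04X" % ord(ch), unicodedata.name(ch))
```

   Here the loop sees four characters, and the last one is the single
code point U+0429 instead of two unrelated ones.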

   The example scripts do not work well from the command line. Thought
should be given to whether, on NT, the console should be reopened in wide
character mode so that Unicode I/O will work well.

   Saving the scripts as UCS-2 shows that the interpreter is unable to deal
with UCS-2 scripts, which is what I expected. I think very few people will
be creating script files in UCS-2, instead preferring to keep source code in
UTF-8 files.
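
   The practical difference between the two encodings for a mostly ASCII
source file shows up in a short Python 3 sketch. The sample source line is
only an illustration, and "utf-16-le" stands in here for UCS-2 without a
byte order mark. UTF-8 leaves the ASCII bytes unchanged, while 16-bit code
units put a zero byte beside every ASCII character, which a byte-oriented
parser would not expect.

```python
source = 'print "hi"\n'            # an all-ASCII line of source code

utf8 = source.encode("utf-8")
ucs2 = source.encode("utf-16-le")  # 16-bit code units, little-endian

print(utf8 == source.encode("ascii"))   # UTF-8 is byte-identical to ASCII here
print(len(utf8), len(ucs2))             # the 16-bit form is twice as long
print(b"\x00" in utf8, b"\x00" in ucs2) # only the 16-bit form has zero bytes
```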

   Neil





