to express unicode string

Sat Jan 28 19:10:19 EST 2012

On 01/28/2012 04:03 PM, Terry Reedy wrote:
> On 1/28/2012 2:58 PM, Michael Torrie wrote:
>> On 01/28/2012 12:21 AM, contro opinion wrote:
>>> >>> s='你好'
>>
>> On my computer, s is a byte string that contains the utf-8 formatted
>> encoding of 你好.
> 
> On mine, s is a (unicode) string containing those two characters. That 
> is because I pasted the above into IDLE 3.2.2 (on Win7, but should be 
> the same on all systems). (Pasting into the standard interpreter window, 
> which uses Windows stupid Command Prompt interface, does not work.)

Well, yes, of course, because you are using Python 3.  With Python 2, a
plain string literal is a byte string, so it holds whatever bytes the
terminal sent, in whatever character set the terminal is using.  If you
are using a terminal (such as the MS Windows cmd prompt) that isn't
UTF-8, things can be confusing.  So if you are going to use Python 2
with string literals that contain non-ASCII characters, you have to
worry about which text encoding you are working with.
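
To make the bytes-versus-characters point concrete, here is a minimal
Python 3 sketch.  The variable names are just illustrative, and GBK is
only an assumed example of a non-UTF-8 terminal encoding; your terminal
may well use something else.

```python
# The two characters from the original post, written as escapes so the
# example does not depend on how this snippet itself is encoded.
s = '\u4f60\u597d'  # 你好

# The same text becomes different byte sequences under different encodings.
utf8_bytes = s.encode('utf-8')
gbk_bytes = s.encode('gbk')   # assumption: GBK stands in for a non-UTF-8 terminal

print(len(s), len(utf8_bytes), len(gbk_bytes))  # 2 6 4
print(utf8_bytes.hex(), gbk_bytes.hex())        # e4bda0e5a5bd c4e3bac3

# Decoding with the encoding that produced the bytes restores the text.
roundtrip = utf8_bytes.decode('utf-8')

# Decoding with the wrong encoding does not: you get garbage characters.
mojibake = utf8_bytes.decode('gbk', errors='replace')
print(roundtrip == s, mojibake == s)            # True False
```

A Python 2 byte string is in the position of utf8_bytes or gbk_bytes
here: it has no idea which encoding produced it, so you have to keep
track of that yourself.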

Even in Python 3, when you write this to your python file:
s='你好'
how that string literal is encoded to bytes in the .py file depends on
your text editor.  Python 2 complains if the file contains non-ASCII
bytes and you don't declare the encoding at the top of the file, even
when the file is UTF-8.  Python 3 assumes UTF-8 by default (a good guess
these days).  If your Python 3 file is UTF-8, or declares its actual
encoding, then the interpreter automatically decodes string literals to
unicode as the file is being parsed.
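
As a rough illustration of how Python 3 picks the source encoding, the
sketch below builds two tiny "files" in memory as bytes and asks the
standard tokenize module which encoding it would use, then compiles
them.  The sources and the helper name source_encoding are made up for
the example; GBK is again just an assumed non-UTF-8 editor setting.

```python
import io
import tokenize

text = '\u4f60\u597d'  # 你好

# A source saved as UTF-8 with no declaration, and one saved as GBK with
# a coding declaration on the first line.
src_utf8 = ("s = '" + text + "'\n").encode('utf-8')
src_gbk = ("# -*- coding: gbk -*-\ns = '" + text + "'\n").encode('gbk')

def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

enc_utf8 = source_encoding(src_utf8)
enc_gbk = source_encoding(src_gbk)
print(enc_utf8, enc_gbk)  # utf-8 gbk

# Compiling from bytes applies the same rule, so both literals end up as
# the same two-character str, although the bytes on "disk" differ.
ns_utf8, ns_gbk = {}, {}
exec(compile(src_utf8, '<utf8 source>', 'exec'), ns_utf8)
exec(compile(src_gbk, '<gbk source>', 'exec'), ns_gbk)
print(ns_utf8['s'] == ns_gbk['s'] == text)  # True
```

So the declaration only has to match what your editor actually wrote;
once the file is parsed, the string is the same either way.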

> To the OP. if you want to work easily with unicode, use Python 3.2 now 
> and Python 3.3 as soon as it comes out, in less than a year. We went 
> through the hassle of changing the string type from bytes to unicode 
> *because* having unicode as merely an add-on type was not working very well.

Agreed.