PEP 263 comments

Fri Mar 1 15:34:02 EST 2002

huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> I've been following this discussion with quite some interest, but I do not
> have the background to delimit the scope of various concepts.  Is there a
> gentle introduction for a Unicode newbie?

There are a number of introductions to Unicode; you may want to search
www.unicode.org, e.g.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

> >IMO, the Python source code parser should never see any text data[1]
> >that is not UTF-8 encoded.  
> 
> Presumably this discussion only concerns unicode strings - I don't think
> we want to lose the ability to read in arbitrary binary data as a raw string.

First and foremost, the discussion is only about source code. A byte
string should certainly be able to store arbitrary bytes. Under
Stephen's proposal, it would indeed not be possible anymore to put
arbitrary binary data into source code.
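
A minimal sketch of what this restriction means, assuming present-day
Python 3 semantics rather than the interpreter under discussion. The
helper is_valid_utf8 and the sample bytes are hypothetical, chosen only
for illustration: a source line that embeds raw EUC-JP bytes is not
valid UTF-8, while the same bytes written as ASCII escapes are.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Two bytes that are the EUC-JP encoding of HIRAGANA LETTER A (U+3042).
euc_bytes = b"\xa4\xa2"

# A source line with the raw bytes pasted directly into a string literal.
source_with_raw_bytes = b'data = "' + euc_bytes + b'"\n'

# The same data expressed with backslash escapes; the source itself is ASCII.
source_with_escapes = b'data = b"\\xa4\\xa2"\n'

raw_ok = is_valid_utf8(source_with_raw_bytes)
escaped_ok = is_valid_utf8(source_with_escapes)
```

Here raw_ok is False and escaped_ok is True: the escaped form stays
within a UTF-8-only source file while still describing arbitrary bytes.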

> >[1]  Ie, Python language or character text.  It might be convenient to
> >have an octet-string primitive data type, in which you could put
> >EUC-encoded Japanese or Java byte codes.  
> 
> What's the difference between this and a raw string (a byte sequence) that
> you can translate into any other encoding?

Arbitrary binary data does not have a character set. If the bytes are
character data, they should be stored as a character string (which, in
Python, is a Unicode string).
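
A minimal sketch of the distinction, assuming Python 3, where bytes is
the byte string and str is the character (Unicode) string. The sample
octets are illustrative only.

```python
# A byte string carries no character set; it is just a sequence of octets.
octets = b"\xa4\xa2"

# Only once an encoding is named do the octets become character data.
text = octets.decode("euc_jp")      # interpret the octets as EUC-JP

# A character string can then be re-encoded into any other encoding.
as_utf8 = text.encode("utf-8")

# The byte string has two octets; the character string has one character.
octet_count = len(octets)
char_count = len(text)
```

The same two octets decoded under a different encoding would yield
different characters, which is why the encoding has to be known before
the data can be treated as text.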

Regards,
Martin