unicode by default
Ian Kelly
ian.g.kelly at gmail.com
Wed May 11 18:09:29 EDT 2011
On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777 at charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
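
A minimal sketch of what that rename looks like in practice, assuming a Python 3 interpreter (the variable names are only illustrative):

```python
# Python 3: an ordinary string literal is str (Unicode text);
# a b"..." literal is bytes (raw 8-bit data).
text = "caf\u00e9"            # 4 code points, the last is U+00E9
data = text.encode("utf-8")   # explicit conversion to bytes

print(type(text).__name__)    # str
print(type(data).__name__)    # bytes
print(len(text), len(data))   # 4 5 (U+00E9 takes two bytes in UTF-8)
print(data[0])                # 99: indexing bytes yields an int, not a 1-char string
```

The last line is one example of "stripped-down": bytes behaves like a sequence of small integers rather than a sequence of characters.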
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
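
As a rough illustration of the kind of failure to expect, here is a sketch assuming Python 3. Where Python 2 would have attempted an implicit ASCII decode, Python 3 refuses to mix the two types at all:

```python
text = "caf\u00e9"         # str
raw = b"caf\xc3\xa9"       # the same text, UTF-8 encoded, as bytes

try:
    combined = text + raw  # mixing str and bytes
except TypeError as exc:
    print("TypeError:", exc)

# The fix is to name the encoding at the boundary.
combined = text + raw.decode("utf-8")
print(combined == "caf\u00e9caf\u00e9")  # True
```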
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
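
One way to check a given interpreter, sketched on the assumption that only narrow (UCS-2) and wide (UCS-4) builds are in play. From Python 3.3 onward (PEP 393) the distinction no longer exists and sys.maxunicode is always 1114111:

```python
import sys

# 65535 (0xFFFF) indicates a narrow build, 1114111 (0x10FFFF) a wide one.
print(sys.maxunicode)
is_wide = sys.maxunicode == 0x10FFFF

# A character outside the Basic Multilingual Plane has length 1 on a wide
# build and length 2 (a surrogate pair) on a narrow build.
astral = "\U0001D11E"   # MUSICAL SYMBOL G CLEF
print(len(astral))
```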
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
If you open a file in binary mode, the result is a non-decoded byte stream.
If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
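
A small sketch of the two modes, assuming Python 3; the temporary file and its name are hypothetical, and the encoding is passed explicitly so the result does not depend on the locale:

```python
import locale
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Write text with an explicit encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

# Binary mode: no decoding, read() returns bytes.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: read() returns str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(raw)    # b'caf\xc3\xa9'
print(text == "caf\u00e9")  # True

# What text mode would fall back to if no encoding were given.
print(locale.getpreferredencoding(False))
```

Passing encoding= explicitly is the safer habit, since the locale default can differ from one machine to the next.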
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?
You mean 0x7F (ASCII is a 7-bit code), and probably yes: even pure-ASCII
data still has to be explicitly encoded and decoded when it crosses
between str and bytes.
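
To make the 0x7F boundary concrete, a brief sketch assuming Python 3 (the sample strings are arbitrary):

```python
ascii_text = "hello"   # every code point is <= 0x7F
latin_text = "\u00e9"  # U+00E9 is above 0x7F but below 0xFF

# Pure ASCII text encodes to the same bytes in ASCII, Latin-1 and UTF-8.
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# Above 0x7F the encodings disagree, and ASCII fails outright.
print(latin_text.encode("latin-1"))  # b'\xe9'
print(latin_text.encode("utf-8"))    # b'\xc3\xa9'
try:
    latin_text.encode("ascii")
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError:", exc.reason)

# Even for pure ASCII, str and bytes remain distinct types.
print(ascii_text == b"hello")  # False
```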