Use of Unicode in Python 2.5 source code literals

Uncle Bruce bruce306 at rogers.com
Sun May 3 12:43:27 CEST 2009


I'm working with Python 2.5.4 and the NLTK (Natural Language
Toolkit).  I'm an experienced programmer, but new to Python.

This question arose when I tried to create a literal in my source code
for a Unicode codepoint greater than 255.  (I also posted this
question in the NLTK discussion group).

The Python HELP (at least for version 2.5.4) states:

+++++++
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
++++++++++++
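As a minimal sketch of what that passage describes, assuming the file is really saved as UTF-8 (the declaration only tells the interpreter how to decode the bytes; it does not control how the editor saves them), a unicode literal can hold a character above 255. The names `lam` and `word` are illustrative only.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.

# The u prefix makes a unicode string in Python 2; Python 3.3+ accepts it too.
# In Python 2, a plain "..." literal would hold the encoded bytes instead.
lam = u"λ"       # GREEK SMALL LETTER LAMDA, U+03BB
word = u"αβγ"    # three Greek letters, all above code point 255

print(ord(lam))    # code point of the character: 955
print(len(word))   # counts characters, not bytes: 3
```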

Based on some experimenting I've done, I suspect that the claim of
support for Unicode literals in ANY encoding isn't really accurate.
What seems to happen is that every character in a literal must map to
a single 8-bit value.
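One way to probe this suspicion, sketched under the assumption that the limit comes from the declared encoding rather than from Python itself: latin-1 is a single-byte encoding that covers only code points 0 to 255, so a character such as U+03BB cannot be expressed in it, whereas UTF-8 can represent it. The variable names are hypothetical.

```python
# U+03BB written as an escape, so this snippet itself stays pure ASCII.
lam = u"\u03bb"

# latin-1 maps each byte to one code point in 0..255, so this should fail.
try:
    lam.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# UTF-8 uses a multi-byte sequence for code points above 127.
utf8_bytes = lam.encode("utf-8")

print(fits_latin1)       # False
print(len(utf8_bytes))   # 2
```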

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode characters (e.g. characters copied and pasted
from the Windows Character Map program) to be used in literals, even
in a quoted string, if their ord value is greater than 255.

I noticed, in researching this question, that Marc-André Lemburg
stated, back in 2001, "Since Python source code is defined to be
ASCII..."

I'm writing code for linguistic work on languages other than English,
so I need access to lots more characters.  Most of the time, the
characters come from files, so no problem.  But for some processing
tasks, I simply must be able to use "real" Unicode literals in the
source code.  (Writing hex escape sequences in a complex regex would
be a nightmare.)
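To make the readability concern concrete, here is a hedged sketch comparing the same character class written with literal characters and with hex escapes. It assumes the file is saved as UTF-8 and declared as such; the pattern and sample text are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical pattern: a run of lowercase Greek letters, alpha through omega.
greek_literal = re.compile(u"[α-ω]+", re.UNICODE)
greek_escaped = re.compile(u"[\u03b1-\u03c9]+", re.UNICODE)

text = u"the word λογος appears here"

# Both patterns describe the same range, so they find the same matches.
print(greek_literal.findall(text) == greek_escaped.findall(text))
```

The two patterns are equivalent to the interpreter; the difference is only in how easy the source is to read and check by eye.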

Was this taken care of in the switch from Python 2.X to 3.X?

Is there a way to use Unicode characters with code points above 255
in source code literals in Python 2.5.4?

Also, in the Windows version of Python, how can I tell whether it was
compiled with 16-bit or 32-bit Unicode support?
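One check that may answer this, offered as a sketch rather than a definitive test: `sys.maxunicode` reports the largest code point a single unicode string element can hold. The `build` variable is illustrative. Note that Python 3.3 and later always report 0x10FFFF, so the distinction only matters on older interpreters.

```python
import sys

# 0xFFFF indicates a narrow (16-bit) build; 0x10FFFF indicates a wide build.
if sys.maxunicode > 0xFFFF:
    build = "wide (32-bit)"
else:
    build = "narrow (16-bit)"

print("%s %s" % (hex(sys.maxunicode), build))
```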

Bruce in Toronto


