Use of Unicode in Python 2.5 source code literals
Matt Nordhoff
mnordhoff at mattnordhoff.com
Sun May 3 07:37:35 EDT 2009
Uncle Bruce wrote:
> I'm working with Python 2.5.4 and the NLTK (Natural Language
> Toolkit). I'm an experienced programmer, but new to Python.
>
> This question arose when I tried to create a literal in my source code
> for a Unicode codepoint greater than 255. (I also posted this
> question in the NLTK discussion group).
>
> The Python HELP (at least for version 2.5.4) states:
>
> +++++++
> Python supports writing Unicode literals in any encoding, but you have
> to declare the encoding being used. This is done by including a
> special comment as either the first or second line of the source file:
>
> #!/usr/bin/env python
> # -*- coding: latin-1 -*-
> ++++++++++++
>
> Based on some experimenting I've done, I suspect that the support for
> Unicode literals in ANY encoding isn't really accurate. What seems to
> happen is that there must be an 8-bit mapping between the set of
> Unicode literals and what can be used as literals.
>
> Even when I set Options / General / Default Source Encoding to UTF-8,
> IDLE won't allow Unicode literals (e.g. characters copied and pasted
> from the Windows Character Map program) to be used, even in a quoted
> string, if they represent an ord value greater than 255.
>
> I noticed, in researching this question, that Marc Andre Lemburg
> stated, back in 2001, "Since Python source code is defined to be
> ASCII..."
>
> I'm writing code for linguistics (other than English), so I need
> access to lots more characters. Most of the time, the characters come
> from files, so no problem. But for some processing tasks, I simply
> must be able to use "real" Unicode literals in the source code.
> (Writing hex escape sequences in a complex regex would be a
> nightmare).
>
> Was this taken care of in the switch from Python 2.X to 3.X?
>
> Is there a way to use more than 255 Unicode characters in source code
> literals in Python 2.5.4?
>
> Also, in the Windows version of Python, how can I tell if it was
> compiled to support 16 bits of Unicode or 32 bits of Unicode?
>
> Bruce in Toronto
Works for me:
--- snip ---
$ cat snowman.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import unicodedata
snowman = u'☃'
print len(snowman)
print unicodedata.name(snowman)
$ python2.6 snowman.py
1
SNOWMAN
--- snip ---
What did you set the encoding to in the declaration at the top of the
file? The help text you quoted uses latin-1 as an example, an encoding
which, of course, only supports 256 code points. Did you try utf-8 instead?
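To illustrate the point about latin-1, here is a minimal sketch (not part of the original snowman.py) that checks whether a character can be represented in a given encoding. The helper name can_encode is made up for this example, and the snowman is written as an escape so the snippet does not depend on how the file itself is saved.

```python
# A minimal sketch: which encodings can represent a given character?
# can_encode is a hypothetical helper, not a standard library function.

snowman = u'\u2603'  # the same character as in snowman.py, as an escape


def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(ord(snowman))                      # prints 9731, well above 255
print(can_encode(snowman, 'latin-1'))    # prints False
print(can_encode(snowman, 'utf-8'))      # prints True
print(can_encode(u'\xe9', 'latin-1'))    # prints True (code point 233)
```

If the file declares latin-1, there is no byte sequence in that encoding for a character like this, which is consistent with the behaviour you describe; with a utf-8 declaration the literal can be stored directly.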
The regular expression engine's Unicode support is a different question,
and I do not know the answer.
By the way, Python 2.x only supports non-ASCII characters in source
code inside string literals and comments. Python 3 adds support for
Unicode identifiers (e.g. variable names, function argument names, etc.).
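A rough way to see the identifier difference for yourself is to ask the interpreter to compile a one-line program whose variable name contains a non-ASCII letter. This is only a sketch: the helper name compiles and the '<demo>' filename are made up for illustration, and the source is built from an escape so the snippet itself stays pure ASCII.

```python
# A minimal sketch: does this interpreter accept a non-ASCII identifier?
# compiles is a hypothetical helper; '<demo>' is just a placeholder filename.
import sys

# One line of source code assigning to an identifier that contains U+00E9.
source = u'caf\u00e9 = 1\n'


def compiles(src):
    """Return True if src is accepted by the compiler as a module."""
    try:
        compile(src, '<demo>', 'exec')
    except SyntaxError:
        return False
    return True


print(sys.version_info[0])   # major version of the running interpreter
print(compiles(source))      # should be True on Python 3, False on Python 2
```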