length of unicode strings
Martin v. Loewis
martin at v.loewis.de
Mon Sep 2 01:35:51 EDT 2002
teg at redhat.com (Trond Eivind Glomsrød) writes:
> When running on a utf-8 system, python doesn't seem to take it input
> in unicode:
"it input" is a bit too general: Python applications can certainly
input and output Unicode, and any of its encodings.
But yes, you cannot really use UTF-8 in source code for Unicode
literals.
> >>> b=u"å"
> >>> b
> u'\xc3\xa5'
In Python 2.2, non-ASCII bytes in a Unicode literal are always treated
as Latin-1. PEP 263 fixes this by letting you declare the source
encoding, so that you can write
# -*- coding: utf-8 -*-
b=u"å"
Setting the source encoding in interactive mode is still not supported
by this PEP - it also somewhat depends on the console's encoding.
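To illustrate the Latin-1 behaviour described above, the same two bytes can be decoded both ways and the lengths compared. This is only a sketch: the variable names are made up for the example, the bytes are written as escapes so that nothing depends on the source or console encoding, and it assumes a Python version that accepts b"" literals.

```python
# The UTF-8 encoding of "å" (U+00E5) is the two bytes C3 A5.
utf8_bytes = b"\xc3\xa5"

# Treating each byte as one Latin-1 character gives two characters,
# which is the u'\xc3\xa5' seen in the quoted session.
as_latin1 = utf8_bytes.decode("latin-1")

# Decoding as UTF-8 combines both bytes into the single character U+00E5.
as_utf8 = utf8_bytes.decode("utf-8")

print(len(as_latin1))  # 2
print(len(as_utf8))    # 1
```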
> Any particular things to configure?
No. To use UTF-8 in source code, you have to write
unicode("å", "utf-8")
In a larger source code fragment, this can be written as
def u(s):
    return unicode(s, "utf-8")
b = u("å")
This relies on the fact that non-ASCII bytes in a string literal are
treated as-is; you should still declare the encoding, or else you'll
get a warning in Python 2.3.
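As a rough sketch of why the bytes must really be UTF-8 for such a helper to work: the function below (from_utf8 is a hypothetical name, and it uses the decode method rather than unicode() so that it also runs on newer Pythons) succeeds on the UTF-8 bytes of "å" but raises an error on the Latin-1 byte for the same character.

```python
def from_utf8(raw_bytes):
    # Decode a UTF-8 encoded byte string into a Unicode string.
    return raw_bytes.decode("utf-8")

# UTF-8 bytes for "å": decodes to one character.
text = from_utf8(b"\xc3\xa5")
print(len(text))  # 1

# The Latin-1 byte for "å" (E5) is not valid UTF-8 on its own.
try:
    from_utf8(b"\xe5")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True
```

So if the editor saves the file as Latin-1 instead of UTF-8, the helper fails loudly rather than producing the wrong characters.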
HTH,
Martin