length of unicode strings

Mon Sep 2 01:35:51 EDT 2002

teg at redhat.com (Trond Eivind Glomsrød) writes:

> When running on a utf-8 system, python doesn't seem to take it input
> in unicode:

"it input" is a bit too general: Python applications can certainly
read and write Unicode, in any of its encodings.

But yes, you cannot really use UTF-8 in source code for Unicode
literals.

> >>> b=u"å"
> >>> b
> u'\xc3\xa5'

In Python 2.2, non-ASCII bytes in a Unicode literal are always treated
as Latin-1, so the two UTF-8 bytes of "å" become two separate
characters. This is fixed by PEP 263 (implemented in Python 2.3), which
allows you to write

# -*- coding: utf-8 -*-
b=u"å"

Setting the source encoding in interactive mode is still not supported
by this PEP - it also somewhat depends on the console's encoding.
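
To make the Latin-1 behaviour concrete, here is a small sketch of what
happens to the two bytes a UTF-8 terminal sends for that one character.
The names raw, as_latin1 and as_utf8 are just for illustration, and the
b"" prefix assumes Python 2.6 or later (on 2.2 a plain "\xc3\xa5" string
would play the same role).

```python
# The two bytes that UTF-8 uses for a-ring (U+00E5).
raw = b"\xc3\xa5"

# Roughly what Python 2.2 does with the bytes inside u"...":
# every byte is mapped to one code point (Latin-1).
as_latin1 = raw.decode("latin-1")

# What the poster expected: both bytes together form one code point.
as_utf8 = raw.decode("utf-8")

print(len(as_latin1))  # 2
print(len(as_utf8))    # 1
```

That is why the length comes out as 2 instead of 1 in the session
quoted above.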

> Any particular things to configure? 

No. To use UTF-8 in source code, you have to write

unicode("å", "utf-8")

In a larger source code fragment, this can be written as

def u(s):
    # Decode a UTF-8 byte string into a Unicode object (Python 2).
    return unicode(s, "utf-8")

b = u("å")

This relies on the fact that non-ASCII bytes in a string literal are
treated as-is; you should still declare the encoding, or else you'll
get a warning in Python 2.3.
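
If you would rather not depend on how the editor saves the file, one
option is to spell the UTF-8 bytes out as escapes, so the source stays
pure ASCII. This is only a sketch: it uses s.decode() in place of
unicode(s, "utf-8"), and the b"" prefix again assumes Python 2.6 or
later.

```python
def u(s):
    # Decode a UTF-8 byte string into a Unicode string.
    return s.decode("utf-8")

# "\xc3\xa5" is the UTF-8 encoding of a-ring (U+00E5).
text = u(b"\xc3\xa5")

print(len(text))  # 1
```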

HTH,
Martin