Multibyte Character Support for Python

Sat May 11 03:57:21 EDT 2002

"Stephen J. Turnbull" <stephen at xemacs.org> writes:

> (2) Code that uses identifiers in eval constructs would need to do
>     some horrible thing like
> 
>     exec "print x + y".decode('iso-8859-1').encode('utf-8')

With PEP 263 implemented, the source encoding of identifiers and the
run-time encoding are two different issues. The source does not need
to be in UTF-8.

> Note that in this all-ASCII example it's redundant, but would work.
> Also the PEP 263 mechanism could be extended to give the program an
> "execution locale" and automatically do that conversion.  (Horrible,
> but in the spirit of that PEP.)

Actually, the PEP requires a proper encoding declaration whenever a
byte string is exec'ed. The easiest one would be the UTF-8 signature,
but I'd recommend exec'ing Unicode objects in the first place.
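
To illustrate the three options, here is a minimal sketch. It assumes a
present-day Python 3 interpreter, where str is the Unicode type and
exec() is a function, so it is not the 2002 syntax quoted above. The
variable names are made up for the example.

```python
import codecs

# Option 1: exec a Unicode object; no encoding declaration is involved.
unicode_src = "name = 'caf\u00e9'\n"
ns_unicode = {}
exec(unicode_src, ns_unicode)

# Option 2: exec a byte string that carries a PEP 263 declaration.
latin1_src = "# -*- coding: iso-8859-1 -*-\nname = 'caf\u00e9'\n"
ns_declared = {}
exec(latin1_src.encode("iso-8859-1"), ns_declared)

# Option 3: exec a byte string that starts with the UTF-8 signature.
ns_signature = {}
exec(codecs.BOM_UTF8 + unicode_src.encode("utf-8"), ns_signature)

print(ns_unicode["name"] == ns_declared["name"] == ns_signature["name"])
```

All three namespaces should end up with the same Unicode string; the
byte-string variants only work because each one says how it is encoded.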

> Obviously I prefer the latter interpretation.  I suggest that projects
> that require reliable operation of introspective tools hire someone
> like the martellibot to do coding standard enforcement<wink>.  But the
> "broken" interpretation is also reasonable, and I assume that is the
> one that MvL holds.

This is not an artificial objection: people already complained that
pydoc breaks when confronted with a Unicode doc string. I expect that
even dir() might stop "working", since its result would contain
Unicode objects which then cannot be printed at the interactive
console.

> The basic fact is that Unicode support for strings is already decided.
> I disagree with some implementation decisions (eg, the idea of
> prepending ZERO-WIDTH NO-BREAK SPACE to strings intended to be
> exported in UTF-16 encoding is just insane IMO

That's how UTF-16 is specified. If you don't want the BOM, use
UTF-16LE or UTF-16BE.
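
A short sketch of the difference, assuming a present-day Python 3
interpreter and the codec names "utf-16", "utf-16-le" and "utf-16-be"
from the standard codecs registry:

```python
import codecs

text = "abc"

# The plain "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The endian-specific codecs write no BOM at all.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(has_bom, len(with_bom), len(little), len(big))
```

The BOM accounts for the two extra bytes in the "utf-16" result.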

Regards,
Martin