Unicode question

Fri Jul 18 05:41:23 EDT 2003

Ricardo Bugalho wrote:
> On Fri, 18 Jul 2003 02:07:13 +0200, Gerhard Häring wrote:
>>>Gerhard Häring <gh at ghaering.de> writes:
>>>> >>> u"äöü"
>>>> u'\x84\x94\x81'
>>>> [this works, but IMO shouldn't]
> 
> You can use string literals in any encoding like this:
> 'string in my favorite encoding'.decode('my favorite encoding').
> Note the lack of the u prefix. Not very comfortable, though.
> u'string' ends up doing the same as 'string'.decode('latin1').

Yep. It's the latin1 default that I'm criticizing.
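
To make the problem concrete, here is a minimal sketch in present-day Python 3 syntax (not the 2.3 session quoted above). It assumes the bytes came from a console using cp437, which is a guess based on the byte values; the variable names are illustrative only.

```python
# Bytes that a cp437 console would send for a-umlaut, o-umlaut, u-umlaut
# (assumption: the quoted session ran in such a console).
raw = b"\x84\x94\x81"

# Decoding with an implicit latin1 default maps each byte to the code
# point of the same number, which here yields control characters.
as_latin1 = raw.decode("latin-1")

# Decoding with the encoding the bytes were actually written in
# recovers the intended text.
as_cp437 = raw.decode("cp437")

print(repr(as_latin1))
print(repr(as_cp437))
```

The latin1 result is not an error, which is exactly why the silent default is easy to miss.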

> It doesn't work for docstrings though..
> 
> I'm not sure what you mean by encoding cookie,

See PEP 263 @ http://www.python.org/peps/pep-0263.html

> but I like the idea
> of each source file having some element that defines the encoding used to
> process string literals.

Then you'll like that exactly this is implemented in Python 2.3:

#!/usr/bin/python
# -*- coding: latin1 -*-
...
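
As a rough illustration of what the cookie does, the sketch below compiles a tiny latin1-encoded module from bytes in present-day Python 3. The names `source`, `namespace` and `s` are made up for the example; only the cookie line itself comes from PEP 263.

```python
# Source bytes for a tiny module saved as latin1.
# The byte 0xE4 is a-umlaut in latin1 and is not valid UTF-8 on its own.
source = b"# -*- coding: latin1 -*-\ns = '\xe4'\n"

namespace = {}
# compile() accepts bytes and uses the PEP 263 cookie to decode them
# before parsing the string literal.
exec(compile(source, "<latin1-example>", "exec"), namespace)

s = namespace["s"]
print(repr(s))
```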

> Either that or we define that Python code must be written in UTF-8.

You can do that in Python 2.3 as well. Just save your source file with a
UTF-8 BOM and you don't even have to explicitly define an encoding using
an encoding cookie.
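
For the curious, BOM detection can be sketched with the standard library of present-day Python 3. `tokenize.detect_encoding` is the stdlib helper that applies the PEP 263 rules; the sample source bytes are invented for the example.

```python
import codecs
import io
import tokenize

# A source file saved with a UTF-8 BOM and no encoding cookie.
# "\u00e4" is a-umlaut, written as an escape to keep this example ASCII.
source = codecs.BOM_UTF8 + "s = '\u00e4'\n".encode("utf-8")

# detect_encoding() reads at most two lines via the readline callable
# and reports the encoding implied by the BOM and/or cookie.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Decoding with the reported encoding also strips the BOM.
text = source.decode(encoding)

print(encoding)
print(repr(text))
```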

> But that would break lots of code.. :D

You'll get warnings if you don't define an encoding (either encoding
cookie or BOM) and use 8-bit characters in your source files. These
warnings will become errors in later Python versions.
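
A small sketch of the "error" end of that road, run under present-day Python 3 rather than 2.3. Note one difference that is an assumption relative to this thread: Python 3 assumes UTF-8, not ASCII, when nothing is declared. The variable names are illustrative.

```python
# latin1-encoded source with an 8-bit character but no cookie and no BOM.
undeclared = b"s = '\xe4'\n"

try:
    compile(undeclared, "<undeclared-example>", "exec")
    rejected = False
except SyntaxError:
    # The lone byte 0xE4 is not valid UTF-8, so the source is refused.
    rejected = True

# The same bytes are accepted once the encoding is declared.
declared = b"# -*- coding: latin1 -*-\n" + undeclared
code = compile(declared, "<declared-example>", "exec")

print(rejected)
```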

It's all in the PEP :)

-- Gerhard