[Python-Dev] Python in Unicode context

François Pinard pinard at iro.umontreal.ca
Tue Aug 3 16:53:45 CEST 2004


Hello, people.

I'm switching from ISO-8859-1 to UTF-8 in my locale, knowing it may
take a while before everything gets fully adapted.  Of course, I am
prepared to do whatever that requires.  On my side at least, the
perception of what it requires is an evolving process. :-)

So, my goal here is to share some of the difficulties I see with the
current setup of Python in a Unicode context, under the hypothesis that
Python should ideally be designed to ease the pain of migration.  I
hope this is not out of context on the Python development list.

Converting a Python source file between ISO-8859-1 and UTF-8 at the
charset level is a snap within Vim, and I would like it to be (almost)
a snap in the Python code as well.  There is some amount of trickery I
could resort to for achieving this, but too much trickery does not fit
well with Python's usual elegance.
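At the charset level, the conversion Vim performs really is that
trivial.  A minimal sketch, written with the text/bytes split of
present-day Python (the sample source line is made up):

```python
# A line as it might appear in a module, recoded both ways.
source_text = "print 'Montréal'"
as_latin1 = source_text.encode('iso-8859-1')              # bytes in the old file
as_utf8 = as_latin1.decode('iso-8859-1').encode('utf-8')  # recode to UTF-8
back_again = as_utf8.decode('utf-8').encode('iso-8859-1') # and back

# The two encodings really differ on the wire, yet round-trip losslessly.
assert as_utf8 != as_latin1
assert back_again == as_latin1
```

The bytes differ (`é' is one byte in ISO-8859-1, two in UTF-8), but
nothing is lost going back and forth; the pain is all in what the
Python code itself assumes about its strings.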

As Martin once put it, the ultimate goal is to convert data to Unicode
as early as possible in a Python program, and back to the locale as late
as possible.  While that is quite OK with me, we should not lose sight
of the fact that people might adopt different approaches.
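Martin's rule amounts to keeping the decoding and encoding at the
program's boundaries only.  A sketch of the pattern (the function names
are made up, and `utf-8' stands in for whatever the locale says):

```python
# "Decode early, encode late": bytes become Unicode at the input
# boundary, all processing happens on text, and bytes reappear only
# at the output boundary.  The encoding would normally come from the
# locale; it is hard-wired here for the sketch.
encoding = 'utf-8'

def from_locale(raw):
    """Decode locale bytes to Unicode, as early as possible."""
    return raw.decode(encoding)

def to_locale(text):
    """Encode Unicode back to locale bytes, as late as possible."""
    return text.encode(encoding)

raw = b'Montr\xc3\xa9al'            # UTF-8 bytes from the outside world
text = from_locale(raw)             # everything in between works on text
assert to_locale(text.upper()) == 'MONTRÉAL'.encode(encoding)
```

Under this discipline, the encoding name appears in exactly two places,
which is what makes switching charsets cheap.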

One thing is that a Python module should have some way to know the
encoding used in its source file, maybe some kind of `module.__coding__'
next to `module.__file__', saving the coding actually used while
compilation was going on.  When a Python module is compiled, per PEP
263 as I understand it, the source is logically converted to UTF-8
before scanning, and the resulting str-strings (but not
unicode-strings) are converted back to the original file coding.  When
later, at runtime, a string has to be converted back to Unicode, it
would help if the programmer did not have to hardwire the encoding in
the program, nor edit more than the `coding:' cookie at the beginning
if s/he ever switches file charset.  That same `module.__coding__'
could also be used for other things, for example to decide at run-time
whether codecs StreamWriters should be used or not.
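Lacking such an attribute, the cookie can be fished out of the source
by hand.  A sketch of what `module.__coding__' could mean, using the
PEP 263 rule that the declaration must sit on one of the first two
lines (the helper name is made up):

```python
import re

# The coding declaration recognized by PEP 263, e.g.
#   # -*- coding: utf-8 -*-
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

def source_coding(source_lines, default='ascii'):
    """Return the charset declared in a module's first two lines,
    or `default' when no cookie is present."""
    for line in source_lines[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return default

# A module beginning with the usual shebang plus cookie:
lines = ['#!/usr/bin/env python\n',
         '# -*- coding: utf-8 -*-\n',
         'x = 1\n']
assert source_coding(lines) == 'utf-8'
assert source_coding(['x = 1\n']) == 'ascii'
```

Having the interpreter record this once, at compile time, would spare
every program from re-parsing its own source like this.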

Another solution would of course be to edit all strings, or at least
those containing non-ASCII characters, to prepend a `u' and turn
them into Unicode strings.  This is what I intend to do in practice.
However, all this editing is cumbersome, especially until it is
definitive.  I wonder whether some other cookie, next to the `coding:'
cookie, could be used to declare that all strings _in this module
only_ should be interpreted as Unicode by default, without the need
to resort to a `u' prefix all over.  That would be weaker than the
`-U' switch on a Python call, but likely much more convenient as well.
As a corollary, maybe some `s' prefix could force the `str' type in a
Unicodized module.  Another way of saying it: an unadorned string
would have `s' or `u' implied, depending on whether the Unicode cookie
is missing or given at the start of the module.

I have the intuition, still unverified, but to be confirmed over time
and maybe discussions, that the above would alleviate transition to
Unicode, back and forth.


P.S. - Should I confess it, one thing I do not like much about Unicode
is how proponents often perceive it, like a religion, with all the
fanaticism that goes with it.  Unicode should be seen and implemented
as a choice, more than a life commitment :-).  Right now, my feeling
is that Python asks a bit too much of a programmer, in terms of
commitment, if we only consider the editing work required on sources
to use it, or not.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
