[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Tue Oct 25 02:56:40 CEST 2005

At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
>Bengt Richter wrote:
>> Please bear with me for a few paragraphs ;-)
>
>Please note that source code encoding doesn't really have
>anything to do with the way the interpreter executes the
>program - it's merely a way to tell the parser how to
>convert string literals (currently only the Unicode ones)
>into constant Unicode objects within the program text.
>It's also a nice way to let other people know what kind of
>encoding you used to write your comments ;-)
>
>Nothing more.
I think somehow I didn't make things clear, sorry ;-)
As I tried to show in the example of module_a.cs vs module_b.cs,
the source encoding currently results in two different str-type
strings representing the source _character_ sequence, which is the
_same_ in both cases. To make it clearer, try the following little
program (untested except on NT4 with
Python 2.4b1 (#56, Nov  3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32 ;-):

----< t_srcenc.py >--------------------------------
import os
def test():
    open('module_a.py','wb').write(
        "# -*- coding: latin-1 -*-" + os.linesep +
        "cs = '\xfcber-cool'" + os.linesep)
    open('module_b.py','wb').write(
        "# -*- coding: utf-8 -*-" + os.linesep +
        "cs = '\xc3\xbcber-cool'" + os.linesep)
    # show that we have two modules differing only in encoding:
    print ''.join(line.decode('latin-1') for line in open('module_a.py'))
    print ''.join(line.decode('utf-8') for line in open('module_b.py'))
    # see how results are affected:
    import module_a, module_b
    print module_a.cs + ' =?= ' + module_b.cs
    print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')

if __name__ == '__main__':
    test()
---------------------------------------------------
The result copied from NT4 console to clipboard and pasted into eudora:
__________________________________________________________

[17:39] C:\pywk\python-dev>py24 t_srcenc.py
# -*- coding: latin-1 -*-
cs = 'über-cool'

# -*- coding: utf-8 -*-
cs = 'über-cool'

nber-cool =?= ++ber-cool
über-cool =?= über-cool
__________________________________________________________
(I'd say NT did the best it could, rendering the copied cp437
superscript n as the 'n' above, and the '++' coming from the
cp437 box characters corresponding to the '\xc3\xbc'. Not sure
how it will show on your screen, but try the program to see ;-)
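
For readers without an NT4 console at hand, the mangled third output line
can be reproduced approximately with the sketch below. It is written for
Python 3 (an assumption; the program above is Python 2, where str is a
byte string), and the variable names are made up for illustration. It
encodes the same character sequence both ways and then reads the raw
bytes back as cp437, which is what the console effectively did.

```python
# Python 3 sketch, not the original Python 2 program.
text = '\xfcber-cool'   # u-umlaut followed by "ber-cool"

# The byte strings that module_a.cs and module_b.cs hold under Python 2.
latin1_bytes = text.encode('latin-1')
utf8_bytes = text.encode('utf-8')

# A cp437 console maps every raw byte to one cp437 character.
as_cp437_a = latin1_bytes.decode('cp437')
as_cp437_b = utf8_bytes.decode('cp437')

# ascii() keeps the output readable on any terminal encoding.
print(ascii(latin1_bytes), '->', ascii(as_cp437_a))
print(ascii(utf8_bytes), '->', ascii(as_cp437_b))
```

Byte 0xfc is the superscript n (U+207F) in cp437, and 0xc3 0xbc are two
box-drawing characters (U+251C, U+255D), matching the 'n' and '++' seen
after the clipboard round trip.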

>Once a module is compiled, there's no distinction between
>a module using the latin-1 source code encoding or one using
>the utf-8 encoding.
ISTM module_a.cs and module_b.cs can readily be distinguished after
compilation, whereas the sources displayed according to their declared
encodings as above (or as e.g. different editors using different native
encodings might) cannot (other than by the encoding cookie itself) ;-)
Perhaps you meant something else?
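
A minimal check of that distinguishability, again as a Python 3 sketch
in which bytes objects stand in for the Python 2 str values of
module_a.cs and module_b.cs (the names cs_a and cs_b are hypothetical):

```python
# Python 3 sketch: bytes objects play the role of Python 2 str values.
cs_a = '\xfcber-cool'.encode('latin-1')   # stands in for module_a.cs
cs_b = '\xfcber-cool'.encode('utf-8')     # stands in for module_b.cs

# The compiled constants differ in both content and length ...
print(cs_a == cs_b)
print(len(cs_a), len(cs_b))

# ... yet they decode to the same character sequence once each is
# interpreted with the encoding its source file declared.
print(cs_a.decode('latin-1') == cs_b.decode('utf-8'))
```

The two values compare unequal as byte strings and only agree after
decoding, which requires knowing which cookie each module carried.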

>Thanks,
You're welcome.

Regards,
Bengt Richter