[Python-Dev] Divorcing str and unicode (no more implicit conversions).
mal at egenix.com
Tue Oct 25 12:18:50 CEST 2005
Bengt Richter wrote:
> At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
>>Bengt Richter wrote:
>>>Please bear with me for a few paragraphs ;-)
>>Please note that source code encoding doesn't really have
>>anything to do with the way the interpreter executes the
>>program - it's merely a way to tell the parser how to
>>convert string literals (currently only the Unicode ones)
>>into constant Unicode objects within the program text.
>>It's also a nice way to let other people know what kind of
>>encoding you used to write your comments ;-)
> I think somehow I didn't make things clear, sorry ;-)
> As I tried to show in the example of module_a.cs vs module_b.cs,
> the source encoding currently results in two different str-type
> strings representing the source _character_ sequence, which is the
> _same_ in both cases.
I don't follow you here. The source code encoding
is only applied to Unicode literals (you are using string
literals in your example). String literals are passed
through as-is.
Whether or not your editor will use the source
code encoding marker is really up to your editor
and not within the scope of Python.
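As a rough illustration of the difference between byte strings and
Unicode text, here is a sketch in Python 3 terms, where bytes plays the
role of the 2.x str type and str the role of unicode. The variable
names are made up for this example and are not from the original thread.

```python
# Python 3 sketch: bytes stands in for the 2.x str type, str for unicode.
text = "\u00fcber-cool"  # the character sequence as an editor would show it

# The bytes each source file would contain for that same character sequence.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

# The byte strings differ ...
print(as_latin1)             # b'\xfcber-cool'
print(as_utf8)               # b'\xc3\xbcber-cool'
print(as_latin1 == as_utf8)  # False

# ... but decoding each with its own encoding gives the same Unicode text.
print(as_latin1.decode("latin-1") == as_utf8.decode("utf-8"))  # True
```

A byte string that is passed through unchanged keeps whatever bytes the
file held, so only the decoded forms compare equal.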
If you open the two module files in Emacs, you'll
see identical renderings of the string literals.
With other editors, you may have to explicitly tell
the editor which encoding to assume. Ditto for shell
> To make it more clear, try the following little
> program (untested except on NT4 with
> Python 2.4b1 (#56, Nov 3 2004, 01:47:27)
> [GCC 3.2.3 (mingw special 20030504-1)] on win32 ;-):
> ----< t_srcenc.py >--------------------------------
> import os
> def test():
>     open('module_a.py', 'wb').write(
>         "# -*- coding: latin-1 -*-" + os.linesep +
>         "cs = '\xfcber-cool'" + os.linesep)
>     open('module_b.py', 'wb').write(
>         "# -*- coding: utf-8 -*-" + os.linesep +
>         "cs = '\xc3\xbcber-cool'" + os.linesep)
>     # show that we have two modules differing only in encoding:
>     print ''.join(line.decode('latin-1') for line in open('module_a.py'))
>     print ''.join(line.decode('utf-8') for line in open('module_b.py'))
>     # see how results are affected:
>     import module_a, module_b
>     print module_a.cs + ' =?= ' + module_b.cs
>     print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')
> if __name__ == '__main__':
>     test()
> The result copied from NT4 console to clipboard and pasted into eudora:
> [17:39] C:\pywk\python-dev>py24 t_srcenc.py
> # -*- coding: latin-1 -*-
> cs = 'über-cool'
> # -*- coding: utf-8 -*-
> cs = 'über-cool'
> nber-cool =?= ++ber-cool
> über-cool =?= über-cool
> (I'd say NT did the best it could, rendering the copied cp437
> superscript n as the 'n' above, and the '++' coming from the
> cp437 box characters corresponding to the '\xc3\xbc'. Not sure
> how it will show on your screen, but try the program to see ;-)
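The console rendering described in the quoted text can be reproduced
with a small sketch, assuming the console interprets the raw bytes as
cp437. It is written for Python 3, and the variable names are
illustrative only.

```python
# The raw bytes held by the two 2.x str objects in the quoted example.
latin1_bytes = b"\xfcber-cool"
utf8_bytes = b"\xc3\xbcber-cool"

# What a cp437 console would make of those bytes.
shown_a = latin1_bytes.decode("cp437")
shown_b = utf8_bytes.decode("cp437")

print(ascii(shown_a))  # '\u207fber-cool'       (superscript n)
print(ascii(shown_b))  # '\u251c\u255dber-cool' (two box-drawing characters)
```

Byte 0xFC is the superscript n in cp437, and 0xC3 0xBC are two
box-drawing characters, which matches the "nber-cool" and "++ber-cool"
approximations in the pasted output.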
>>Once a module is compiled, there's no distinction between
>>a module using the latin-1 source code encoding or one using
>>the utf-8 encoding.
> ISTM module_a.cs and module_b.cs can readily be distinguished after
> compilation, whereas the sources displayed according to their declared
> encodings as above (or as e.g. different editors using different native
> encoding might) cannot (other than the encoding cookie itself) ;-)
> Perhaps you meant something else?
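For the Unicode-literal case, the claim about compiled modules can be
checked with a small sketch. It assumes Python 3, where every string
literal is a Unicode literal and compile() honors the coding cookie in
a bytes source. load_cs is a hypothetical helper written for this
example. The sketch does not cover 2.x str literals, which keep the raw
source bytes.

```python
def load_cs(source_bytes):
    """Compile and run module source given as raw bytes; return its 'cs' value."""
    namespace = {}
    code = compile(source_bytes, "<sketch>", "exec")
    exec(code, namespace)
    return namespace["cs"]

# Same character sequence, two source encodings, as in the quoted example.
src_latin1 = b"# -*- coding: latin-1 -*-\ncs = '\xfcber-cool'\n"
src_utf8 = b"# -*- coding: utf-8 -*-\ncs = '\xc3\xbcber-cool'\n"

cs_a = load_cs(src_latin1)
cs_b = load_cs(src_utf8)

print(src_latin1 == src_utf8)  # False: the files differ byte for byte
print(cs_a == cs_b)            # True: the compiled Unicode constants are equal
```

Once the literal has been decoded into a Unicode constant, nothing in
the resulting object records which source encoding was used.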
What your editor displays to you is not within the scope
of Python, e.g. if you open the files in Emacs you'll see
something different than in Notepad.
I guess that's the price you have to pay for being able to write
programs that can include Unicode literals using the complete range
of possible Unicode characters without having to revert to escape sequences.