[Python-Dev] Unicode source code

M.-A. Lemburg mal@lemburg.com
Sun, 09 Feb 2003 17:39:59 +0100


Just van Rossum wrote:
> M.-A. Lemburg wrote:
> 
> 
>>Just van Rossum wrote:
>>
>>>Now that PEP 263 is in place (yet hotly debated on c.l.py ;-),
>>>wouldn't it be a fairly small step to fully support unicode strings
>>>in compile(), eval() and exec? I notice these still attempt to
>>>convert unicode to 8 bit with the default encoding, which isn't
>>>very useful.
>>
>>Patches are most welcome.
> 
> Some guidance on where to look is more than welcome.

The tokenizer/compiler works as follows (quote from another
email):

"""
source code using encoding ENC
  -> via codec for ENC into Unicode
  -> via UTF-8 codec into UTF-8 string
  -> tokenizer
  -> compiler
       for 8-bit string literals in the source code
       -> UTF-8 string is converted back into encoding ENC

Provided that the encoding ENC is roundtrip safe
for all 256 base character ordinals, 8-bit strings
will turn out as-is in the compiled byte code.
"""
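
To make the chain concrete, here is a rough sketch of the same recoding
steps, written in present-day Python 3 rather than the C code the
tokenizer actually uses. The function names are made up for
illustration and do not correspond to anything in the interpreter.

```python
# Illustrative only: mimics the recoding steps quoted above.
# The real work happens in C inside the tokenizer and compiler;
# all function names here are hypothetical.

def recode_for_tokenizer(source_bytes, enc):
    """Decode source bytes from ENC, then re-encode them as UTF-8."""
    text = source_bytes.decode(enc)   # ENC -> Unicode
    return text.encode("utf-8")       # Unicode -> UTF-8 string


def restore_8bit_literal(utf8_literal, enc):
    """Convert the UTF-8 bytes of a string literal back into ENC."""
    return utf8_literal.decode("utf-8").encode(enc)


def is_roundtrip_safe(enc):
    """Check that all 256 byte values survive ENC -> Unicode -> ENC."""
    all_bytes = bytes(range(256))
    try:
        return all_bytes.decode(enc).encode(enc) == all_bytes
    except UnicodeError:
        # Some byte value is undefined in ENC, so it cannot round-trip.
        return False
```

With an encoding such as latin-1 every byte value round-trips, so an
8-bit literal comes back unchanged after the detour through UTF-8.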

Now, to accept Unicode it would probably be worthwhile hooking
into this chain at step 2 (the Unicode -> UTF-8 conversion) rather
than step 1 (the code for the tokenizer is in Parser/tokenizer.c,
the compiler code in Python/compile.c). However, this is difficult
because most APIs for compiling code are built on char* buffers.

A short-term solution would probably be to convert Unicode to
UTF-8 and prepend a UTF-8 BOM so that the tokenizer knows that
it is getting UTF-8. I haven't tested this, though.
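
The conversion itself would be tiny. A minimal sketch, again in
present-day Python 3 and not the C-level change being discussed; the
helper name is hypothetical, and whether the tokenizer of the time
accepts the result is exactly the untested part:

```python
import codecs


def unicode_to_tokenizer_input(source):
    """Encode a Unicode source string as UTF-8 with a leading BOM.

    Hypothetical helper: the BOM tells the consumer that the
    bytes that follow are UTF-8.
    """
    return codecs.BOM_UTF8 + source.encode("utf-8")
```

The result is an ordinary 8-bit string, so it fits the existing
char*-based compile APIs without changing their signatures.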

A slightly better solution (on narrow Unicode Python builds)
would be to use UTF-16 for this. The UTF-16 support in the
tokenizer would have to be enabled for this, though. It is
currently disabled for some reason I don't remember. Martin
should know... but he's on vacation.

-- 
Marc-Andre Lemburg
eGenix.com
