[PythonCE] Unicode default encoding

Thu Mar 2 11:19:49 CET 2006

Just to add my 2 or 3 cents.

Not all strings in Python are Unicode.

Python has a StringType and a UnicodeType.
If you want to get a unicode string you have to write u"my test" instead of "my test".
But in principle:  u"my test" == "my test".decode("utf-8")  <-- which codec is correct depends on the source encoding
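
A minimal sketch of that equivalence, in case it helps. The b"" prefix is my
assumption so that the snippet also runs on newer interpreters (it needs
Python 2.6 or later; on older versions a plain "my test" is already a byte
string). The sample city name is only illustrative.

```python
# A byte string (StringType in Python 2) versus a unicode object.
byte_string = b"my test"                 # raw bytes
text = byte_string.decode("utf-8")       # unicode object

# For pure ASCII content the decoded bytes equal the unicode literal.
same_text = (text == u"my test")

# For non-ASCII content the bytes differ per encoding, so you must
# decode with the encoding the bytes were actually written in.
utf8_bytes = b"N\xc3\xbcrtingen"         # u-umlaut stored as UTF-8
latin1_bytes = b"N\xfcrtingen"           # u-umlaut stored as ISO-8859-1
same_city = (utf8_bytes.decode("utf-8") ==
             latin1_bytes.decode("iso-8859-1"))
```

Both comparisons hold only because each byte string is decoded with its own
encoding; swapping the two codecs would raise an error or give a different
string.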

As has already been suggested, you should not depend on the default encoding of the 
operating system. Better to query it explicitly with sys.getdefaultencoding() and store the 
encoding you use for communication with the console etc. (this is also important for 
wxWindows in non-Unicode mode).
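
One way this could look, as a rough sketch only. The helper names
console_encoding and to_console_bytes are hypothetical, and looking at
sys.stdout.encoding for the console is my assumption, not something from the
thread; it can be None when output is redirected, hence the fallback.

```python
import sys

def console_encoding(fallback="utf-8"):
    """Return the encoding to use for console output (hypothetical helper)."""
    # sys.stdout.encoding may be missing or None, e.g. when redirected.
    encoding = getattr(sys.stdout, "encoding", None)
    return encoding or fallback

# Query once at startup and keep the results around.
DEFAULT_ENCODING = sys.getdefaultencoding()
CONSOLE_ENCODING = console_encoding()

def to_console_bytes(text, encoding=CONSOLE_ENCODING):
    """Encode a unicode object for the console (hypothetical helper)."""
    # "replace" substitutes "?" for characters the console cannot show
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")
```

Storing the encoding once means every write to the console goes through the
same explicit conversion instead of an implicit default.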

Additionally, take care of the source encoding of your files by adding a specific header to them.
See: http://www.python.org/peps/pep-0263.html
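
For illustration, a file with such a header might start like this. This is
only a sketch; it assumes the file really is saved as UTF-8 by your editor,
and the variable name is made up.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration. It has to appear on the
# first or second line of the file and must match the encoding the
# editor actually used when saving.

# With the declaration in place, non-ASCII characters in unicode
# literals are decoded from the source file correctly.
greeting = u"Grüße"

# Five characters as text, seven bytes once encoded as UTF-8.
greeting_length = len(greeting)
greeting_bytes = greeting.encode("utf-8")
```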

Finally, I recommend the following tutorial, as the ReportLab guys really know their stuff.
http://www.reportlab.com/i18n/python_unicode_tutorial.html

Kind Regards

Jan Ischebeck

------------------------------------------------------------------------------------

P3 GmbH - Ingenieurgesellschaft für Management und Organisation

Jan Ischebeck
Senior Consultant

Nürtinger Straße 9
70794 Filderstadt-Bernhausen

phone: +49 - (0)163 / 75 33 613
fax: +49 - (0)163 / 99 75 33 613
e-mail: jan.ischebeck at p3-gmbh.de
web: www.p3-gmbh.de

-----Original Message-----
From: pythonce-bounces at python.org on behalf of Fuzzyman
Sent: Thu 02-Mar-06 18:35
Cc: pythonce at python.org
Subject: Re: [PythonCE] Unicode default encoding

Jeffrey Barish wrote:

>>Luke Dunstan wrote:
>>
>>>----- Original Message ----- 
>>>From: "Jeffrey Barish" <jeff_barish at earthlink.net>
>>>To: <pythonce at python.org>
>>>Sent: Friday, February 24, 2006 11:03 AM
>>>Subject: [PythonCE] Unicode default encoding
>>>
>>>>What is the correct way to set PythonCE's default Unicode encoding?  My
>>>>reading (Python in a Nutshell) indicates that I am supposed to make a
>>>>change to site.py, but there doesn't seem to be a site.py in
>>>>PythonCE.  (The closest I came is a site.pyc in python23.zip.)  Nutshell
>>>>suggests that in desperation one could put the following at the start of
>>>>the main script:
>>>>
>>>>import sys
>>>>reload(sys)
>>>>sys.setdefaultencoding('iso-8859-15')
>>>>del sys.setdefaultencoding
>>>>
>>>>This code solved the problem I was having reading and processing text that
>>>>contains Unicode characters, but I am uncomfortable leaving a desperation
>>>>solution in place.
>>>>
>>>I don't think modifying site.py would be a good solution, because if you
>>>upgrade or reinstall python then the script will be overwritten. If you
>>>only want to run your program on your own system then a better solution is
>>>to create a file sitecustomize.py in your Python\Lib directory containing
>>>this:
>>>
>>>import sys
>>>sys.setdefaultencoding('iso-8859-15')
>>>
>>>If you want to distribute your program to other people though, you can't 
>>>expect them to change their default encoding so it is better not to rely on 
>>>the default encoding at all.
>>>
>>Yep, using unicode and explicitly encoding/decoding is a better approach.
>>
>>Fuzzyman
>
>Once again, I am forced to display my ignorance.  Sorry guys.  I really don't 
>know much about Unicode.  The solution that Luke suggested (sitecustomize.py 
>in my Python\Lib directory) works fine for me, but I am concerned about the 
>suggestion from him and Fuzzyman that explicit encoding/decoding is a better 
>approach.  What is explicit encoding/decoding?  Can someone point me to a 
>good resource for learning how to deal with Unicode correctly?
>
Unicode, and text encodings in general, have a bit of a learning curve.
Once you get your head round it, Python makes it pretty straightforward.

Simple rules:

* In Python, text *really* means a unicode string
* Because ordinary strings are really just strings of bytes
* If you know the encoding, decode the byte string to turn it into unicode
* When writing or printing, encode it to turn it back into bytes
* If you don't know the encoding then you had better pray that whatever it
is, is encoded in the system default. ;-)

byte_string = open(filename, 'rb').read()  # read the file as raw bytes
text = byte_string.decode('utf_8')         # we know it is UTF-8, so we decode to unicode
# ....code that uses the text
byte_string = text.encode('utf_8')         # we encode it back to UTF-8
open(filename, 'wb').write(byte_string)    # so we can write it back out
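
The same read/decode/encode/write steps could be wrapped in small helpers,
roughly like the sketch below. The function names read_text and write_text
are made up for illustration, and the demo uses a temporary file with
ISO-8859-15 content (the encoding from the original question) that gets
rewritten as UTF-8.

```python
import os
import tempfile

def read_text(path, encoding):
    """Read raw bytes from a file and decode them to unicode."""
    f = open(path, "rb")
    try:
        return f.read().decode(encoding)
    finally:
        f.close()

def write_text(path, text, encoding):
    """Encode unicode text and write the resulting bytes to a file."""
    f = open(path, "wb")
    try:
        f.write(text.encode(encoding))
    finally:
        f.close()

def read_bytes(path):
    """Return the raw, undecoded content of a file."""
    f = open(path, "rb")
    try:
        return f.read()
    finally:
        f.close()

# Demo: store text as ISO-8859-15, then convert the file to UTF-8.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
try:
    write_text(demo_path, u"price: 5 \u20ac", "iso-8859-15")
    text = read_text(demo_path, "iso-8859-15")   # bytes -> unicode
    write_text(demo_path, text, "utf-8")         # unicode -> bytes
    raw = read_bytes(demo_path)
finally:
    os.remove(demo_path)
```

Files are opened in binary mode so that the bytes reach decode() untouched;
the encoding is always passed in explicitly and never guessed.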

Decoding turns a byte string into a unicode object.
Encoding turns a unicode object into a byte string.
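
If the encoding is not known for sure, one possible approach is to try a
short list of candidates in order. This is only a sketch: the helper name
decode_with_fallback and the candidate list are my assumptions, and a decode
that succeeds is no proof that the guess was right (ISO-8859-15, for
example, accepts every byte value).

```python
def decode_with_fallback(byte_string, encodings=("utf-8", "iso-8859-15")):
    """Try each candidate encoding in turn (hypothetical helper).

    Returns a (text, encoding) pair for the first codec that decodes
    the bytes without error.
    """
    for encoding in encodings:
        try:
            return byte_string.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings matched")

# Valid UTF-8 is accepted by the first candidate.
text_a, used_a = decode_with_fallback(b"caf\xc3\xa9")

# A lone 0xE9 byte is not valid UTF-8, so the second candidate is used.
text_b, used_b = decode_with_fallback(b"caf\xe9")

# Both inputs end up as the same unicode text, which can then be
# encoded back to bytes in whatever encoding the output needs.
utf8_again = text_b.encode("utf-8")
```

Strict encodings such as UTF-8 should come first in the list, because a
permissive single-byte codec would otherwise accept everything.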

If this still confuses you (which it probably does) then there are lots
of good resources. I happen to like :

    http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html

Which seems to be down at the moment. :-(

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml