[PythonCE] Unicode default encoding

Fuzzyman fuzzyman at voidspace.org.uk
Thu Mar 2 11:42:35 CET 2006


Ischebeck, Jan wrote:

>Just to add my 2 or 3 cents.
>
>Not all Strings in Python are Unicode.
>
Sure, but if you want to be certain that you are handling characters
correctly you either *ought* to use unicode or be certain that your
'text' is either ascii or in the encoding of any streams you use.

Note that the system default encoding (the one used by streams such as
sys.stdout) is separate from (and usually different to) the Python default
encoding. On a normal Windows platform the *system* default will usually
be cp1252 or cp850.
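
If you want to see the difference on your own machine, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the Python 2.3 that PythonCE ships, and the variable names are only illustrative.

```python
import locale
import sys

# Encoding Python itself falls back on for implicit conversions.
python_default = sys.getdefaultencoding()

# Encoding attached to the stdout stream; this can be None when stdout
# has been redirected or replaced by something that is not a real console.
stream_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the operating system / locale prefers for text data.
system_preferred = locale.getpreferredencoding(False)

print("Python default :", python_default)
print("stdout encoding:", stream_encoding)
print("system locale  :", system_preferred)
```

The three values need not agree, which is exactly why relying on any one of them implicitly is fragile.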

For a *very* good introduction to unicode in general, read :

    http://www.joelonsoftware.com/articles/Unicode.html

"The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)"

>Python has a StringType and a UnicodeType.
>If you want to get a unicode string you have to write u"my test" instead of "my test".
>But in principle:  u"my test" = "my test".decode("utf-8").  <-- depends on source encoding
>
That's for using unicode string literals within your source code.

To get a unicode string from an 'external source' (i.e. a file) you use
the string decode method and supply the encoding.
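
As a rough sketch of that, assuming a current Python 3 interpreter (where the byte string type is called bytes and the unicode type is called str), reading and decoding a file could look like this. The helper name read_text and the temporary demo file are made up for the example.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read raw bytes from path and decode them with the given encoding."""
    with open(path, "rb") as handle:   # binary mode: bytes come back untouched
        raw = handle.read()
    return raw.decode(encoding)        # byte string -> unicode text


# Demonstration: a temporary file holding ISO-8859-1 encoded bytes.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"caf\xe9")           # 'cafe' with an e-acute, one byte per character

text = read_text(demo_path, "iso-8859-1")
os.remove(demo_path)
print(repr(text))
```

The important point is that the encoding is passed in explicitly instead of being left to whatever default happens to be in force.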

If you aren't certain of the encoding, you might find the following (on
guessing encodings) useful :

    http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml
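
One simple approach (not necessarily the one the article above describes) is to try a short list of likely encodings in order and keep the first that decodes cleanly. The sketch below assumes Python 3; the function name guess_decode and the candidate list are my own choices for illustration.

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never
    # fails, but the text may be nonsense if the guess is wrong.
    return raw.decode("latin-1"), "latin-1"


print(guess_decode(b"plain text"))
print(guess_decode(b"caf\xc3\xa9"))
print(guess_decode(b"caf\xe9"))
```

A successful decode only proves the bytes are *valid* in that encoding, not that it is the right one, so treat the result as a guess.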

>As has already been suggested, you should not depend on the default encoding of the 
>operating system, so better use sys.getdefaultencoding() and store the encoding for 
>communication with the console etc. (also important for wxwindows in non-unicode mode)
>
>Additionally, take care of the source encoding of your files by adding a specific header to them:
>See: http://www.python.org/peps/pep-0263.html
>
>Finally, I recommend the following tutorial, as the ReportLab guys really know their stuff.
>http://www.reportlab.com/i18n/python_unicode_tutorial.html
>
>  
>
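
For anyone who hasn't seen the PEP 263 header, here is a small sketch of what it looks like. It assumes the file really is saved as UTF-8; the variable names are just for the example.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration; it must be on line 1 or 2 of the file.

greeting = u"Gr\u00fc\u00dfe"   # escapes are safe whatever the source encoding is
literal = u"Grüße"              # this literal relies on the declared encoding

print(greeting == literal)
```

If the declared encoding and the encoding the editor actually saved the file in disagree, the second literal is the one that goes wrong.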

Cool. There are lots of good resources, and it's a subject where it is
well worth getting your head round the basics.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

>Kind Regards
>
>Jan Ischebeck
>
>------------------------------------------------------------------------------------
>
>P3 GmbH - Ingenieurgesellschaft für Management und Organisation
>
>Jan Ischebeck
>Senior Consultant
>
>Nürtinger Straße 9
>70794 Filderstadt-Bernhausen
>
>phone: +49 - (0)163 / 75 33 613
>fax: +49 - (0)163 / 99 75 33 613
>e-mail: jan.ischebeck at p3-gmbh.de
>web: www.p3-gmbh.de
>
>
>
>-----Original Message-----
>From: pythonce-bounces at python.org on behalf of Fuzzyman
>Sent: Thu 02-Mar-06 18:35
>Cc: pythonce at python.org
>Subject: Re: [PythonCE] Unicode default encoding
> 
>Jeffrey Barish wrote:
>
>>>Luke Dunstan wrote:
>>>   
>>>>----- Original Message ----- 
>>>>From: "Jeffrey Barish" <jeff_barish at earthlink.net>
>>>>To: <pythonce at python.org>
>>>>Sent: Friday, February 24, 2006 11:03 AM
>>>>Subject: [PythonCE] Unicode default encoding
>>>> 
>>>>>What is the correct way to set PythonCE's default Unicode encoding?  My
>>>>>reading (Python in a Nutshell) indicates that I am supposed to make a 
>>>>>change to site.py, but there doesn't seem to be a site.py in
>>>>>PythonCE.  (The closest I came is a site.pyc in python23.zip.)  Nutshell
>>>>>suggests that in desperation one could put the following at the start of
>>>>>the main script:   
>>>>>
>>>>>import sys
>>>>>reload(sys)
>>>>>sys.setdefaultencoding('iso-8859-15')
>>>>>del sys.setdefaultencoding
>>>>>
>>>>>This code solved the problem I was having reading and processing text that
>>>>>contains Unicode characters, but I am uncomfortable leaving a desperation
>>>>>solution in place.
>>>>>
>>>>I don't think modifying site.py would be a good solution, because if you 
>>>>upgrade or reinstall python then the script will be overwritten. If you
>>>>only want to run your program on your own system then a better solution is
>>>>to create a file sitecustomize.py in your Python\Lib directory containing
>>>>this: 
>>>>
>>>>import sys
>>>>sys.setdefaultencoding('iso-8859-15')
>>>>
>>>>If you want to distribute your program to other people though, you can't 
>>>>expect them to change their default encoding so it is better not to rely on 
>>>>the default encoding at all.
>>>>
>>>Yep, using unicode and explicitly encoding/decoding is a better approach.
>>>
>>>Fuzzyman
>>>   
>>Once again, I am forced to display my ignorance.  Sorry guys.  I really don't 
>>know much about Unicode.  The solution that Luke suggested (sitecustomize.py 
>>in my Python\Lib directory) works fine for me, but I am concerned about the 
>>suggestion from him and Fuzzyman that explicit encoding/decoding is a better 
>>approach.  What is explicit encoding/decoding?  Can someone point me to a 
>>good resource for learning how to deal with Unicode correctly?
>> 
>Unicode, and text encodings in general, is a bit of a learning curve.
>Once you get your head round it, Python makes it pretty straightforward.
>
>Simple rules :
>
>* In Python text *really* means a unicode string
>* Because ordinary strings are really just strings of bytes
>* If you know the encoding, decode it to turn it into unicode
>* When writing or printing, encode it to turn it back into bytes
>* If you don't know the encoding then you'd better pray that whatever it
>is, it is encoded in the system default. ;-)
>
>byte_string = open(filename).read()    # read a file
>text = byte_string.decode('utf_8')     # we know it is UTF8, so we decode to unicode
># ....code that uses the text
>byte_string = text.encode('utf_8')     # we encode it back to UTF8
>open(filename, 'w').write(byte_string) # so we can write it back out
>
>Decoding turns a byte string into a unicode object.
>Encoding turns a unicode object into a byte string.
>
>If this still confuses you (which it probably does) then there are lots
>of good resources. I happen to like :
>
>    http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html
>
>Which seems to be down at the moment. :-(
>
>All the best,
>
>Fuzzyman
>http://www.voidspace.org.uk/python/index.shtml
>_______________________________________________
>PythonCE mailing list
>PythonCE at python.org
>http://mail.python.org/mailman/listinfo/pythonce
>


