[PythonCE] Unicode default encoding
fuzzyman at voidspace.org.uk
Thu Mar 2 10:35:26 CET 2006
Jeffrey Barish wrote:
>>Luke Dunstan wrote:
>>>----- Original Message -----
>>>From: "Jeffrey Barish" <jeff_barish at earthlink.net>
>>>To: <pythonce at python.org>
>>>Sent: Friday, February 24, 2006 11:03 AM
>>>Subject: [PythonCE] Unicode default encoding
>>>>What is the correct way to set PythonCE's default Unicode encoding? My
>>>>reading (Python in a Nutshell) indicates that I am supposed to make a
>>>>change to site.py, but there doesn't seem to be a site.py in
>>>>PythonCE. (The closest I came is a site.pyc in python23.zip.) Nutshell
>>>>suggests that in desperation one could put the following at the start of
>>>>the main script:
>>>>This code solved the problem I was having reading and processing text that
>>>>contains Unicode characters, but I am uncomfortable leaving a desperation
>>>>solution in place.
>>>I don't think modifying site.py would be a good solution, because if you
>>>upgrade or reinstall python then the script will be overwritten. If you
>>>only want to run your program on your own system then a better solution is
>>>to create a file sitecustomize.py in your Python\Lib directory containing
>>>If you want to distribute your program to other people though, you can't
>>>expect them to change their default encoding so it is better not to rely on
>>>the default encoding at all.
>>Yep, using unicode and explicitly encoding/decoding is a better approach.
>Once again, I am forced to display my ignorance. Sorry guys. I really don't
>know much about Unicode. The solution that Luke suggested (sitecustomize.py
>in my Python\Lib directory) works fine for me, but I am concerned about the
>suggestion from him and Fuzzyman that explicit encoding/decoding is a better
>approach. What is explicit encoding/decoding? Can someone point me to a
>good resource for learning how to deal with Unicode correctly?
Unicode, and text encodings in general, is a bit of a learning curve.
Once you get your head round it, Python makes it pretty straightforward.
Simple rules :
* In Python text *really* means a unicode string
* Because ordinary strings are really just strings of bytes
* If you know the encoding, decode it to turn it into unicode
* When writing or printing, encode it to turn it back into bytes
* If you don't know the encoding then you'd better pray that whatever it
is is encoded in the system default encoding. ;-)
byte_string = open(filename, 'rb').read() # read the raw bytes of a file
text = byte_string.decode('utf_8') # we know it is UTF8, so we decode
# ....code that uses the text
byte_string = text.encode('utf_8') # we encode it back to UTF8
open(filename, 'wb').write(byte_string) # so we can write it back out
Decoding turns a byte string into a unicode object.
Encoding turns a unicode object into a byte string.
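To make that a bit more concrete, here is a small sketch of the round trip, and of what can go wrong if you guess the encoding. The sample text and the variable names are made up for illustration, and the two codecs (UTF-8 and Latin-1) are just convenient examples:

```python
text = u'caf\xe9'                      # unicode text: c, a, f, e-acute (4 characters)

utf8_bytes = text.encode('utf_8')      # encode: unicode -> bytes (5 bytes in UTF-8)
latin1_bytes = text.encode('latin_1')  # same text, different bytes (4 bytes in Latin-1)

# Decoding with the same encoding that produced the bytes gets the text back.
round_trip = utf8_bytes.decode('utf_8')

# Decoding with the wrong encoding can fail outright...
try:
    latin1_bytes.decode('utf_8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# ...or it can "succeed" and quietly hand you the wrong characters.
garbled = utf8_bytes.decode('latin_1')
```

The second failure is the nastier one, because nothing tells you that it happened. That is why it pays to decode once, explicitly, at the point where the bytes come in.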
If this still confuses you (which it probably does) then there are lots
of good resources. I happen to like :
Which seems to be down at the moment. :-(
All the best,