Unicode again ... default codec ...

Tue Oct 20 23:50:57 EDT 2009

On Tue, 20 Oct 2009 17:13:52 -0300, Stef Mientki <stef.mientki at gmail.com>
wrote:

> From the thread "how to write a unicode string to a file ?"
> and my specific situation:
>
> - reading data from Excel, Delphi and other Windows programs and unicode
>   Python
> - using wxPython, which forces unicode
> - writing to Excel and other Windows programs
>
> almost all answers pointed to the following solution:
> - in the Python program, turn every string into unicode as soon as
>   possible
> - in Python all processing is done in unicode
> - at the end, translate unicode into the Windows-specific character set
>   (if necessary)

Yes. That's the way to go; if you follow the above guidelines when working
with character data, you should not encounter big unicode problems.
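
A minimal sketch of that workflow, written for Python 3 (where bytes plays
the role of the Python 2 byte string and str the role of unicode). The
helper names and the choice of windows-1252 are illustrative assumptions,
not anything taken from the original programs:

```python
# Sketch only: decode at the input boundary, work on text,
# encode at the output boundary.
# Helper names and the windows-1252 codec are assumptions for illustration.

def read_input(raw_bytes, encoding="windows-1252"):
    """Turn incoming bytes into text as early as possible."""
    return raw_bytes.decode(encoding)

def process(text):
    """All processing happens on text, never on raw bytes."""
    return text.upper()

def write_output(text, encoding="windows-1252"):
    """Encode only at the very end, for a consumer that needs bytes."""
    return text.encode(encoding)

raw = b"my_w\xf6"   # byte 246 is o-umlaut in windows-1252
result = write_output(process(read_input(raw)))
```

Only read_input and write_output need to know about a codec; everything in
between deals with text alone.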

> The above approach seems to work nicely,
> but when you manipulate string-like objects heavily it's a crime.
> It's impossible to change all my modules from strings to unicode at once,
> and it's very tempting to do just the opposite: convert everything
> into strings!

Wide is the road to hell...

> # adding unicode string and windows strings, results in an error:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + my_u

(I guess you meant my_w + my_u). Formally:

x = my_w.decode('windows-1252') + my_u  # [1]

but why are you using a byte string in the first place? Why not:

my_w = u'my_w' + u'ö'

so you can compute my_w + my_u directly?

> # to correctly handle the above (in my situation), I need to write the
> following code (which makes my code quite unreadable):
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = unicode ( my_s, 'windows-1252' )  + my_u
>
> # converting to strings gives much more readable code:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + str(my_u)

But it's not the same thing: in the former case x is a unicode
object, in the latter x is a byte string. Also, str(my_u) only works if it
contains just ASCII characters. The counterpart of my code [1] above would
be:

x = my_w + my_u.encode('windows-1252')

That is, you use some_unicode_object.encode("desired-encoding") to do the
unicode->bytestring conversion, and
some_string_object.decode("known-encoding") to convert in the opposite
sense.
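
Both directions can be sketched as follows. This is Python 3 syntax, so the
byte string is a bytes object and the unicode object is a plain str; that
translation is my assumption, since the code quoted above is Python 2:

```python
# Python 3 sketch: bytes <-> text conversions with an explicit codec.
my_u = "my_u"                    # text (a unicode object in Python 2)
my_w = b"my_w" + bytes([246])    # bytes; 246 is o-umlaut in windows-1252

as_text = my_w.decode("windows-1252") + my_u    # result is text
as_bytes = my_w + my_u.encode("windows-1252")   # result is bytes

# Encoding non-ASCII text with the 'ascii' codec fails; this is the error
# str(my_u) runs into in Python 2 when my_u holds such characters.
try:
    "\u00f6".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```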

> until I found this website:
>   http://diveintopython.org/xml_processing/unicode.html
>
> By setting the default encoding,
> I can now go to unicode much more elegantly and almost fully automatically
> (and I guess the writing to a file problem is also solved):
> # now the manipulations of strings and unicode works OK:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + my_u
>
> The only disadvantage is that you have to put a specially named file into
> the Python directory!
> So if someone knows a more elegant way to set the default codec,
> I would be much obliged.

DON'T do that. Really. Changing the default encoding is a horrible,
horrible hack and causes a lot of problems. 'Dive into Python' is a great
book, but suggesting that you alter the default character encoding is very,
very bad advice:

    - site.py and sitecustomize.py contain *global* settings, affecting
      *all* users and *all* scripts running on that machine. Other users may
      get very angry at you when their own programs break or give incorrect
      results when run with a different encoding.
    - you must have administrative rights to alter those files.
    - you won't be able to distribute your code, since almost everyone else
      in the world won't be using *your* default encoding.
    - what if another library/package/application wants to set a different
      default encoding?
    - the default encoding for Python>=3.0 is now 'utf-8' instead of 'ascii'
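
One way to avoid depending on the default encoding at all is to name the
codec wherever bytes enter or leave the program, for example when writing a
file. A small sketch, assuming Python 3; the file name and the windows-1252
codec are only placeholders:

```python
import io
import os
import tempfile

text = "my_w\u00f6"   # text containing a non-ASCII character (o-umlaut)

# Hypothetical output file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Name the codec at the I/O boundary instead of changing a global default.
with io.open(path, "w", encoding="windows-1252") as f:
    f.write(text)

# Read the file back as raw bytes to see what was actually stored.
with io.open(path, "rb") as f:
    raw = f.read()

# Read it back as text, again naming the codec explicitly.
with io.open(path, "r", encoding="windows-1252") as f:
    round_trip = f.read()
```

Because the codec is stated at each open() call, the result does not change
when the script runs on a machine with a different default encoding.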

More reasons:
http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/
See also this recent thread in python-dev:
http://comments.gmane.org/gmane.comp.python.devel/106134

-- 
Gabriel Genellina