[Python-3000] Unicode and OS strings

Tue Sep 18 06:56:37 CEST 2007

>>>>> "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

 >> When a codec encounters something it can't handle, whether it's a
 >> valid character in a legacy encoding, a private use character in a
 >> UTF, or an invalid sequence of code units, it throws an exception
 >> specifying the character or code unit and the current coded character
 >> set,

 > Does this mean that this:
 > $ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
 > would no longer print e650 in a UTF-8 locale?

What do you mean "no longer"?  Look:

chibi:MacPorts steve$ export LC_ALL=en_US.UTF-8
chibi:MacPorts steve$ python -c 'import sys; print("%s" % sys.argv[1])' $(printf "\ue650") 
\ue650
chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650") 
Traceback (most recent call last):
  File "<string>", line 1, in ?
TypeError: ord() expected a character, but string of length 6 found
chibi:MacPorts steve$ 

Note that some people are currently arguing that sys.argv should be an
array of bytes objects, and Guido has not yet said "no".  In that
case, all of the current proposals should have exactly this result:
ord() is handed a multi-byte object rather than a single character,
and raises TypeError.
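
To make the bytes-argv case concrete, here is a minimal sketch.  It
assumes a str/bytes split along the lines being discussed for Python
3000, and raw_arg is a hypothetical stand-in for a bytes-valued
sys.argv[1], not anything the interpreter actually provides.  The
choice of UTF-8 for decoding is likewise an assumption of the example.

```python
# Sketch only: raw_arg is a hypothetical stand-in for a bytes-valued sys.argv[1].
raw_arg = "\ue650".encode("utf-8")   # the three UTF-8 bytes of U+E650

try:
    ord(raw_arg)
    ord_failed = False
except TypeError:
    # ord() accepts a single character (or a single byte), not three bytes
    ord_failed = True

# Decoding explicitly, with an encoding the program chooses, yields one character.
decoded = raw_arg.decode("utf-8")
code_point = "%x" % ord(decoded)

print(len(raw_arg), ord_failed, code_point)
```

The point of the sketch is only that the program, not the interpreter,
picks the encoding; with a different locale encoding the decode step
would have to change accordingly.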

My position is that if you do something that depends on the internal
representation of implementation-dependent objects, you deserve
whatever results you get.