unicode bit me
Mark Tolonen
metolone+gmane at gmail.com
Sat May 9 14:06:56 EDT 2009
<anuraguniyal at yahoo.com> wrote in message
news:994147fb-cdf3-4c55-8dc5-62d769b12cdc at u9g2000pre.googlegroups.com...
> Sorry for being unclear again, hmm I am becoming an expert in it.
>
> I pasted that code as a continuation of my old code at the start,
> i.e.
> class A(object):
>     def __unicode__(self):
>         return u"©au"
>
>     def __repr__(self):
>         return unicode(self).encode("utf-8")
>     __str__ = __repr__
>
> "doesn't work" means it throws a unicode error.
> My question boils down to:
> what is the difference between these two, and why does one throw an
> error while the other doesn't?
> print unicode(a)
> vs
> print unicode([a])
That is still an incomplete example. Your results depend on your source
code's encoding and your system's stdout encoding. Assuming a=A(),
unicode(a) returns u'©au', which is then converted to stdout's encoding for
display. With an encoding such as cp437 (U.S. Windows console), which has no
copyright character, that conversion fails. unicode([a]) is different: a
list has no __unicode__ method, so unicode() falls back on the list's
string form, which calls your __repr__ and therefore yields a byte string
containing UTF-8 bytes. The unicode() function, given a byte string of
unspecified encoding, uses the ASCII codec, and that is where the error
comes from. Decoding the byte string explicitly, as in
unicode(repr([a]),'utf-8'), will correctly convert it to unicode, and then
printing that unicode string will attempt to convert it to stdout encoding.
On a utf-8 console it will work; on a cp437 console it will not.
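The two conversions described above (byte string to unicode, then unicode to the console encoding) can be sketched separately. This is only a rough illustration written for Python 3, where bytes.decode() plays the role of Python 2's unicode(); the variable names and the can_encode helper are made up for the example, and the bytes merely stand in for what the repr of [a] would contain.

```python
# Python 3 sketch of the two conversions (the thread itself uses Python 2).
# Assumption: list_repr stands in for the byte string produced for [a].

text = "\u00a9au"                                # copyright sign followed by "au"
list_repr = b"[" + text.encode("utf-8") + b"]"   # byte string with non-ASCII bytes

# Step 1: bytes -> text. With no encoding named, Python 2 tries ASCII.
try:
    list_repr.decode("ascii")
    ascii_decode_ok = True
except UnicodeDecodeError:
    ascii_decode_ok = False

# Naming the real encoding recovers the text.
decoded = list_repr.decode("utf-8")


# Step 2: text -> stdout encoding. Whether this works depends on the console.
def can_encode(s, encoding):
    """Return True if s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


utf8_console_ok = can_encode(decoded, "utf-8")
cp437_console_ok = can_encode(decoded, "cp437")

print(ascii_decode_ok, utf8_console_ok, cp437_console_ok)
```

The point of keeping the steps apart is that they fail for different reasons: the first is a decoding problem (wrong or missing source encoding), the second an encoding problem (the target cannot represent the character).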
Here's a new one:
In PythonWin (from pywin32-313), stdout is utf-8, so:
>>> print '©' # this is a utf8 byte string
©
>>> '©' # view the utf8 bytes
'\xc2\xa9'
>>> u'©' # view the unicode character
u'\xa9'
>>> print '\xc2\xa9' # stdout is utf8, so it is understood
©
>>> print u'\xa9' # auto-converts to utf8.
©
>>> print unicode('\xc2\xa9') # encoding not given, defaults to ASCII.
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print unicode('\xc2\xa9','utf8') # provide the encoding
©
This gives different results when the stdout encoding is different. Here are
a couple of the same statements on my Windows console with cp437 encoding,
which doesn't support the copyright character:
>>> print '\xc2\xa9' # stdout is cp437, so each byte shows as a cp437 character
┬⌐
>>> print u'\xa9' # tries to convert to cp437
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 0: character maps to <undefined>
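For what it's worth, the cp437 behaviour can be reproduced without a Windows console by calling the codec directly. The sketch below is Python 3 and the variable names are just for illustration. It also shows the standard errors= argument of encode(), which is one possible way (not the only one) to get degraded output instead of an exception when the target encoding lacks a character.

```python
# Python 3 sketch: what a cp437 console does with the same data.

raw = b"\xc2\xa9"                 # the UTF-8 bytes for the copyright sign

# A cp437 console interprets each byte as one cp437 character.
as_cp437 = raw.decode("cp437")

# Encoding the copyright sign to cp437 fails because cp437 has no such
# character; an error handler substitutes something printable instead.
copyright_sign = "\u00a9"
replaced = copyright_sign.encode("cp437", errors="replace")
escaped = copyright_sign.encode("cp437", errors="backslashreplace")

print(repr(as_cp437), replaced, escaped)
```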
Hope that helps your understanding,
Mark