unicode bit me

Sat May 9 14:06:56 EDT 2009

<anuraguniyal at yahoo.com> wrote in message 
news:994147fb-cdf3-4c55-8dc5-62d769b12cdc at u9g2000pre.googlegroups.com...
> Sorry being unclear again, hmm I am becoming an expert in it.
>
> I pasted that code as continuation of my old code at start
> i.e
>  class A(object):
>      def __unicode__(self):
>          return u"©au"
>
>      def __repr__(self):
>          return unicode(self).encode("utf-8")
>      __str__ = __repr__
>
> doesn't work means throws unicode error
> my question boils down to
> what is diff between, why one doesn't throws error and another does
> print unicode(a)
> vs
> print unicode([a])

That is still an incomplete example.  Your results depend on your source 
code's encoding and your system's stdout encoding.  Assuming a=A(), 
unicode(a) returns u'©au', which print then converts to stdout's encoding 
for display.  With an encoding such as cp437 (U.S. Windows console), which 
has no copyright character, that conversion fails.  unicode([a]) is 
different: a list has no __unicode__, so unicode() falls back to str([a]), 
which is a byte string built from your __repr__ (UTF-8 encoded here, since 
__repr__ calls .encode("utf-8")).  The unicode() function, given a byte 
string of unspecified encoding, uses the ASCII codec, and that is where your 
error comes from.  unicode(str([a]),'utf-8') will correctly convert it to 
unicode, and then printing that unicode string will attempt to convert it to 
stdout encoding.  On a utf-8 console it will work; on a cp437 console it 
will not.
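
A rough sketch of the decoding step, for anyone who wants to try it without 
the class.  It builds by hand the bytes that str([a]) would produce and 
decodes them with each codec.  The helper name decode_or_reason is made up 
for this illustration, and the code uses .decode() rather than unicode() so 
that it should run on both Python 2.6+ and Python 3.3+:

```python
# Bytes that A.__repr__ from the question would return: u"\xa9au" as UTF-8.
item_repr = u'\xa9au'.encode('utf-8')

# str([a]) wraps the item's repr in brackets; built by hand here (assumption:
# this mirrors what the list's str() produces for a single element).
list_repr = b'[' + item_repr + b']'


def decode_or_reason(data, encoding):
    """Hypothetical helper: return the decoded text, or the codec's complaint."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'error: ' + exc.reason


# Roughly what unicode(x) does when no encoding is given: ASCII.
as_ascii = decode_or_reason(list_repr, 'ascii')

# Roughly what unicode(x, 'utf-8') does: decode with the stated codec.
as_utf8 = decode_or_reason(list_repr, 'utf-8')
```

as_ascii ends up holding the "ordinal not in range(128)" complaint, while 
as_utf8 holds the text u'[\xa9au]'.  Whether that text can then be printed 
is a separate question that depends on stdout.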

Here's a new one:

In PythonWin (from pywin32-313), stdout is utf-8, so:

>>> print '©'  # this is a utf8 byte string
©
>>> '©'  # view the utf8 bytes
'\xc2\xa9'
>>> u'©'  # view the unicode character
u'\xa9'
>>> print '\xc2\xa9'  # stdout is utf8, so it is understood
©
>>> print u'\xa9'  # auto-converts to utf8.
©
>>> print unicode('\xc2\xa9')  # encoding not given, defaults to ASCII.
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print unicode('\xc2\xa9','utf8')  # provide the encoding
©

This gives different results when the stdout encoding is different.  Here 
are a couple of the same instructions on my Windows console with cp437 
encoding, which doesn't support the copyright character:

>>> print '\xc2\xa9' # stdout is cp437
┬⌐
>>> print u'\xa9'  # tries to convert to cp437
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 0: character maps to <undefined>
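
Both cp437 results can be reproduced on any machine by naming the codec 
explicitly instead of relying on the console.  This is only a sketch: the 
variable names are made up, and the last line uses the standard 'replace' 
error handler as one possible fallback, not as the recommended fix:

```python
# The copyright sign as UTF-8 bytes, as in the first cp437 example.
utf8_bytes = u'\xa9'.encode('utf-8')

# A cp437 console reads each byte as its own character, giving two
# unrelated glyphs (BOX DRAWINGS and REVERSED NOT SIGN).
misread = utf8_bytes.decode('cp437')

# Printing the unicode string means encoding to cp437, which has no
# slot for U+00A9, so the codec raises.
try:
    u'\xa9'.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# One possible fallback: substitute '?' for anything the codec lacks.
fallback = u'\xa9'.encode('cp437', 'replace')
```

Here misread is u'\u252c\u2310' (the two characters shown on the cp437 
console), encodable is False, and fallback is a single '?' byte.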

Hope that helps your understanding,
Mark