unicode mystery/problem

John Machin sjmachin at lexicon.net
Fri Sep 22 10:09:00 EDT 2006


Petr Jakeš wrote:
> John, thanks for your extensive answer.
> >> Hi,
> >> I am using Python 2.4.3 on Fedora Core4 and the "Eric3" Python IDE.
> >> Below mentioned code works fine in the Eric3 environment. While trying
> >> to start it from the command line, it returns:
> >>
> >> Traceback (most recent call last):
> >>   File "pokus_1.py", line 5, in ?
> >>     print str(a)
> >> UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in
> >> position 6: ordinal not in range(128)
>
> JM> So print a works, but print str(a) crashes.
>
> JM> Instead, insert this:
> JM>    import sys
> JM>    print "default", sys.getdefaultencoding()
> JM>    print "stdout", sys.stdout.encoding
> JM> and run your script at the command line. It should print:
> JM>     default ascii
> JM>     stdout x
> ****  in the command line it prints:  *****
> default ascii
> stdout UTF-8
> JM> here, and crash at the later use of str(a).
> JM> Step 2: run your script under Eric3. It will print:
> JM>     default y
> JM>     stdout z
>
> ****  in the Eric3 it prints:  ****
> if the # -*- Encoding: utf_8 -*- is set, then:
>
> default utf_8
> stdout
> unhandled AttributeError, "AsyncFile instance has no attribute
> 'encoding' "
>
> if the encoding is not set, then it prints:
>
> DeprecationWarning: Non-ASCII character '\xc3' in file
> /root/eric/analyza_dat_TPC/pokus_1.py on line 26, but no encoding
> declared; see http://www.python.org/peps/pep-0263.html for details
>   execfile(sys.argv[0], self.debugMod.__dict__)
>
> default latin-1
> stdout
> unhandled AttributeError, "AsyncFile instance has no attribute
> 'encoding' "
>
> JM> and then should work properly. It is probable that x == y == z ==
> JM> 'utf-8'
> JM> Step 3: see below.
>
> >>
> >> ========== 8< =============
> >> #!/usr/bin python
> >> # -*- Encoding: utf_8 -*-
>
> JM> There is no UTF8-encoded text in this short test script. Is the above
> JM> encoding comment merely a carry-over from your real script, or do you
> JM> believe it is necessary or useful in this test script?
> Generally, I am working with string like u'DISKOV\xc1 POLE' (I am
> getting it from the database)
>
> My intention to use >> # -*- Encoding: utf_8 -*- was to suppress
> DeprecationWarnings if I use utf_8 in the code (like u'DISKOV\xc1 POLE')
>
> >>
> >> a= u'DISKOV\xc1 POLE'
> >> print a
> >> print str(a)
> >> ========== 8< =============
> >>
> >> Even though it looks strange, I have to use the str(a) syntax even
> >> though I know the "a" variable is a string.
>
> JM> Some concepts you need to understand:
> JM> (a) "a" is not a string, it is a reference to a string.
> JM> (b) It is a reference to a unicode object (an implementation of a
> JM> conceptual Unicode string) ...
> JM> (c) which must be distinguished from a str object, which represents a
> JM> conceptual string of bytes.
> JM> (d) str(a) is trying to produce a str object from a unicode object. Not
> JM> being told what encoding to use, it uses the default encoding
> JM> (typically ascii) and naturally this will crash if there are non-ascii
> JM> characters in the unicode object.
>
> >> I am trying to use ChartDirector for Python (charts for Python) and the
> >> method "layer.addDataSet()" needs above mentioned syntax otherwise it
> >> returns an Error.
>
> JM> Care to tell us which error???
> you can see the Error description and author comments here:
> http://tinyurl.com/ezohe

You have two different episodes on that website; adding the one we have
been discussing gives *three* different stories:

Episode 1:

The error description: "TypeError: Error converting argument 1 to type
PCc" -- you should ask the package's author "What is type PCc???" If
arg 1 is an arbitrary str object, which byte values could it possibly
be objecting to?

The author comments: "The error code usually means the filename is not
a text string, ..." (1) Input file or output file? Is it possible that
one or more bytes are not allowable in a filename? (2) Is it possible
for you to give him the exact args that you are passing in (use print
repr(arg) before the call), and for him to tell you the *exact* reason,
not the "usual" reason?

Episode 2: Evidently arg is a str object, but passing in str(arg) and
just plain arg give different results??? I doubt it. Print repr(arg)
and type(arg) to see what you've actually got there.
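
A minimal sketch of that check, written for Python 3, where this thread's
unicode type is called str and its byte-string str type is called bytes.
The helper name describe is made up for illustration and is not part of
ChartDirector. It uses ascii() where Python 2 would use repr(), since in
Python 3 repr() no longer escapes non-ASCII characters.

```python
def describe(arg):
    """Return the type name and an ASCII-only repr of arg."""
    # ascii() is repr() with non-ASCII characters escaped, so the
    # result prints safely on any terminal.
    return "%s: %s" % (type(arg).__name__, ascii(arg))

text_name = 'DISKOV\xc1 POLE'            # text (Python 2: a unicode object)
byte_name = text_name.encode('utf-8')    # bytes (Python 2: a str object)

print(describe(text_name))
print(describe(byte_name))
```

Calling something like this immediately before the failing call shows at a
glance whether text or bytes is being handed over.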

>
> >>
> >> layer.addDataSet(data, colour, str(dataName))
> I have tried to experiment with the code a bit.
> the simplest code where I can demonstrate my problems:
> #!/usr/bin python
> import sys
> print "default", sys.getdefaultencoding()
> print "stdout", sys.stdout.encoding
>
> a=['P\xc5\x99\xc3\xad','Petr Jake\xc5\xa1']
> b="my nice try %s" % ''.join(a).encode("utf-8")

So ''.join(a) is a str object, encoded in utf-8 *already*.
Please try to understand:
(1) unicode_object.encode('utf-8') produces a str object in utf-8
encoding.
(2) str_object.decode('utf-8') produces a unicode object, provided
str_object contains valid utf-8.
(3) str_object.encode('anything') is nonsense; it is the equivalent
of str_object.decode('ascii').encode('anything') and will typically
fail, as your next error message shows.
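
The three rules can be sketched as follows. This is Python 3 code, so the
names differ from the thread: Python 3 str plays the role of the unicode
object and bytes plays the role of the Python 2 str object. Python 3 bytes
has no encode method at all, so rule (3) is written out as the explicit
decode('ascii') that Python 2 performs implicitly.

```python
text = 'DISKOV\xc1 POLE'                 # Python 2: u'DISKOV\xc1 POLE'

# (1) text -> bytes: encode
data = text.encode('utf-8')

# (2) bytes -> text: decode, valid because data really is UTF-8
round_trip = data.decode('utf-8')

# (3) "encoding" bytes: Python 2 first decodes them with the default
# codec (ascii); that hidden step is spelled out here.
try:
    data.decode('ascii').encode('utf-8')
    outcome = 'ok'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(repr(data), round_trip == text, outcome)
```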

What were you trying to do?? I don't understand the relationship
between this little exercise and Episodes 1, 2, & 3.

Try to concentrate on what your data is (u"DISKOetcetc" is a unicode
string, yet you say that str(x) should be unnecessary because x is
already a str object!?) and on what form it must be in to be passed
through to that package's methods.
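
One possible way to make that explicit is to convert in a single place,
naming the encoding instead of relying on the default codec. The sketch
below is Python 3 (str is text, bytes is the byte string). The helper name
to_byte_string is hypothetical, and utf-8 is only an assumption: nothing in
this thread establishes which encoding the package expects.

```python
def to_byte_string(value, encoding='utf-8'):
    """Return value as bytes, encoding text with an explicitly named codec."""
    if isinstance(value, bytes):
        return value              # already bytes: do not encode again
    return value.encode(encoding)

data_name = 'DISKOV\xc1 POLE'
print(repr(to_byte_string(data_name)))
print(repr(to_byte_string(b'plain ascii')))
```

Passing bytes through untouched avoids the double-encoding mistake in the
test script above.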


> print b
>
> When I run it from the command line i am getting:
> sys:1: DeprecationWarning: Non-ASCII character '\xc3' in file pokus_1.py on line 26, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
>
> default ascii
> stdout UTF-8
>
> Traceback (most recent call last):
>   File "pokus_1.py", line 8, in ?
>     b="my nice try %s" % ''.join(a).encode("utf-8")
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1: ordinal not in range(128)
> 

As expected. 
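
The byte and position named in that traceback can be checked directly. A
rough Python 3 sketch, with an explicit decode('ascii') standing in for the
implicit one that Python 2 applies before .encode():

```python
a = [b'P\xc5\x99\xc3\xad', b'Petr Jake\xc5\xa1']   # UTF-8 bytes, as in the script
joined = b''.join(a)

try:
    joined.decode('ascii')        # the hidden step behind str.encode() in Python 2
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec rejected
    failure = (exc.start, joined[exc.start])

print(failure)

# Decoding with the codec the bytes are actually in succeeds.
message = "my nice try %s" % joined.decode('utf-8')
print(ascii(message))
```

The first undecodable byte is 0xc5 at offset 1, matching the error message.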

Regards,
John



