[ python-Bugs-1668295 ] Strange unicode behaviour

SourceForge.net noreply at sourceforge.net
Sun Feb 25 20:43:47 CET 2007


Bugs item #1668295, was opened at 2007-02-25 11:10
Message generated for change (Comment added) made by gbrandl
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1668295&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Santiago Gala (sgala)
Assigned to: Nobody/Anonymous (nobody)
Summary: Strange unicode behaviour

Initial Comment:

I know that Python is rather quirky with regard to Unicode processing, but this defies everything I know.

I use the es_ES.UTF-8 locale on Linux. The command:


python -c "print unicode('á %s' % 'éí','utf8') " works, i.e., it prints á éí on the next line.

However, if I pipe it to less or redirect it to a file, it fails:

python -c "print unicode('á %s' % 'éí','utf8') " >test
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)


Why is the behaviour different when stdout is redirected? How can I get it to do "the right thing" in both cases?

----------------------------------------------------------------------

>Comment By: Georg Brandl (gbrandl)
Date: 2007-02-25 19:43

Message:
Logged In: YES 
user_id=849994
Originator: NO

First of all: Python's Unicode handling is very consistent and
straightforward, if you know the basics. Sadly, most people don't know the
difference between Unicode and encoded strings.

What you're seeing is not a bug. When you print a Unicode string to the
console and Python can determine your terminal's encoding, the string is
automatically encoded in that encoding.

If you redirect output to a file or a pipe, Python does not know which
encoding you want, so Unicode strings are encoded with the default ASCII
codec, which fails for any non-ASCII character.
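
The difference between the two cases can be reproduced without any
redirection. The following is only a minimal sketch: the variable names are
illustrative, and it encodes the same Unicode string by hand, once with
UTF-8 (what happens when the terminal encoding is known) and once with
ASCII (what happens when it is not).

```python
# The Unicode string from the report, written with escapes: u'á éí'
text = u'\xe1 \xe9\xed'

# Terminal case: the stream encoding is known (UTF-8 here), so encoding works.
as_utf8 = text.encode('utf-8')

# Redirect case: no encoding is known, so ASCII is used and encoding fails
# on the first non-ASCII character.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(as_utf8))
print(ascii_failed)
```

The UTF-8 bytes match the repr output shown elsewhere in this thread, and
the ASCII attempt raises the same UnicodeEncodeError as the redirected
command.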

Please direct further questions to the Python mailing list or newsgroup.

The basic rule when handling Unicode is: use Unicode everywhere inside the
program, and byte strings for input and output.
So, your code is exactly the other way round: it takes a byte string,
decodes it to Unicode and *then* prints it.

You should do it the other way: use Unicode literals in your code, and
when you write something to a file, *encode* them in UTF-8.
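
As a rough illustration of that rule (a sketch only, assuming Python 2.6 or
later; encode_for_output is a hypothetical helper, not part of any
library), the text is built from Unicode literals and encoded to UTF-8 only
at the moment it is written out:

```python
import os
import tempfile


def encode_for_output(text, encoding='utf-8'):
    """Hypothetical helper: encode Unicode text at the output boundary."""
    return text.encode(encoding)


# Build the message entirely from Unicode literals: u'á éí'
message = u'\xe1 %s' % u'\xe9\xed'

# Encode explicitly, then write the resulting bytes in binary mode.
payload = encode_for_output(message + u'\n')

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as handle:
        handle.write(payload)
    with open(path, 'rb') as handle:
        written = handle.read()
finally:
    os.remove(path)

print(repr(written))
```

Because the bytes are produced explicitly, the result does not depend on
whether stdout is a terminal, a pipe or a file.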

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2007-02-25 11:17

Message:
Logged In: YES 
user_id=178886
Originator: YES

I forgot to say that this happens consistently with 2.4.3, 2.5-svn and svn
trunk.

Also, some people ask for the repr of the strings (I guess to reproduce this
if they can't read the characters). Those are printed in UTF-8:

$ python -c "print repr('á %s')"
'\xc3\xa1 %s'
$ python -c "print repr('éi')"
'\xc3\xa9i'

----------------------------------------------------------------------


