print() and unicode strings (python 3.1)
7stud
bbxx789_05ss at yahoo.com
Tue Aug 25 06:41:54 EDT 2009
On Aug 24, 10:09 pm, Ned Deily <n... at acm.org> wrote:
> In article
> <e5e2ec2e-2b4a-4ca8-8c0f-109e5f4eb... at v23g2000pro.googlegroups.com>,
>
>
>
> 7stud <bbxx789_0... at yahoo.com> wrote:
> > On Aug 24, 2:41 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > > > I can't figure out a way to programatically set the encoding for
> > > > sys.stdout. So where does that leave me?
>
> > > You should be setting the terminal encoding administratively, not
> > > programmatically.
>
> > The terminal encoding has always been utf-8. It was not set
> > programmatically.
>
> > It seems to me that python 3.1's string handling is broken.
> > Apparently, in python 3.1 I am unable to explicitly set the encoding
> > of a string and print() it out with the result being human readable
> > text. On the other hand, if I let python do the encoding implicitly,
> > python uses a codec I don't want it to.
>
> If you are running on a Unix-y system, check your locale settings (LANG,
> LC.*, et al). I think you'll likely find that your locale is really not
> UTF-8. The following was on Python 3.1 on OS X 10.5, similar results
> on Debian Linux:
>
> $ cat t3.py
> import sys
> print(sys.stdout.encoding)
> s = "€"
> print(s.encode("utf-8"))
> print(s)
>
> $ export LANG=en_US.UTF-8
> $ python3.1 t3.py
> UTF-8
> b'\xe2\x82\xac'
> €
>
> $ export LANG=C
> $ python3.1 t3.py
> US-ASCII
> b'\xe2\x82\xac'
> Traceback (most recent call last):
> File "t3.py", line 7, in <module>
> print(s)
> UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
> position 0: ordinal not in range(128)
>
> --
> Ned Deily,
> n... at acm.org
Hi,
Thanks for the response. My OS is Mac OS X 10.4.11. I'm not really
sure how to check my locale settings. Here is some stuff I tried:
$ echo $LANG
$ echo $LC_ALL
$ echo $LC_CTYPE
$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"
$ man locale
...
...
...
ENVIRONMENT:
LANG
Used as a substitute for any unset LC_* variable. If LANG is unset it
will act as if set to "C". If any of LANG or LC_* are set to invalid
values locale acts as if they are all unset.
===========
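For reference, the fallback described in that man page excerpt can be sketched in a few lines of Python. This is only an illustration: the helper name effective_ctype is made up, the LC_ALL, LC_CTYPE, LANG order is the usual POSIX precedence, and the sketch does not check whether a value names a valid locale.

```python
import os


def effective_ctype(env):
    """Return the locale name that would apply to LC_CTYPE.

    Hypothetical helper: checks LC_ALL, then LC_CTYPE, then LANG,
    and falls back to "C" when none of them is set to a non-empty value.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# With nothing set, the result is the portable "C" locale.
nothing_set = effective_ctype({})

# LANG alone is enough to change the result.
lang_only = effective_ctype({"LANG": "en_US.UTF-8"})

# LC_ALL takes priority over LANG when both are present.
lc_all_wins = effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"})

# The same helper applied to the real environment of this process.
print(effective_ctype(os.environ))
```

An empty `echo $LC_CTYPE` and a `LC_CTYPE="C"` line from `locale` are consistent with this fallback: the variable itself is unset, and "C" is what applies in its place.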
As in your last example, my 'C' settings mean that an ascii codec is
used somewhere to encode() the unicode string.
--
The locale C or POSIX is a portable locale; its LC_CTYPE part
corresponds to the 7-bit ASCII character set.
http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
--
Is this the way it works:
1) python picks the codec for sys.stdout based on the LANG environment
variable.
2) It doesn't matter that my terminal's encoding is set to utf-8
because output has to pass through sys.stdout first.
So:
a) My terminal's environment is telling python (and all other programs
running in the terminal) that output sent to sys.stdout must be
encoded in ascii.
b) The solution is to set a LANG environment variable.
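One way to check point 2) without touching any environment variables is to print through an in-memory text stream that has an explicit codec, standing in for sys.stdout. This is only a sketch: try_print is a hypothetical helper, and the real sys.stdout is set up by the interpreter at startup rather than like this.

```python
import io


def try_print(text, encoding):
    """Print text through a stream using the given codec.

    Hypothetical helper: returns the bytes that were written, or the
    UnicodeEncodeError if the codec cannot represent the text.
    """
    raw = io.BytesIO()
    # newline="\n" keeps the line ending the same on every platform.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()


# The same string, the same print() call, two different stream codecs.
ascii_result = try_print("€", "ascii")
utf8_result = try_print("€", "utf-8")

print(type(ascii_result).__name__)
print(utf8_result)
```

The failure happens inside the stream object, before any bytes are handed on, so whatever the terminal is able to display never comes into it.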
Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?
Previously, I've set environment variables that I want to be
permanent, e.g. PATH, in ~/.bash_profile, so I did this:
~/.bash_profile:
--------------
...
...
LANG="en_US.UTF-8"
export LANG
and now python 3.1 acts like I expect it to:
-------
import locale
import sys
print(locale.getlocale(locale.LC_CTYPE))
print(sys.stdout.encoding)
s = "€"
print(s)
print(s.encode("utf-8"))
--output:--
('en_US', 'UTF8')
UTF-8
€
b'\xe2\x82\xac'
----------
In conclusion, as far as I can tell, if your python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error. Just because the programmer's
system can handle the output doesn't mean that another user's system
can. I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.
**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).
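To make the encode error concrete, here is a small sketch of what the ascii codec does with a string it cannot represent. The sample string is made up. The errors= argument of str.encode() is standard, and passing it is one possible way for a program to get degraded output instead of a traceback; whether a program should do that is a separate question.

```python
s = "price: 5 €"

# Strict encoding (the default) raises on the first unencodable character.
try:
    s.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes a question mark for each such character.
replaced = s.encode("ascii", errors="replace")

# errors="backslashreplace" keeps the code point as a readable escape.
escaped = s.encode("ascii", errors="backslashreplace")

print(strict_failed)
print(replaced)
print(escaped)
```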
some good locale info:
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html