string processing question
Piet van Oostrum
piet at cs.uu.nl
Sat May 2 00:08:55 CEST 2009
>>>>> Kurt Mueller <mu at problemlos.ch> (KM) wrote:
>KM> But from the command line python interprets the code
>KM> as 'latin_1' I presume. That is why I have to convert
>KM> the "ä" with unicode().
>KM> Am I right?
There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
sequence of bytes and passes them to the shell. How the characters
are encoded depends on the encoding used in the terminal emulator. So
for example when the terminal is set to utf-8, your "ä" is converted
to two bytes: \xc3 and \xa4.
2. The shell passes these bytes to the python command.
3. The python interpreter must interpret these bytes with some decoding.
If you use them in a byte string they are copied as such, so in the
example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
If your terminal encoding had been iso-8859-1, the string
would have had a single byte '\xe4'. If you use it in a unicode
string the Python parser has to convert it to unicode. If there is an
encoding declaration in the source then that is used. Of course it
should be the same as the actual encoding used by the shell (or the
editor when you have a script saved in a file), otherwise you have a
problem. If there is no encoding declaration in the source Python has
to guess. It appears that in Python 2.x the default is iso-8859-1 but
in Python 3.x it will be utf-8. You should avoid making any
assumptions about this default.
4. During runtime unicode characters that have to be printed, written to
a file, passed as file names or arguments to other processes etc.
have to be encoded again to a sequence of bytes. In this case Python
refuses to guess. Also you can't use the same encoding as in step 3,
because the program can run on a completely different system than
where it was compiled to byte code. So if the (unicode) string isn't
ASCII and no encoding is given you get an error. The encoding can be
given explicitly, or depending on the context, by sys.stdout.encoding,
sys.getdefaultencoding() or PYTHONIOENCODING (from 2.6 on).
Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
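
The first three stages can be imitated by hand. What follows is only a
sketch, written in Python 3 syntax (where byte strings and unicode strings
are separate types) so that it runs on a current interpreter. The variable
names are made up for the illustration, and "\u00e4" is the escape for "ä".

```python
# Python 3 sketch of stages 1-3; all names here are illustrative only.
char = "\u00e4"  # the character "ä"

# Stage 1: the terminal encodes the keystroke according to its own setting.
utf8_bytes = char.encode("utf-8")         # b'\xc3\xa4', two bytes
latin1_bytes = char.encode("iso-8859-1")  # b'\xe4', one byte

# Stage 3: the interpreter decodes whatever bytes it received.
right = utf8_bytes.decode("utf-8")        # matching encoding: one character
wrong = utf8_bytes.decode("iso-8859-1")   # mismatch: two characters

print(len(utf8_bytes), len(latin1_bytes))  # 2 1
print(len(right), len(wrong))              # 1 2
```

The mismatch in the last decode is the situation where the terminal sends
utf-8 but the source is read as iso-8859-1.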
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
unicode characters.
If I do
python -c 'print u"ä"'
in my terminal I therefore get two characters: Ã¤
but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.
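
Both outcomes can be reproduced without a terminal. This is a rough sketch
in Python 3 syntax, assuming the two-character string that results from
reading the utf-8 bytes as iso-8859-1; the names text, shown and message
are invented for the example.

```python
import sys

# The two characters Python ended up with: utf-8 bytes read as iso-8859-1.
text = b"\xc3\xa4".decode("iso-8859-1")

# Case 1: the output encoding is known to be utf-8 (a utf-8 terminal).
# Each of the two characters is encoded separately, giving four bytes,
# which the terminal displays as two characters.
shown = text.encode("utf-8")
print(shown)  # b'\xc3\x83\xc2\xa4'

# Case 2: no output encoding is known, so ASCII is used and encoding fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)

# What the current interpreter would use for stage 4 on this system.
print(sys.stdout.encoding, sys.getdefaultencoding())
```

Setting PYTHONIOENCODING in the environment before starting python is one
way to supply the missing output encoding in the second case.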
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
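
The effect of the declaration can also be checked from inside Python by
compiling the same bytes under different declarations. A minimal sketch in
Python 3 syntax, assuming the source arrives as utf-8 bytes; count_chars
and body are hypothetical names, and it relies on compile() honouring the
coding declaration when it is given a bytes object.

```python
# The one-line program as the bytes a utf-8 terminal would deliver.
# "\u00e4" is the character "ä".
body = 'n = len("\u00e4")\n'.encode("utf-8")


def count_chars(declared):
    """Compile body under the given coding declaration and return n."""
    header = ("# -*- coding: %s -*-\n" % declared).encode("ascii")
    namespace = {}
    exec(compile(header + body, "<demo>", "exec"), namespace)
    return namespace["n"]


print(count_chars("utf-8"))       # 1: declaration matches the bytes
print(count_chars("iso-8859-1"))  # 2: each byte becomes its own character
```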
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org