string processing question
Kurt Mueller
mu at problemlos.ch
Fri May 1 05:28:12 EDT 2009
Scott David Daniels wrote:
> To discover what is happening, try something like:
> python -c 'for a in "ä", unicode("ä"): print len(a), a'
>
> I suspect that in your encoding, "ä" is two bytes long, and in
> unicode it is converted to a single character.
:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>
Yes it is. That is one of the two problems I see.
The solution for this is to call unicode(<string>, <encoding>) on each string.
I'd like to have my python programs unicode enabled.
:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
It seems that the default encoding is "ascii", so unicode() cannot cope
with "ä".
If I specify "utf8" for the encoding, unicode() works.
:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>
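The difference between the two lengths can be reproduced without a terminal. The following is only a minimal sketch: it assumes the terminal sends "ä" as the two UTF-8 bytes 0xc3 0xa4 (consistent with the 0xc3 in the traceback above), and the variable names are made up for illustration. It is written so that it should run on both Python 2 and Python 3.

```python
# Sketch: one character, counted as encoded bytes and as decoded text.
# Assumption: "a-umlaut" arrives from the terminal as UTF-8 bytes.
raw = b"\xc3\xa4"            # the byte string the shell passes in
text = raw.decode("utf-8")   # same result as unicode(raw, "utf8") in Python 2

byte_length = len(raw)       # counts bytes
char_length = len(text)      # counts characters (code points)

# Decoding with the ASCII codec fails because 0xc3 is outside range(128).
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Here `byte_length` is 2 and `char_length` is 1, which matches the "2 ä" / "1 ä" output above.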
But the print statement raises a UnicodeEncodeError
if I pipe the output to a program or a file.
:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
2 ä
1 :>
So it seems to me that piping the output changes the behavior of the
print statement:
:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>
:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
ä 2 <type 'str'>
:>
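A likely explanation, offered as an assumption about Python 2 rather than something the transcripts prove: when stdout is a terminal, print encodes unicode objects with the terminal's encoding, but when stdout is a pipe no encoding is set and the 'ascii' codec is used instead. The sketch below imitates the piped case with an in-memory byte stream (`pipe` is a hypothetical stand-in for sys.stdout) and shows that wrapping the stream in a UTF-8 writer avoids the error.

```python
import codecs
import io

text = u"\xe4"  # the unicode object that print fails on when piped

# What seems to happen when output is piped: an implicit ASCII encode.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Wrapping the byte stream so that everything written is encoded as UTF-8.
pipe = io.BytesIO()                        # stand-in for a piped stdout
writer = codecs.getwriter("utf-8")(pipe)   # StreamWriter that encodes on write
writer.write(text + u" 1\n")
piped_bytes = pipe.getvalue()
```

In a real Python 2 program the same wrapper could presumably be applied to sys.stdout itself, e.g. `sys.stdout = codecs.getwriter("utf8")(sys.stdout)`. Setting the PYTHONIOENCODING environment variable (available from Python 2.6) may be an alternative.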
How can I make my Python programs Unicode-enabled, so that:
- input strings can have different encodings (mostly ascii, latin_1 or utf8), and
- my Python programs always output "utf8"?
Is that a good idea?
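One possible shape for the two requirements above, as a hedged sketch only: decode every input to unicode as early as possible by trying the candidate encodings in order, and encode to UTF-8 only when writing. The helper names `to_text` and `to_utf8` are made up for illustration. The approach is a heuristic: latin_1 accepts any byte sequence, so it has to come last, and latin_1 input that happens to be valid UTF-8 would be misread.

```python
def to_text(raw, encodings=("ascii", "utf-8", "latin-1")):
    """Decode a byte string by trying each candidate encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Only reachable if the caller passes a list without a catch-all codec.
    raise ValueError("none of the candidate encodings matched")


def to_utf8(raw):
    """Normalize a byte string in any candidate encoding to UTF-8 bytes."""
    return to_text(raw).encode("utf-8")


from_latin1 = to_utf8(b"\xe4")       # "a-umlaut" encoded as latin_1
from_utf8 = to_utf8(b"\xc3\xa4")     # "a-umlaut" encoded as utf8
from_ascii = to_utf8(b"abc")         # plain ascii passes through unchanged
```

Both `from_latin1` and `from_utf8` end up as the same two UTF-8 bytes, so the output side of the program stays consistent regardless of the input encoding.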
TIA
--
Kurt Müller, mu at problemlos.ch