string processing question

Scott David Daniels Scott.Daniels at Acm.Org
Fri May 1 12:06:28 EDT 2009


Kurt Mueller wrote:
> Scott David Daniels wrote:
>> To discover what is happening, try something like:
>>     python -c 'for a in "ä", unicode("ä"): print len(a), a'
>>
>> I suspect that in your encoding, "ä" is two bytes long, and in
>> unicode it is converted to a single character.
> 
> :> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
> 2 ä
> 1 ä
> :>
> 
> Yes it is. That is one of the two problems I see.
> The solution for this is to unicode(<string>, <coding>) each string.
> 
> 
> I'd like to have my python programs unicode enabled.
> 
> 
> 
> 
> :> python -c 'for a in "ä", unicode("ä"): print len(a), a'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
> ordinal not in range(128)
> 
> It seems that the default encoding is "ascii", so unicode() cannot cope
> with "ä".
> If I specify "utf8" for the encoding, unicode() works.
> 
> :> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
> 2 ä
> 1 ä
> :>                 
> 
> 
> But the print statement yields a UnicodeEncodeError
> if I pipe the output to a program or a file.
> 
> :> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
> position 0: ordinal not in range(128)
> 2 ä
> 1 :>
> 
> 
> So it seems to me that piping the output changes the behavior of the
> print statement:
> 
> :> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
> len(a), type(a)'
> ä 2 <type 'str'>
> ä 1 <type 'unicode'>
> 
> :> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
> len(a), type(a)'  | cat
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
> position 0: ordinal not in range(128)
> ä 2 <type 'str'>
> :>
> 
> 
> 
> 
> How can I make my python programs unicode enabled:
> - Input strings can have different encodings (mostly ascii, latin_1 or utf8)
> - My python programs should always output "utf8".
> 
> Is that a good idea??

OK, the issue here is your use of -c, rather than an actual source file.
I don't know how to make -c take the magic initial encoding line.
If you rely on ascii source, you are safe, but have to write things like
      ms = u'That would be na\u00EFve.'
  or  ms = u'That would be na\xEFve.'
  or  ms = u'That would be na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve.'
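
A quick way to convince yourself that the three spellings are interchangeable is to compare them. This is only a small sketch; the names a, b and c are mine, and the source stays pure ASCII:

```python
# Three ASCII-only spellings of the same unicode string.
a = u'That would be na\u00EFve.'
b = u'That would be na\xEFve.'
c = u'That would be na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve.'

# All three escapes denote code point U+00EF, so the strings compare equal.
assert a == b == c
print(repr(a))
```

Which one to use is mostly a matter of taste; the \N{...} form is the longest but documents itself.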

If you do put an encoding line in your source (first or second line):
      # -*- coding: utf-8 -*-
  or  # -*- coding: iso-8859-1 -*-
  or  # -*- coding: latin-1 -*-

you can (later in that file) simply use:
      ms = u'That would be naïve.'
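
As a sketch of what that buys you (assuming the file really is saved as UTF-8, matching its coding line; the name raw is mine), the unicode string counts characters while its encoded form counts bytes, which is the same 1-versus-2 difference you saw with "ä":

```python
# -*- coding: utf-8 -*-
ms = u'That would be naïve.'

# Encoding turns the unicode string into UTF-8 bytes.
raw = ms.encode('utf-8')

# The character count and the byte count differ by one,
# because the "ï" needs two bytes in UTF-8.
print("%d characters, %d bytes" % (len(ms), len(raw)))
```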

That is, I would avoid non-ascii characters in plain (byte) string
literals in 2.X unless you have a _very_ good reason; use them, instead,
in unicode literals.
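
On the piped output and the "always output utf8" goal: as far as I can tell, when stdout is not a terminal, 2.X has no terminal encoding to go by and print falls back to ascii, which would explain the UnicodeEncodeError. One possible approach, sketched below under my own assumptions, is to decode input bytes to unicode as early as possible and encode to UTF-8 explicitly on the way out. The helpers to_unicode and utf8_writer are hypothetical names of mine, the try-utf8-then-latin_1 order is a guess rather than real detection, and io.BytesIO merely stands in for a piped stdout:

```python
import codecs
import io


def to_unicode(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: decode bytes by trying each encoding in order.

    Pure ASCII is valid UTF-8, and latin-1 accepts any byte sequence,
    so this order covers ascii, utf8 and latin_1 input. It is a guess,
    not detection: latin_1 text that happens to be valid UTF-8 is
    decoded as UTF-8.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %r" % (raw,))


def utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# io.BytesIO stands in for a piped stdout here.
buf = io.BytesIO()
out = utf8_writer(buf)
for raw in (b"\xc3\xa4", b"\xe4", b"a"):  # utf8, latin_1 and ascii input
    out.write(to_unicode(raw) + u"\n")
```

In a real 2.X script the same wrapper is commonly applied to sys.stdout itself, e.g. sys.stdout = codecs.getwriter("utf-8")(sys.stdout), so that print emits UTF-8 whether or not the output is piped. Whether that fits your programs depends on how sure you are about the input encodings.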

--Scott David Daniels
Scott.Daniels at Acm.Org


