os.popen encoding!

Wed Feb 18 09:00:17 EST 2009

"Gabriel Genellina" <gagsl-py2 at yahoo.com.ar> writes:

>> I'm playing with os.popen function.
>> a = os.popen("somecmd").read()
>>
>> If one of the lines contains characters like "è", "æ" or any other, it looks
>> like this: "velja\xe8a 2009", with that "\xe8". It prints fine if I go:
>>
>> for i in a:
>>     print i
>
> '\xe8' is a *single* byte (not four). It is the 'LATIN SMALL LETTER E
> WITH GRAVE' Unicode code point u'è' encoded in the Windows-1252
> encoding (and latin-1, and others too).

Note that it is also 'LATIN SMALL LETTER C WITH CARON' (U+010D or
u'č'), encoded in Windows-1250, which is what the OP is likely using.

The rest of your message stands regardless: there is no problem, at
least as long as the OP only prints out the character received from
somecmd to something else that also expects Windows-1250.  The problem
would arise if the OP wanted to store the string in a PyGTK label
(which expects UTF-8) or send it to a web browser (which expects
explicit encoding, probably defaulting to UTF-8), in which case he'd
have to disambiguate whether '\xe8' refers to U+010D or to U+00E8 or
something else entirely.
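
To make the ambiguity concrete, here is a small sketch (Python 3 syntax) that
decodes the byte string from the original post under both candidate
encodings and re-encodes each result as UTF-8. Which encoding somecmd really
uses is an assumption either way; the point is only that the same input
byte yields two different characters and two different UTF-8 sequences.

```python
# The raw bytes as they appear in the original post.
raw = b"velja\xe8a 2009"

# Same byte, two interpretations.
as_cp1250 = raw.decode("cp1250")  # 0xE8 -> U+010D (c with caron)
as_cp1252 = raw.decode("cp1252")  # 0xE8 -> U+00E8 (e with grave)

# Once decoded, each string has an unambiguous UTF-8 form.
utf8_cp1250 = as_cp1250.encode("utf-8")
utf8_cp1252 = as_cp1252.encode("utf-8")

print(ascii(as_cp1250), utf8_cp1250)
print(ascii(as_cp1252), utf8_cp1252)
```

Nothing in the bytes themselves says which reading is right; that knowledge
has to come from whoever knows what somecmd emits.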

That is the problem that Python 3 solves by requiring (or strongly
suggesting) that such disambiguation be performed as early in the
program as possible, preferably while the characters are being read
from the outside source.  A similar approach is possible using Python
2 and its unicode type, but since the OP never specified exactly which
problem he had (except for the repr/str confusion), it's hard to tell
if using the unicode type would help.
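
For illustration only, decoding at the boundary might look like the sketch
below (Python 3 syntax). The child process is a hypothetical stand-in for
somecmd, the helper name read_command is made up for this example, and
"cp1250" is a guess at what the OP's command produces. In Python 2 the
equivalent step would be calling .decode("cp1250") on the str returned by
os.popen(...).read() to obtain a unicode object.

```python
import subprocess
import sys

# Hypothetical stand-in for "somecmd": writes raw Windows-1250 bytes to stdout.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'velja\\xe8a 2009\\n')",
]


def read_command(cmd, encoding):
    """Run cmd and decode its output immediately, at the program boundary."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding)


# The encoding is an assumption about the command, stated once, up front.
text = read_command(child, "cp1250")

# From here on the program handles text, and can encode it for any consumer.
utf8_bytes = text.encode("utf-8")
print(ascii(text))
```

The rest of the program then never sees the ambiguous byte; it sees only
U+010D, and can re-encode it as UTF-8 for a label or a browser.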