[Tutor] subprocess.getstatusoutput : UnicodeDecodeError

Thu Sep 21 20:39:57 EDT 2017

On Thu, Sep 21, 2017 at 03:46:29PM -0700, Evuraan wrote:

> How can I work around this issue where  subprocess.getstatusoutput gives
> up, on Python 3.5.2:

getstatusoutput is a "legacy" function. It still exists for code that 
has already been using it, but it is not recommended for new code.

https://docs.python.org/3.5/library/subprocess.html#using-the-subprocess-module

Since you're using Python 3.5, let's try the `run` function, which was 
added in 3.5, and see if it does better:

import subprocess
result = subprocess.run(["tail", "-3", "/tmp/pmaster.db"], 
                        stdout=subprocess.PIPE)
print("return code is", result.returncode)
print("output is", result.stdout)

It should do better than getstatusoutput, since it returns plain bytes 
without assuming they are ASCII. You can then decode them yourself:

# try this and see if it is sensible
print("output is", result.stdout.decode('latin1'))

# otherwise this
print("output is", result.stdout.decode('utf-8', errors='replace'))
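
Here is a rough, self-contained sketch of the same idea that you can run 
without your file. The child command is made up for illustration: it 
simply writes a few bytes including 0xe0, standing in for whatever `tail` 
prints on your system.

```python
import subprocess
import sys

# Hypothetical stand-in for "tail -3 /tmp/pmaster.db": a child Python
# process that writes raw bytes, including the non-ASCII byte 0xe0.
child = r"import sys; sys.stdout.buffer.write(b'caf\xe0\n')"
result = subprocess.run([sys.executable, "-c", child],
                        stdout=subprocess.PIPE)

# run() hands back the raw bytes; nothing has been decoded yet.
raw = result.stdout
print("return code is", result.returncode)
print("raw output is", raw)

# latin1 maps every byte 0-255 to a character, so it never fails,
# but the text is only right if the data really is Latin-1.
as_latin1 = raw.decode('latin1')

# Strict UTF-8 would reject the lone 0xe0; errors='replace' swaps it
# for U+FFFD (the replacement character) instead of raising.
as_utf8 = raw.decode('utf-8', errors='replace')

print(repr(as_latin1))
print(repr(as_utf8))
```

Neither decoding raises an exception, which is the point: you decide 
what to do with the awkward bytes instead of getstatusoutput deciding 
for you.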

> >>> subprocess.getstatusoutput("tail -3 /tmp/pmaster.db",)
> Traceback (most recent call last):
[...]
>   File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
>     return codecs.ascii_decode(input, self.errors)[0]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 189:
> ordinal not in range(128)

Let's look at the error message. getstatusoutput decodes the command's 
output to text, and on your system it is apparently using the ASCII 
codec (most likely because of your locale settings), so it chokes on 
the first non-ASCII byte, namely 0xe0. 0xe0 (224 in decimal) cannot be 
an ASCII value, since ASCII only goes from 0 to 127.

If there's one non-ASCII byte in the file, there are probably more.
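
If you want to check that for yourself, something along these lines 
should do it. This is only a sketch: the sample bytes are made up, and 
for the real thing you would read the file in binary mode instead.

```python
# Made-up sample data. For the real file you would use something like:
#     data = open("/tmp/pmaster.db", "rb").read()
data = b"plain text \xe0 more text \x85 end"

first_bad = None
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte the codec rejected.
    first_bad = err.start
    print("ASCII decode failed at position", first_bad)

# Iterating over bytes yields integers; anything above 127 is not ASCII.
non_ascii = [(pos, hex(b)) for pos, b in enumerate(data) if b > 127]
print(len(non_ascii), "non-ASCII bytes:", non_ascii)
```

The list of positions and byte values is often a useful clue when 
guessing at the encoding.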

So what is that mystery 0xe0 byte? It is hard to be sure, because it 
depends on the source. If pmaster.db is a binary file, it could mean 
anything or nothing. If it is a text file, it depends on the encoding 
that the file uses. If it comes from a Mac, it might be:

py> b'\xe0'.decode('macroman')
'‡'

If it comes from Windows in Western Europe, it might be:

py> b'\xe0'.decode('latin1')
'à'

If it comes from Windows in Greece, it might be:

py> b'\xe0'.decode('iso 8859-7')
'ΰ'

and so forth. There's no absolutely reliable way to tell. This is the 
sort of nightmare that Unicode was invented to fix, but unfortunately 
there still exist millions of files, data formats and applications which 
insist on using rubbish "extended ASCII" encodings instead.
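
One way to explore the possibilities is to try a handful of candidate 
encodings and look at what each one gives you. The list of candidates 
below is just my guess at some plausible ones, not something derived 
from your file, and a successful decode does not prove the encoding is 
the right one: you still have to judge whether the text looks sensible.

```python
# Hypothetical candidate list; adjust it to wherever the file
# might have come from.
candidates = ["utf-8", "cp1252", "mac_roman", "iso8859_7"]

sample = b"\xe0"  # the byte from the traceback

results = {}
for name in candidates:
    try:
        results[name] = sample.decode(name)
    except UnicodeDecodeError:
        # This encoding cannot represent the byte sequence at all.
        results[name] = None

for name in candidates:
    if results[name] is None:
        print(name, "-> cannot decode")
    else:
        print(name, "->", repr(results[name]))
```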

> That file's content is kryptonite for python apparently. Other shell
> operations work.
> 
> >>> subprocess.getstatusoutput("file /tmp/pmaster.db",)
> (0, '/tmp/pmaster.db: Non-ISO extended-ASCII text, with very long lines,
> with LF, NEL line terminators')

The `file` command agrees with me: it is not ASCII.

-- 
Steve