Python under PowerShell adds characters

Steve D'Aprano steve+python at pearwood.info
Thu Mar 30 08:54:37 EDT 2017


On Thu, 30 Mar 2017 04:43 pm, Marko Rauhamaa wrote:

> Steven D'Aprano <steve at pearwood.info>:
> 
>> On Thu, 30 Mar 2017 07:29:48 +0300, Marko Rauhamaa wrote:
>>> I'd expect not having to deal with Unicode decoding exceptions with
>>> arbitrary input.
>>
>> That's just silly. If you have *arbitrary* bytes, not all
>> byte-sequences are valid Unicode, so you have to expect decoding
>> exceptions, if you're processing text.
> 
> The input is not in my control, and bailing out may not be an option:


You have to deal with bad input *somehow*. You can't just say it will never
happen. If bailing out is not an option, then perhaps the solution is not
to read stdin as Unicode text, if there's a chance that it actually doesn't
contain Unicode text. Otherwise, you have to deal with any errors.

("Deal with" can include the case of not dealing with them at all, and just
letting your script raise an exception.)



>    $ echo $'aa\n\xdd\naa' | grep aa
>    aa
>    aa
>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0:
>     invalid continuation byte

As I said, what did you expect? You chose to read from stdin as Unicode
text, then fed it something that wasn't Unicode text. That's no different
from expecting to read a file name, then passing an ASCII NUL byte.
Something is going to break, somewhere, so you have to deal with it.

I'm not sure if there are better ways, but one way of dealing with this is
to bypass the text layer and read from the raw byte-oriented stream:

[steve at ando ~]$ echo $'\xdd' | python3 -c 'import sys; print(sys.stdin.buffer.read(1))'
b'\xdd'
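
The same idea inside a script, as a rough sketch rather than anything
definitive. The helper name read_first_byte is made up for illustration,
and io.BytesIO stands in for sys.stdin.buffer so the example runs without
a pipe:

```python
import io


def read_first_byte(stream):
    """Return the first byte of a binary stream without decoding it."""
    # read(1) on a binary stream returns a bytes object of length <= 1.
    return stream.read(1)


# Stand-in for sys.stdin.buffer, holding what `echo $'\xdd'` would send.
fake_stdin = io.BytesIO(b"\xdd\n")
first = read_first_byte(fake_stdin)
print(first)
```

In a real script you would pass sys.stdin.buffer instead of the BytesIO
object. No decoding happens, so no UnicodeDecodeError can be raised.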


You have a choice. The default choice is aimed at the most common use case,
which is that input will be text, but it's not the only choice.
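
Another choice, offered only as a sketch: keep the text layer but pick a
codec error handler other than the default "strict". The sample bytes
mirror the grep example quoted above; the variable names are mine, and
io.BytesIO again stands in for sys.stdin.buffer:

```python
import io

# Sample input: two valid lines with one invalid byte between them.
data = b"aa\n\xdd\naa\n"

# The default handler ("strict") raises on the bad byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for each undecodable byte (lossy).
replaced = data.decode("utf-8", errors="replace")

# "surrogateescape" smuggles the bad byte through as a lone surrogate,
# so the original bytes can be recovered when encoding again.
escaped = data.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# The same handler applied to a stream, as one might wrap sys.stdin.buffer.
wrapped = io.TextIOWrapper(
    io.BytesIO(data), encoding="utf-8", errors="surrogateescape"
)
text = wrapped.read()
```

Which handler is right depends on whether you need the original bytes back
("surrogateescape") or just want something printable ("replace").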



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


