[issue21927] BOM appears in stdin when using Powershell

STINNER Victor report at bugs.python.org
Fri Jul 11 16:04:50 CEST 2014


STINNER Victor added the comment:

> The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001.

How you do change the console encoding? Using the chcp command?

I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?


I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and ANSI code page 1252:

- by default, the stdin encoding is cp850 (OEM code page) and os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.

If I change the console encoding using the command "chcp 65001":

- by default, the stdin encoding = os.device_encoding(0) = "cp65001".  sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) = "cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM

Note: The UTF-8 BOM is only written once, before the first character.

So the UTF-8 BOM is only written in one case under these conditions:

- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe, even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001

--

It looks like PowerShell decodes the output of the producer program (echo, type, ...) and then encodes the output to the consumer program (ex: python).

It's possible to change the encoding of the encoder by setting $OutputEncoding variable. Example to encode to UTF-8 without the BOM:

   $OutputEncoding = New-Object System.Text.UTF8Encoding($False)

Example to encode to UTF-8 without the BOM:

   $OutputEncoding = [System.Text.Encoding]::UTF8

Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even if the console encoding is cp850. If you set the console encoding to 65001 (chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get... two UTF-8 BOMs... yeah!

I tried different producer programs: [MS-DOS] echo "abc", [PowerShell] write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content document.txt, python -c "print('abc')". It doesn't like like using a different program changes anything. The UTF-8 BOM is added somewhere by PowerShell between by producer and the consumer programs.

To show the console input and output encodings in PowerShell, type "[console]::InputEncoding" and "[console]::OutputEncoding".

See also:
http://stackoverflow.com/questions/22349139/utf8-output-from-powershell

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21927>
_______________________________________


More information about the Python-bugs-list mailing list