[issue21927] BOM appears in stdin when using Powershell
eryksun
report at bugs.python.org
Wed Jul 16 19:43:08 CEST 2014
eryksun added the comment:
> PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.buffer.read()))"
> b'?\r\n'
> Curiously, it appears as if powershell is actually receiving
> a question mark from the pipe.
PowerShell calls ReadConsoleW to read the console input buffer, so it reads "£" from the command line as a wide character. Its default encoding when writing to a pipe should be ASCII [*]. If so, that explains the question mark that Python reads from stdin: "?" is the default replacement character (WC_DEFAULTCHAR) that WideCharToMultiByte substitutes for a character it can't encode.
[*] http://blogs.msdn.com/b/powershell/archive/2006/12/11/outputencoding-to-the-rescue.aspx
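The same substitution can be reproduced on the Python side. This is only a sketch of the effect, assuming the pipe encoding really is ASCII with "?" as the replacement; it doesn't call PowerShell or WideCharToMultiByte, and the variable names are just for illustration:

```python
# "£" (U+00A3) has no ASCII representation, so a lossy encoder
# substitutes "?" for it. errors="replace" mimics that behavior.
text = "£\r\n"
piped = text.encode("ascii", errors="replace")

# This matches the bytes Python read from stdin in the report above.
print(repr(piped))
```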
You can change PowerShell's output encoding to match the console:
$OutputEncoding = [Console]::OutputEncoding
If the console codepage is 65001, the above is equivalent to setting:
$OutputEncoding = [System.Text.Encoding]::UTF8
http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8
As Victor mentioned, this setting always writes a BOM, and under codepage 65001 it actually writes two BOMs (at least in PowerShell 2). Victor also mentioned that you can avoid the BOM by passing $False to the UTF8Encoding constructor:
$OutputEncoding = New-Object System.Text.UTF8Encoding($False)
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding
There's still a BOM under codepage 65001, but maybe that's fixed in PowerShell 3.
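If the BOM can't be avoided on the PowerShell side, it could also be dropped on the Python side. A minimal sketch, assuming the input is UTF-8 bytes read from sys.stdin.buffer; strip_leading_boms is a hypothetical helper, not something from the stdlib, and the sample bytes are made up to simulate the two-BOM case:

```python
import codecs


def strip_leading_boms(data: bytes) -> bytes:
    """Remove every UTF-8 BOM at the start of data (hypothetical helper)."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data


# Simulated pipe contents with two BOMs in front of the text.
raw = codecs.BOM_UTF8 * 2 + "£\r\n".encode("utf-8")

# The utf-8-sig codec strips only one BOM; the second one survives
# as U+FEFF at the start of the decoded string.
single = raw.decode("utf-8-sig")

# Stripping all leading BOMs first gives the text that was intended.
clean = strip_leading_boms(raw).decode("utf-8")
```

For the single-BOM case, decoding with utf-8-sig alone is enough.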
I avoid setting the console to codepage 65001 anyway. ReadFile/WriteFile incorrectly return the number of characters read/written instead of the number of bytes because the call is actually handled by ReadConsoleA/WriteConsoleA. Maybe that's finally fixed in Windows 8.
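To see why a character count in place of a byte count matters, compare the two for the "£" from the original report. This only shows the arithmetic; it doesn't call ReadFile or WriteFile, and the variable names are illustrative:

```python
text = "£"
encoded = text.encode("utf-8")

chars = len(text)      # number of characters
nbytes = len(encoded)  # number of bytes in the UTF-8 encoding

# The two differ for any non-ASCII text, so a caller that asked to
# write nbytes bytes and is told chars were written will think the
# write was short.
print(chars, nbytes, encoded)
```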
----------
nosy: +eryksun
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21927>
_______________________________________