[issue21808] 65001 code page not supported

eryksun report at bugs.python.org
Thu Jun 19 15:06:24 CEST 2014


eryksun added the comment:

The cp65001 codec was added in Python 3.3, for what it's worth. In my experience, codepage 65001 (CP_UTF8) is broken for most console programs.

For a console buffer handle, the Windows API function WriteFile gets routed to WriteConsoleA, which has a different spec: it returns the number of wide characters written instead of the number of bytes. WriteFile then returns this number without adjusting for the fact that 1 character != 1 byte. For example, the following writes 5 bytes (3 wide characters), but WriteFile reports NumberOfBytesWritten as 3:

    >>> import sys, msvcrt 
    >>> from ctypes import windll, c_uint, byref

    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1

    >>> h_out = msvcrt.get_osfhandle(sys.stdout.fileno())
    >>> buf = '\u0100\u0101\n'.encode('utf-8')
    >>> n = c_uint()
    >>> windll.kernel32.WriteFile(h_out, buf, len(buf),
    ...                           byref(n), None)
    Āā
    1

    >>> n.value
    3
    >>> len(buf)
    5
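
To make the consequence of that mismatch concrete, here is a minimal pure-Python simulation. It is only a sketch: console_write_simulated and write_all are hypothetical stand-ins (not Windows or Python APIs), and the retry loop is an assumption about how a caller that trusts the reported count might behave. The stand-in accepts every byte but reports the character count, so the loop re-sends a tail that was already written:

```python
def console_write_simulated(sink, data):
    """Hypothetical stand-in for WriteFile on a cp65001 console.

    It accepts all of `data`, but reports the number of characters
    rather than the number of bytes, as described above.
    """
    sink.append(bytes(data))
    # errors='replace' keeps the count defined even when a retry
    # starts in the middle of a multi-byte sequence.
    return len(data.decode('utf-8', errors='replace'))


def write_all(sink, data):
    """A caller that trusts the reported count and retries the 'rest'."""
    offset = 0
    while offset < len(data):
        written = console_write_simulated(sink, data[offset:])
        if written <= 0:
            break
        offset += written
    return offset


buf = '\u0100\u0101\n'.encode('utf-8')   # 5 bytes, 3 characters
sink = []
write_all(sink, buf)
emitted = b''.join(sink)

print(len(buf), len(emitted))   # bytes intended vs. bytes actually sent
print(sink)
```

In this simulation the second call starts at byte offset 3, in the middle of the second character, so the extra output is a stray continuation byte plus the newline. What a real console does with such a fragment is not something this sketch tries to model.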

There's a similar problem with ReadFile calling ReadConsoleA.

ANSICON (github.com/adoxa/ansicon) can hook WriteFile to fix this for select programs. However, it doesn't hook ReadFile, so sys.stdin.read remains broken.

>    >>> import locale
>    >>> locale.getpreferredencoding()
>    'cp1252'

The preferred encoding is based on the Windows locale codepage, which is returned by kernel32!GetACP, i.e. the 'ANSI' codepage. If you want the console codepages that were set at program startup, look at sys.stdin.encoding and sys.stdout.encoding:

    >>> import subprocess
    >>> windll.kernel32.SetConsoleCP(1252)
    1
    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1
    >>> script = r'''
    ... import sys
    ... print(sys.stdin.encoding, sys.stdout.encoding)
    ... '''

    >>> subprocess.call('py -3 -c "%s"' % script)
    cp1252 cp65001
    0
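
To compare these values side by side in one process, something like the following could be used. It is a rough sketch: encoding_report is a hypothetical helper name, the Windows-only branch assumes the kernel32 functions GetACP, GetConsoleCP and GetConsoleOutputCP are reachable through ctypes, and the values printed depend entirely on the Python version and the console it is run in.

```python
import locale
import sys


def encoding_report():
    """Collect the locale-based and stream-based encodings."""
    report = {
        # Based on the locale ('ANSI') codepage on Windows.
        'preferred': locale.getpreferredencoding(False),
        # Stream encodings; None if a stream is missing.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }
    if sys.platform == 'win32':
        from ctypes import windll
        kernel32 = windll.kernel32
        report['ansi_codepage'] = kernel32.GetACP()
        # Both console calls return 0 if no console is attached.
        report['console_input_cp'] = kernel32.GetConsoleCP()
        report['console_output_cp'] = kernel32.GetConsoleOutputCP()
    return report


for name, value in sorted(encoding_report().items()):
    print(name, value)
```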

>    >>> locale.getlocale()
>    (None, None)
>    >>> locale.getlocale(locale.LC_ALL)
>    (None, None)

On most POSIX platforms nowadays, Py_Initialize calls setlocale(LC_CTYPE, "") to set the LC_CTYPE category to the user's default locale, in order to "obtain the locale's charset without having to switch locales". The bootstrapping process on Windows, on the other hand, doesn't use the C runtime locale, so at startup LC_CTYPE is still the default "C" locale:

    >>> locale.setlocale(locale.LC_CTYPE, None)
    'C'

This in turn gets parsed into the (None, None) tuple that getlocale() returns:

    >>> locale._parse_localename('C')
    (None, None)
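
The same chain can be reproduced on any platform by switching LC_CTYPE to "C" explicitly. This is a small sketch; ctype_locale_as_tuple is a hypothetical helper name, and it restores the previous setting before returning:

```python
import locale


def ctype_locale_as_tuple(name='C'):
    """Return (raw LC_CTYPE string, getlocale() tuple) for `name`."""
    # Passing None queries the current setting without changing it.
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        raw = locale.setlocale(locale.LC_CTYPE, None)
        return raw, locale.getlocale(locale.LC_CTYPE)
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_CTYPE, saved)


print(ctype_locale_as_tuple('C'))
```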

----------
nosy: +eryksun

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21808>
_______________________________________

