[Tutor] myown.getfilesystemencoding()

eryksun eryksun at gmail.com
Thu Sep 5 01:03:23 CEST 2013


On Wed, Sep 4, 2013 at 8:39 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
> But given that chcp returns cp850 on my windows system (commandline),
> wouldn't it be more descriptive if sys.getfilesystemencoding()
> returned 'cp850'?

The common file systems (NTFS, FAT32, UDF, exFAT) support Unicode
filenames. The console also uses Unicode, but proper display depends
on the current font.

The cmd shell encodes to the current codepage when redirecting output
from an internal command, unless it was started with /U to force
Unicode (e.g. cmd /U /c dir > files.txt). For subprocess, run cmd.exe
explicitly with /U (i.e. don't use shell=True), and decode the output
as UTF-16. Also, some utilities, such as tree.com, display Unicode
fine but always use the OEM code page when output is redirected to a
file or pipe (i.e. changing the console code page won't help).
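
As a rough sketch of the subprocess approach (Windows only; the function
names run_internal_command and decode_cmd_unicode are made up for this
example), something like the following should work. It assumes the /U
output is UTF-16LE without a byte order mark, which is what I'd expect
from cmd.exe:

```python
import subprocess


def decode_cmd_unicode(raw):
    """Decode bytes written by 'cmd.exe /U' (assumed UTF-16LE, no BOM)."""
    return raw.decode('utf-16-le')


def run_internal_command(command):
    """Run a cmd.exe internal command such as 'dir' and return Unicode text.

    Windows only. cmd.exe is started explicitly with /U, so shell=True
    is not needed. The command string is passed through unmodified.
    """
    raw = subprocess.check_output('cmd.exe /U /c ' + command)
    return decode_cmd_unicode(raw)
```

On Windows, print(run_internal_command('dir /b')) would then give the
directory listing as a Unicode string, regardless of the console
codepage.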

> In other words: In the code below, isn't line [1] an obfuscated version of
> line [2]? Both versions return only question marks on my system.
>
> # Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
> import ctypes
>
> ords = [3629, 3633, 3585, 3625, 3619, 3652, 3607, 3618]
> u = "".join([unichr(i) for i in ords])
> print u.encode("mbcs") # [1]
>
> #cp850 is what chcp returns on my Windows system
> print u.encode("cp850", "replace") # [2]
>
> thai_latin_cp = "cp874"
> cp_ = int(thai_latin_cp[2:])
> ctypes.windll.kernel32.SetConsoleCP(cp_)
> ctypes.windll.kernel32.SetConsoleOutputCP(cp_)
> print u.encode("cp874", "replace")

"mbcs" is the ANSI codepage (1252), not the OEM codepage (850) or the
console's current codepage. Neither 1252 nor 850 supports Thai
characters, which is why both lines print only question marks. It
would be better to compare an OEM box drawing character:

    >>> from unicodedata import name
    >>> u = u'\u2500'
    >>> name(u)
    'BOX DRAWINGS LIGHT HORIZONTAL'

    >>> name(u.encode('850', 'replace').decode('850'))
    'BOX DRAWINGS LIGHT HORIZONTAL'

    >>> name(u.encode('mbcs', 'replace').decode('mbcs'))
    'HYPHEN-MINUS'
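
The same kind of check can be done for the Thai string without relying
on "mbcs", which only exists on Windows. This is just a sketch using
Python's bundled codecs (cp1252, cp850, cp874); the helper name
survives_round_trip is made up for the example:

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3

ords = [3629, 3633, 3585, 3625, 3619, 3652, 3607, 3618]
text = u"".join(unichr(i) for i in ords)


def survives_round_trip(text, codec):
    """Return True if text is unchanged after encoding and decoding."""
    encoded = text.encode(codec, 'replace')  # unsupported chars become '?'
    return encoded.decode(codec) == text


for codec in ('cp1252', 'cp850', 'cp874'):
    print('%s: %s' % (codec, survives_round_trip(text, codec)))
```

Only cp874 should report True here, since the other two codepages have
no Thai characters to map to.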

> ctypes.windll.kernel32.SetConsoleCP() and SetConsoleOutputCP seem useful.
> Can these functions be used to correctly display the Thai characters on
> my western European Windows version? (last block of code is an attempt)
> Or is that not possible altogether?

If stdout is a console, a write eventually ends up at WriteConsoleA(),
which decodes to the console's native Unicode based on the current
output codepage. If you're using codepage 874 and the current font
supports Thai characters, it should display fine. It's also possible
to write a Unicode string directly by calling WriteConsoleW with
ctypes.
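
A minimal sketch of the WriteConsoleW route might look like this
(Windows only, and only when stdout really is a console, not a file or
pipe). The names write_console and utf16_length are mine, not part of
any library:

```python
import ctypes

STD_OUTPUT_HANDLE = -11  # Win32 constant for the standard output handle


def utf16_length(text):
    """Count UTF-16 code units, which is the unit WriteConsoleW expects."""
    return len(text.encode('utf-16-le')) // 2


def write_console(text):
    """Write a Unicode string to the console; return code units written.

    Windows only. Raises an OSError subclass if the call fails, e.g.
    because stdout has been redirected and is not a console handle.
    """
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    # Handles are pointer-sized, so avoid the default int return type.
    kernel32.GetStdHandle.restype = ctypes.c_void_p
    kernel32.WriteConsoleW.argtypes = [
        ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_ulong,
        ctypes.POINTER(ctypes.c_ulong), ctypes.c_void_p]
    kernel32.WriteConsoleW.restype = ctypes.c_int

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = ctypes.c_ulong(0)
    ok = kernel32.WriteConsoleW(handle, text, utf16_length(text),
                                ctypes.byref(written), None)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value
```

With the Thai string from your example, write_console(u + u'\n') skips
the codepage conversion entirely, so only the console font matters.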

