Python usage numbers

Steven D'Aprano steve+comp.lang.python at
Sun Feb 12 10:12:57 CET 2012

On Sun, 12 Feb 2012 01:05:35 -0600, Andrew Berg wrote:

> On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
>> It's not just UTF8 either, but nearly all encodings. You can't even
>> expect to avoid problems if you stick to nothing but Windows, because
>> Windows' default encoding is localised: a file generated in (say)
>> Israel or Japan or Germany will use a different code page (encoding) by
>> default than one generated in (say) the US, Canada or UK.
> Generated by what? Windows will store a locale value for programs to
> use, but programs use Unicode internally by default

Which programs? And we're not talking about what they use internally, but 
what they write to files.

> (i.e., API calls are
> Unicode unless they were built for old versions of Windows), and the
> default filesystem (NTFS) uses Unicode for file names. 

No. File systems do not use Unicode for file names. Unicode is an 
abstract mapping between code points and characters. File systems are 
written using bytes.

Suppose you're a fan of Russian punk band Наӥв and you have a directory 
of their music. The file system doesn't store the Unicode code points 
1053 1072 1253 1074; the name has to be encoded to a sequence of bytes first.

NTFS by default uses the UTF-16 encoding, which means the actual bytes 
written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading 
byte-order mark \xff\xfe).
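
As a rough illustration of the step from code points to bytes, here is a
minimal Python 3 sketch. The name is spelled with escapes so the example
does not depend on the source file's own encoding, and the variable names
are only illustrative.

```python
import codecs

# The directory name from the example, spelled with escapes.
name = "\u041d\u0430\u04e5\u0432"

# Unicode code points: abstract numbers, not bytes.
code_points = [ord(ch) for ch in name]
print(code_points)  # [1053, 1072, 1253, 1074]

# Encoded as UTF-16 little-endian, without a byte-order mark.
utf16_le = name.encode("utf-16-le")
print(utf16_le)  # b'\x1d\x040\x04\xe5\x042\x04'

# The same bytes with a little-endian byte-order mark in front.
with_bom = codecs.BOM_UTF16_LE + utf16_le
print(with_bom)  # b'\xff\xfe\x1d\x040\x04\xe5\x042\x04'
```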

Windows has two separate APIs, one for "wide" characters, the other for 
single bytes. Depending on which one you use, the directory will appear 
to be called Наӥв or 0å2.
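
The second spelling can be reproduced by misreading the same bytes one at
a time. The sketch below does not call any Windows API; it only decodes
the byte string, and Latin-1 is an assumed stand-in for whatever
single-byte code page happens to be in effect.

```python
raw = b"\x1d\x040\x04\xe5\x042\x04"

# Read as UTF-16 (little-endian), the name comes back intact.
as_utf16 = raw.decode("utf-16-le")

# Read as a single-byte encoding, every byte becomes its own character.
as_latin1 = raw.decode("latin-1")

# Drop the unprintable control characters to see what stays visible.
visible = "".join(ch for ch in as_latin1 if ch.isprintable())
print(visible)  # 0å2
```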

But in any case, we're not talking about the file name encoding. We're 
talking about the contents of files. 

> AFAIK, only the
> terminal has a localized code page by default. Perhaps Notepad will
> write text files with the localized code page by default, but that's an
> application choice...

Exactly. And unless you know what encoding the application chooses, you 
will likely get an exception trying to read the file.
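
A small sketch of that failure in Python 3, assuming a hypothetical
application that saved its text as cp1252 and a reader that guesses UTF-8.
The sample text and the temporary file are made up for the example.

```python
import os
import tempfile

text = "na\u00efve caf\u00e9"  # contains non-ASCII characters

# Pretend some application saved the file in a legacy code page.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="cp1252") as f:
    f.write(text)

try:
    # Guessing the wrong encoding: these bytes are not valid UTF-8.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        failed = False
    except UnicodeDecodeError as err:
        failed = True
        print("cannot decode:", err)

    # Knowing the right encoding: the text round-trips.
    with open(path, encoding="cp1252") as f:
        recovered = f.read()
finally:
    os.remove(path)
```

Guessing a single-byte encoding such as Latin-1 instead would not raise at
all; it would silently produce the wrong characters, which can be harder
to notice than an exception.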

