[ python-Bugs-926427 ] OEM codepage chars in mbcs filenames can be misinterpreted

SourceForge.net noreply at sourceforge.net
Wed Mar 31 15:11:51 EST 2004


Bugs item #926427, was opened at 2004-03-31 06:04
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=926427&group_id=5470

Category: Python Library
>Group: 3rd Party
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Mike Brown (mike_j_brown)
Assigned to: Nobody/Anonymous (nobody)
Summary: OEM codepage chars in mbcs filenames can be misinterpreted

Initial Comment:
My system: Windows XP, English - US locale, Python 
2.3.3

I believe the bug I am reporting here is this:

On Windows XP, when os.listdir() is called with a non-Unicode 
argument, characters that are not in the default locale's 
encoding but are in the default OEM code page get mapped to 
ASCII characters other than '?'. For example, Greek capital 
letter Sigma (U+03A3) is not in windows-1252 but is in cp437.

For example, things seem to work in a predictable way 
when I put windows-1252 characters into filenames (I 
do this in Explorer and then I see what 
os.listdir(r'C:\path\to\the\dir') returns):

— (U+2014) becomes \x97
• (U+2022) becomes \x95
é (U+00E9) becomes \xe9
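
The three mappings above match what a strict windows-1252 encode
produces. A minimal sketch, assuming a current Python 3 interpreter
(the report itself used Python 2.3.3) and a hypothetical helper named
to_ansi; it uses Python's own cp1252 codec, not the Windows API:

```python
# Characters from the report, written as escapes so the encoding of
# this source file does not matter.
samples = {
    "\u2014": "EM DASH",
    "\u2022": "BULLET",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
}


def to_ansi(ch, codepage="cp1252"):
    """Encode one character strictly (hypothetical helper).

    Raises UnicodeEncodeError if the character is not in the code page.
    """
    return ch.encode(codepage)


for ch, name in samples.items():
    print("U+%04X %s -> %r" % (ord(ch), name, to_ansi(ch)))
```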

But things are much less predictable when I use 
characters from outside this range. I thought I'd try 
some Greek characters first. Some of them (the ones 
that happen to be in cp437, interestingly enough) come 
back as random ASCII letters:

Θ (U+0398) becomes "T"
Σ (U+03A3) becomes "S"
Φ (U+03A6) becomes "F"

Greek letters that are not in cp437 come back as 
question marks, as expected (I guess):
Τ (U+03A4) becomes "?"
Υ (U+03A5) becomes "?"

...as do some Hebrew letters and Japanese hiragana:
א (U+05D0) becomes "?"
ה (U+05D4) becomes "?"
ס (U+05E1) becomes "?"
あ (U+3042) becomes "?"
う (U+3046) becomes "?"
た (U+305F) becomes "?"
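
The grouping above (in windows-1252, in cp437 only, in neither) can be
checked against Python's bundled codec tables. This is only a sketch
under the assumption of Python 3; in_codepage and classify are
hypothetical names, and the codecs are strict, so they show which
characters exist in each code page but not what Windows substitutes
for them:

```python
def in_codepage(ch, codepage):
    """Return True if ch can be encoded strictly in the given code page."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def classify(ch):
    """Label a character relative to windows-1252 (ANSI) and cp437 (OEM)."""
    if in_codepage(ch, "cp1252"):
        return "ansi"
    if in_codepage(ch, "cp437"):
        return "oem-only"
    return "neither"


# Theta, Sigma, Phi, Tau, Upsilon, alef, hiragana a
for ch in "\u0398\u03a3\u03a6\u03a4\u03a5\u05d0\u3042":
    print("U+%04X %s" % (ord(ch), classify(ch)))
```

The characters reported as coming back as ASCII letters are the ones
this sketch labels "oem-only"; the ones reported as "?" are labeled
"neither".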

I don't know if this is something that anyone cares 
about, since the filenames are useless anyway, but it 
does seem to be unintended behavior.

(And before you ask, it's just a theoretical exercise; I 
have no urgent need to use os.listdir with non-Unicode 
directory names on Windows.)

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2004-03-31 22:11

Message:
Logged In: YES 
user_id=21627

There is nothing we can do about this: the mapping of
characters outside the ANSI code page is done completely
inside Windows, using an undocumented algorithm. This algorithm
will typically replace characters with "similar" ones. 

E.g. U+0398 is GREEK CAPITAL LETTER THETA, which is similar
in sound to LATIN CAPITAL LETTER T. Similarity is sometimes
determined by sound, sometimes by glyph-likeness in a
typical font. If no similar character is available, Windows
puts in a question mark. The system call performing the
directory listing does not indicate whether such a mapping
has taken place.

Closing this as third-party bug.
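
Since the byte-string listing cannot report that a substitution took
place, one possible workaround is to list the directory with Unicode
names and test each name against the ANSI code page yourself. A rough
sketch, assuming Python 3, a hypothetical helper named lossy_names,
and windows-1252 as the ANSI code page (on a real system the code page
may differ):

```python
def lossy_names(names, codepage="cp1252"):
    """Return the names that cannot be encoded strictly in codepage.

    These are the names Windows would have to alter (by substituting
    similar characters or '?') in a non-Unicode listing.
    """
    bad = []
    for name in names:
        try:
            name.encode(codepage)
        except UnicodeEncodeError:
            bad.append(name)
    return bad


# Illustrative names only; in practice the list would come from a
# Unicode directory listing such as os.listdir(path).
names = ["caf\u00e9.txt", "\u0398.txt", "plain.txt"]
print(lossy_names(names))
```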

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-31 12:28

Message:
Logged In: YES 
user_id=371366

I've added a script that demonstrates the issue.

----------------------------------------------------------------------
