[ python-Bugs-926427 ] OEM codepage chars in mbcs filenames can be misinterpreted
SourceForge.net
noreply at sourceforge.net
Wed Mar 31 15:11:51 EST 2004
Bugs item #926427, was opened at 2004-03-31 06:04
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=926427&group_id=5470
Category: Python Library
>Group: 3rd Party
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Mike Brown (mike_j_brown)
Assigned to: Nobody/Anonymous (nobody)
Summary: OEM codepage chars in mbcs filenames can be misinterpreted
Initial Comment:
My system: Windows XP, English - US locale, Python 2.3.3
I believe the bug I am reporting here is this:
On Windows XP, when os.listdir() is called with a non-Unicode
argument, characters that are not in the default locale's
encoding (e.g. Greek capital letter Sigma, U+03A3, is not in
windows-1252) but that are in the default OEM code page
(e.g. Sigma is in cp437) get mapped to ASCII characters
other than '?'.
For example, things seem to work in a predictable way
when I put windows-1252 characters into filenames (I
do this in Explorer and then I see what
os.listdir(r'C:\path\to\the\dir') returns):
— (U+2014) becomes \x97
• (U+2022) becomes \x95
é (U+00E9) becomes \xe9
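The three results above are just the windows-1252 byte values of those characters. A minimal sketch to check this, assuming Python 3 (not the Python 2.3.3 used in this report) and Python's "cp1252" codec as a stand-in for windows-1252:

```python
# Sketch for Python 3; the report itself was filed against Python 2.3.3.
# Each character from the list above is encoded with Python's cp1252 codec.
samples = ["\u2014", "\u2022", "\u00e9"]  # em dash, bullet, e with acute

encoded = {ch: ch.encode("cp1252") for ch in samples}

for ch, raw in encoded.items():
    print("U+%04X -> %r" % (ord(ch), raw))
```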
But things are much less predictable when I use
characters from outside this range. I thought I'd try
some Greek characters first. Some of them (the ones
that happen to be in cp437, interestingly enough) come
back as random ASCII letters:
Θ (U+0398) becomes "T"
Σ (U+03A3) becomes "S"
Φ (U+03A6) becomes "F"
Greek letters that are not in cp437 come back as
question marks, as expected (I guess):
Τ (U+03A4) becomes "?"
Υ (U+03A5) becomes "?"
...as do some Hebrew letters and Japanese hiragana:
א (U+05D0) becomes "?"
ה (U+05D4) becomes "?"
ס (U+05E1) becomes "?"
あ (U+3042) becomes "?"
う (U+3046) becomes "?"
た (U+305F) becomes "?"
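The pattern described above (letters present in cp437 come back as ASCII letters, the rest as "?") can be cross-checked against Python's own codec tables. This is only a sketch, assuming Python 3 and that Python's "cp1252" and "cp437" codecs match the Windows ANSI and OEM code pages on this system; in_codepage is a hypothetical helper, not part of any library:

```python
def in_codepage(ch, codec):
    """Return True if ch can be encoded with the given Python codec."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# Code points of the Greek, Hebrew, and hiragana characters tried above.
chars = ["\u0398", "\u03a3", "\u03a6", "\u03a4", "\u03a5",
         "\u05d0", "\u05d4", "\u05e1", "\u3042", "\u3046", "\u305f"]

# Maps each character to (in cp1252?, in cp437?).
table = {ch: (in_codepage(ch, "cp1252"), in_codepage(ch, "cp437"))
         for ch in chars}

for ch in chars:
    ansi, oem = table[ch]
    print("U+%04X cp1252=%s cp437=%s" % (ord(ch), ansi, oem))
```

None of the characters is in cp1252, and only Theta, Sigma, and Phi are in cp437, which are exactly the three that came back as letters rather than "?".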
I don't know if this is something that anyone cares
about, since the filenames are useless anyway, but it
does seem to be unintended behavior.
(And before you ask, it's just a theoretical exercise; I
have no urgent need to use os.listdir with non-Unicode
directory names on Windows.)
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2004-03-31 22:11
Message:
There is nothing we can do about this: the mapping from
characters outside the ANSI CP is done completely inside
Windows, using an undocumented algorithm. This algorithm
will typically replace characters with "similar" ones.
E.g. U+0398 is GREEK CAPITAL LETTER THETA, which is similar
in sound to LATIN CAPITAL LETTER T. Similarity is sometimes
determined by sound, sometimes by glyph-likeness in a
typical font. If no similar character is available, Windows
puts in a question mark. The system call performing the
directory listing does not indicate whether such a mapping
has taken place.
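Because the listing call gives no such indication, one possible workaround is to obtain the names as Unicode (os.listdir with a Unicode argument) and test which of them cannot be represented in the ANSI code page. The following is a sketch only, assuming Python 3 and cp1252 as the ANSI code page; names_lost_in_ansi is a hypothetical helper and the sample names are made up for illustration:

```python
def names_lost_in_ansi(names, ansi_codec="cp1252"):
    """Return the names that cannot be encoded in the given ANSI codec.

    These are the names Windows would have to alter (with a "similar"
    character or a question mark) in a non-Unicode directory listing.
    """
    lost = []
    for name in names:
        try:
            name.encode(ansi_codec)
        except UnicodeEncodeError:
            lost.append(name)
    return lost

# Made-up Unicode names, standing in for what os.listdir() would return.
example = ["caf\u00e9.txt", "\u03a3.txt", "\u3042.txt"]
flagged = names_lost_in_ansi(example)
print([ascii(name) for name in flagged])
```

In real use the list would come from os.listdir() called with a Unicode path; the check itself does not depend on Windows.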
Closing this as third-party bug.
----------------------------------------------------------------------
Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-31 12:28
Message:
I've added a script that demonstrates the issue.
----------------------------------------------------------------------