[Python-Dev] My work on Python3 and non-ascii paths is done
victor.stinner at haypocalc.com
Wed Oct 20 02:11:35 CEST 2010
On Tuesday 19 October 2010 16:12:56, Barry Warsaw wrote:
> Going forward, is there adequate documentation, guidelines, and safeguards
> for future coders so that they Do The Right Thing with new code? Perhaps
> a short How To in the standard documentation would be helpful, with links
> to it from any old/bad API calls?
Hum, as usual, I suggest decoding all inputs to unicode as early as possible,
and encoding back to bytes (or another native format) at the last moment. For
filenames, it means that PyUnicode_FSDecoder() is better than
PyUnicode_FSConverter(), because it gives a unicode object (instead of a byte
string) and so the function will support unencodable characters.
Use PyUnicode_EncodeFSDefault() / PyUnicode_DecodeFSDefault() and
os.fsencode() / os.fsdecode() to encode/decode filenames instead of your own
function, to support PEP 383 (undecodable bytes <=> surrogate characters).
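A rough illustration of that round trip from the Python side (a minimal
sketch, assuming a POSIX system; the byte string is a made-up example name,
not taken from any real directory):

```python
import os

# Hypothetical filename: Latin-1 bytes, which are not valid UTF-8.
raw = b"caf\xe9.txt"

# os.fsdecode() uses the filesystem encoding with the surrogateescape
# error handler on POSIX, so any byte it cannot decode becomes a lone
# surrogate character instead of raising an error.
name = os.fsdecode(raw)

# os.fsencode() reverses the mapping and restores the original bytes.
restored = os.fsencode(name)

# The same mechanism spelled out with an explicit codec, independent of
# the locale: byte 0xE9 maps to the surrogate U+DCE9.
explicit = raw.decode("utf-8", "surrogateescape")
```

Whatever the locale encoding is, restored should be equal to raw, which is
the property a hand-written decode/encode pair usually loses.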
Also be careful to support undecodable bytes (on OSes other than Windows), e.g.
try a filename with a non-ASCII character with the C locale (ASCII locale
encoding). Even with a utf-8 filesystem encoding, this problem may occur on a
system that is not correctly configured (e.g. a USB key with the FAT filesystem using the
If you would like to avoid all encoding issues on filenames on UNIX/BSD, use
bytes: os.environb, os.listdir(b'.'), os.getcwdb(), etc.
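For example, the bytes variants can be used like this (a small sketch; the
temporary directory and the name report.txt are only there for illustration,
and os.environb is guarded because it only exists on platforms where the
environment is bytes-based):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)                    # directory path as bytes
    path = os.path.join(tmp_b, b"report.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"data")
    # Bytes argument in, list of bytes names out: nothing is decoded.
    names = os.listdir(tmp_b)

# Current working directory as bytes.
cwd = os.getcwdb()

# os.environb is only defined when os.supports_bytes_environ is true.
home = os.environb.get(b"HOME") if os.supports_bytes_environ else None
```

Since no decoding happens on this path, a name containing bytes that are
invalid in the locale encoding comes back unchanged.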
Be careful with the utf-8 codec: its default mode (strict error handler)
refuses to encode surrogate characters. E.g. print(filename) may raise a
UnicodeEncodeError. Use repr(filename) to escape surrogate characters.
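A short sketch of the failure and of the repr() workaround (the filename is a
hypothetical one built by hand; the backslashreplace variant is an additional
option, not something mentioned above):

```python
# Hypothetical name with one undecodable byte, decoded the PEP 383 way.
filename = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# The strict error handler (the default) rejects the lone surrogate.
try:
    filename.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# repr() escapes the surrogate, so the result is safe to print or log.
safe = repr(filename)

# Alternative: let the codec escape what it cannot encode.
escaped = filename.encode("utf-8", "backslashreplace")
```

Here safe contains the text \udce9 as six plain ASCII characters, so writing
it to a utf-8 stream no longer fails.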
I plan to fix the Python documentation: specify the encoding used to decode all
byte string arguments of the C API. I already wrote a draft patch: issue
#9738. This lack of documentation was a big problem for me, because I had to
follow the function calls to find out the encoding.