[Python-Dev] My work on Python3 and non-ascii paths is done

Wed Oct 20 02:11:35 CEST 2010

On Tuesday 19 October 2010 16:12:56, Barry Warsaw wrote:
> Going forward, is there adequate documentation, guidelines, and safeguards
> for future coders so that they Do The Right Thing with new code?  Perhaps
> a short How To in the standard documentation would be helpful, with links
> to it from any old/bad API calls?

Hum, as usual, I suggest decoding all inputs to unicode as early as possible,
and encoding back to bytes (or another native format) at the last moment. For
filenames, this means that PyUnicode_FSDecoder() is better than
PyUnicode_FSConverter(), because it gives a unicode object (instead of a byte
string), so the function will support unencodable characters.

Use PyUnicode_EncodeFSDefault() / PyUnicode_DecodeFSDefault() and
os.fsencode() / os.fsdecode() to encode/decode filenames instead of your own
function, to support PEP 383 (undecodable bytes <=> surrogate characters).
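
As a minimal sketch of that round trip from Python, assuming a POSIX system
(the filename below is made up for illustration, and the decoded text depends
on the locale encoding):

```python
import os

# Hypothetical raw filename as returned by the OS: "cafe" with a Latin-1
# e-acute (0xE9), which is not valid UTF-8 and not ASCII.
raw = b"caf\xe9.txt"

# Decode once at the boundary; the result is always a str, with any
# undecodable byte mapped to a lone surrogate (PEP 383).
name = os.fsdecode(raw)
assert isinstance(name, str)

# os.fsdecode() passes str through unchanged, so it is safe to call on
# input that may already be decoded.
assert os.fsdecode(name) == name

# Encode back at the last moment; on POSIX this is expected to return
# the original bytes.
assert os.fsencode(name) == raw
```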

Also be careful to support undecodable bytes (on OSes other than Windows), e.g.
try a filename with a non-ASCII character in the C locale (ASCII locale
encoding). Even with a utf-8 filesystem encoding, this problem may occur on a
system that is not correctly configured (e.g. a USB key with a FAT filesystem
using the "wrong" encoding).

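Both situations can be reproduced without touching the locale by spelling out
the codec and the surrogateescape error handler by hand. This is only a
sketch: the filenames are invented, and real code should go through
os.fsdecode() / os.fsencode() rather than hard-coding an encoding.

```python
# Case 1: a UTF-8 encoded filename read under the C locale (ASCII).
utf8_bytes = "\u00e9.txt".encode("utf-8")            # b'\xc3\xa9.txt'
in_c_locale = utf8_bytes.decode("ascii", "surrogateescape")
# Each non-ASCII byte becomes one lone surrogate in the range U+DC80..U+DCFF.
assert in_c_locale == "\udcc3\udca9.txt"

# Case 2: a Latin-1 encoded filename (e.g. from a misconfigured USB key)
# read on a system whose filesystem encoding is UTF-8.
latin1_bytes = "\u00e9.txt".encode("latin-1")        # b'\xe9.txt'
on_utf8_system = latin1_bytes.decode("utf-8", "surrogateescape")
assert on_utf8_system == "\udce9.txt"

# In both cases, encoding with the same codec and handler restores the
# original bytes, so the file can still be opened.
assert in_c_locale.encode("ascii", "surrogateescape") == utf8_bytes
assert on_utf8_system.encode("utf-8", "surrogateescape") == latin1_bytes
```
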
If you would like to avoid all encoding issues on filenames on UNIX/BSD, use 
bytes: os.environb, os.listdir(b'.'), os.getcwdb(), etc.
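
For example, a rough sketch of staying in bytes from end to end (the
temporary directory and the file name are made up for the example; os.environb
only exists where os.supports_bytes_environ is true):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Convert the directory name to bytes once, then stay in bytes.
    tmp_b = os.fsencode(tmp)
    path = os.path.join(tmp_b, b"report.txt")
    with open(path, "wb") as f:
        f.write(b"ok")

    # A bytes argument makes os.listdir() return bytes entries.
    entries = os.listdir(tmp_b)

# Bytes variants of the current directory and the environment.
cwd = os.getcwdb()
if os.supports_bytes_environ:
    home = os.environb.get(b"HOME")  # bytes, or None if HOME is unset
```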

Be careful with the utf-8 codec: its default mode (strict error handler) 
refuses to encode surrogate characters. Eg. print(filename) may raise a 
UnicodeEncodeError. Use repr(filename) to escape surrogate characters.
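
A small illustration of the failure and of the repr() workaround, using a
hand-written filename that contains one surrogate (the name is hypothetical):

```python
# A filename as PEP 383 would produce it for the bytes b'caf\xe9.txt'
# decoded from UTF-8: the byte 0xE9 shows up as the surrogate U+DCE9.
filename = "caf\udce9.txt"

# The strict utf-8 codec refuses the surrogate. This is the error that
# print(filename) may hit when stdout is encoded to utf-8.
try:
    filename.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# repr() escapes the surrogate, so the result is safe to print or log.
escaped = repr(filename)
print(escaped)

# To get the original bytes back, encode with surrogateescape instead.
raw = filename.encode("utf-8", "surrogateescape")
```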

--

I plan to fix the Python documentation: specify the encoding used to decode all
byte string arguments of the C API. I already wrote a draft patch: issue
#9738. This lack of documentation was a big problem for me, because I had to
follow the function calls to find out the encoding.

-- 
Victor Stinner
http://www.haypocalc.com/