Full unicode support for the import machinery

Hi,

For some months now I have been trying to fix Python to support undecodable bytes in the Python path. My first attempt was really huge and sometimes ugly. Where possible, I extracted short and simple patches and applied them to py3k (sometimes with an issue, sometimes directly in svn). When it was no longer possible to split the big patch, I restarted the work from scratch.

The main change from my previous attempt is that I changed import.c to use unicode strings instead of byte strings. With the surrogate hack (PEP 383), unicode is a superset of bytes and so it is "forward compatible".

I just created a branch called "import_unicode" (based on py3k) including all my patches. It's still a work in progress. It is now possible to start Python installed in an undecodable path (e.g. a directory with a non-ASCII character under the C locale on Linux), which is huge progress, but some tests are still failing.

The biggest remaining problem is that code object filenames are not reencoded after the file system encoding is changed (but sys.path and sys.modules filenames are reencoded). I think that I will register all code objects in a list to be able to reencode their filename attribute (and then drop the list).

I created an svn branch because I think that it's easier to review short commits than one unique huge patch. The branch also helps me to share the work between different computers, and allows other people to review the commits (and/or contribute!).
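For readers less familiar with PEP 383, here is a minimal sketch of the round trip that makes unicode a superset of bytes, followed by the kind of re-decoding a filename needs once the file system encoding changes. The path and the helper name `redecode` are hypothetical illustrations, not code from the import_unicode branch.

```python
# Raw bytes of a hypothetical directory name: "é" encoded as Latin-1,
# which is neither valid ASCII nor valid UTF-8.
raw = b"/home/user/caf\xe9/lib"

# With the surrogateescape handler, decoding never fails: the
# undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
as_text = raw.decode("ascii", "surrogateescape")
assert as_text == "/home/user/caf\udce9/lib"

# Encoding with the same codec and handler restores the original bytes.
assert as_text.encode("ascii", "surrogateescape") == raw


def redecode(name, old_encoding, new_encoding):
    """Re-decode a filename that was first decoded with old_encoding.

    Hypothetical helper: it recovers the original bytes, then decodes
    them with the new file system encoding.
    """
    raw_bytes = name.encode(old_encoding, "surrogateescape")
    return raw_bytes.decode(new_encoding, "surrogateescape")


# Once the real encoding is known, the escaped surrogate becomes "é".
fixed = redecode(as_text, "ascii", "iso-8859-1")
assert fixed == "/home/user/caf\xe9/lib"
```

The same idea would apply to any filename decoded before the final file system encoding is known: because the escaped form keeps every original byte, nothing is lost by decoding early and re-decoding later.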
Some people may understand my work better with the "whole picture" :-)

--

There are at least 4 issues related to this work:

#3080: Full unicode import system
#4352: imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
#8611: Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)
#8988: import + coding = failure (3.1.2/win32)

--

Some examples of previous issues related to my secret goal (patching the import machinery):

#8391: os.execvpe() doesn't support surrogates in env
#8393: subprocess: support undecodable current working directory on POSIX OS
#8412: os.system() doesn't support surrogates nor bytes
#8485: Don't accept bytearray as filenames, or simplify the API
#8514: Add fsencode() functions to os module
#8610: Python3/POSIX: errors if file system encoding is None (-> create initfsencoding() in pythonrun.c)
#8715: Create PyUnicode_EncodeFSDefault() function
...

--
Victor Stinner
http://www.haypocalc.com/

On Fri, Jul 9, 2010 at 10:11 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
I created an svn branch because I think that it's easier to review short commits than one unique huge patch. The branch also helps me to share the branch between different computers, and allow other people to review the commits (and/or contribute!).
Thanks for doing that, it does indeed make it much easier to follow your train of thought. The overall approach looks sane, and while I haven't done a line-by-line review of every patch on the branch, the ones I did examine in detail all looked correct. I'll try to keep up as you make more changes.

You're a brave soul, venturing into that there-is-no-Unicode-there-is-only-ASCII maze, but you've already made substantial improvements. The addition of new, more Unicode-friendly C APIs for errors and warnings should be of general use outside this work as well (but given where you're up to, I don't advocate trying to cherry-pick them off the branch).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Friday, 09 July 2010 at 02:11:35, Victor Stinner wrote:
I'm trying to fix Python to support undecodable bytes in the Python path (...)
My work is mostly done. I posted a patch on Rietveld and opened an issue.

http://bugs.python.org/issue9425
http://codereview.appspot.com/1874048

--
Victor Stinner
http://www.haypocalc.com/