[Distutils] Problem Report

Mon Aug 17 21:37:17 CEST 2015

On 17 August 2015 at 18:12, Erik Bray <erik.m.bray at gmail.com> wrote:
> On Thu, Aug 13, 2015 at 4:42 AM, 俞博文 <stevenybw at hotmail.com> wrote:
>> Dear Maintainers:
>>
>> This problem occurs when:
>> 1. The platform is Windows.
>> 2. Python is installed on a non-Latin path (for example, a path containing
>> Chinese characters).
>> 3. You try to "pip install theano".
>>
>> And I found the problem is in distutils.command.build_scripts module's
>> copy_scripts function, on line 106
>>
>>                     executable = os.fsencode(executable)
>>                     shebang = b"#!" + executable + post_interp + b"\n"
>>                     try:
>>                         shebang.decode('utf-8')
>>
>> os.fsencode actually encodes the path as GBK on Windows, so the result
>> will certainly fail to decode as UTF-8.
>>
>> Solution:
>>
>> #executable = os.fsencode(executable) (delete this line)
>> executable = executable.encode('utf-8')
>>
>> Theano successfully installed after this patch.
>
> Hi,
>
> This is a bit tricky--I think, from the *nix perspective, using
> os.fsencode() looks like the correct approach here.  However, if
> sys.getfilesystemencoding() != 'utf-8', and if the result of
> os.fsencode(executable) is not decodable as utf-8, then that's going
> to be a problem for the Python interpreter which begins reading a file
> as utf-8 until it gets to the coding token.
>
> Unfortunately this is a bit contradictory--if the path to the
> interpreter in the local filesystem encoding is not UTF-8 it is
> impossible to parse that file in Python.  On Windows this shouldn't
> matter--I agree with your patch, that it should just write the shebang
> line in UTF-8.  However, on *nix systems it really should be using
> os.fsencode, I think.
>
> I wonder if this was brought up in the discussion around PEP-263.  I
> feel like as long as the file encoding is declared to be the same as
> whatever encoding was used to write the shebang line, that this
> should be valid.  However, the Python interpreter still tries to
> interpret the shebang line as UTF-8, and hence falls over in your
> case.  This is unfortunate...
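
To make the failure concrete, here is a minimal sketch of the mismatch described above. The interpreter path is made up, and the explicit "gbk" codec only stands in for what os.fsencode() would presumably return on a Windows machine whose ANSI code page is GBK; I have not run this against the reporter's setup.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical interpreter path containing Chinese characters.
executable = "C:\\中文\\Python34\\python.exe"

# Assumption: encoding with "gbk" imitates os.fsencode() on a Windows
# system configured with a GBK (cp936) ANSI code page.
gbk_shebang = b"#!" + executable.encode("gbk") + b"\n"

# The reporter's proposed fix: always encode the path as UTF-8.
utf8_shebang = b"#!" + executable.encode("utf-8") + b"\n"

print(decodes_as_utf8(gbk_shebang))
print(decodes_as_utf8(utf8_shebang))
```

The GBK-encoded line is what trips the shebang.decode('utf-8') check quoted above; the UTF-8 line passes it.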

There are a number of questions here, which I don't currently have
time to dig into, I'm afraid:

1. The original post specifies Windows, so I'll stick to that. Unix is
a whole other situation, and I won't cover that as I have no expertise
there. But it will need reviewing by someone who does know.
2. Where is the shebang being used? I can think of at least 3
possibilities, and they are all parsed with different code. If it's
written to a .py file executed by the user (via the launcher) it
should be UTF-8 as that's what the launcher uses. If it's written to
the embedded python script in a pip (distlib) single-file exe wrapper,
it should probably also use UTF-8, as the distlib wrappers use code
derived from the launcher code (I believe) and therefore probably also
use UTF-8. If it's an old-style setuptools 2-file exe wrapper (.exe
and -script.py) then it should use whatever that exe requires - I have
no idea what that might be, but UTF-8 is still the only really sane
choice, it's just that the setuptools wrapper was written some time
ago and may not have made that choice. Someone should check.
3. Long story short, use UTF-8, but you may need to check the code
that interprets the shebang just to be sure. Any actual patch needs to
be conditional on the OS as well (unless it turns out that UTF-8 is
the right answer everywhere, which frankly I doubt...)
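
As a rough sketch of what an OS-conditional patch could look like (not the actual distutils code, and not a tested fix): build_shebang and its platform parameter are names I made up for illustration, and the choice of os.fsencode() on non-Windows systems just follows Erik's suggestion and still needs review by someone who knows Unix.

```python
import os
import sys


def build_shebang(executable, post_interp=b"", platform=None):
    """Build a shebang line as bytes (hypothetical helper).

    `platform` defaults to sys.platform; it is a parameter only so that
    both branches can be exercised on any machine.
    """
    if platform is None:
        platform = sys.platform

    if platform == "win32":
        # Windows: the launcher reads the shebang as UTF-8.
        encoded = executable.encode("utf-8")
    else:
        # Elsewhere: keep the filesystem encoding (assumption, unreviewed).
        encoded = os.fsencode(executable)

    shebang = b"#!" + encoded + post_interp + b"\n"

    # The interpreter reads the first line as UTF-8 before it sees any
    # coding declaration, so refuse anything that does not decode.
    try:
        shebang.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError(
            "The shebang ({!r}) is not decodable from utf-8".format(shebang)
        )
    return shebang
```

For example, build_shebang("C:\\中文\\python.exe", platform="win32") gives a UTF-8 encoded line regardless of the local code page. Whether that is what each of the three consumers above actually expects still has to be checked.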

Paul