[issue9561] distutils: set encoding to utf-8 for input and output files

Éric Araujo report at bugs.python.org
Mon Sep 13 02:49:14 CEST 2010


Éric Araujo <merwok at netwok.org> added the comment:

[Toshio, I made you nosy for a question about RPM .spec files]

>> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
>> Unicode, but your change would replace it in a better way;
> Which "trick"?

Some values are explicitly allowed to use Unicode and are encoded to UTF-8
when queried.

>> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
>> paths, which is a feature or a bugfix depending on your point of view;
> Wait. Non-encodable bytes are a separate issue. I would like to work on the
> first problem: distutils in Python 3 uses open() without an encoding
> argument, so the encoding depends on the user's locale. Said differently: if
> you produce a file with distutils on one computer, you cannot be sure that
> the file can be read with the same version of Python on another computer (if
> the locale encoding is different). E.g. Windows uses the mbcs encoding,
> whereas UTF-8 is the preferred encoding on Linux.
>
> What is the encoding of the MANIFEST file?

Python’s default encoding, unfortunately.  Try listing “napoléon” in a MANIFEST
file and you’ll get a UnicodeEncodeError because the file wants ASCII.
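
A minimal sketch of that failure, assuming Python 3 and forcing the ASCII
codec explicitly so the result does not depend on the locale. The helper name
try_write and the temporary path are made up for illustration; this is not
distutils code:

```python
import os
import tempfile


def try_write(path, text, encoding):
    """Write one line with an explicit encoding; return True on success."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text + "\n")
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "MANIFEST")  # hypothetical path
    # The ASCII codec cannot represent the "é", so the write fails.
    ascii_ok = try_write(manifest, "napoléon", "ascii")
    # The same entry is written without error as UTF-8.
    utf8_ok = try_write(manifest, "napoléon", "utf-8")
```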

>> - .def files, used by the compilers for the C linking step; I don’t know if
>> it’s appropriate to allow UTF-8 there.
>
> I don't know these files.

So we’ll have to get advice from someone well-versed in C linking.

>> - RPM spec files, which use ASCII or UTF-8 according to
>> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
>> it’s not confirmed in
>> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
>> from the LSB site), so there’s no guarantee this works for all RPM
>> platforms. This sort of platform-specific thing is the reason why RPM
>> support has been removed in distutils2.
> UTF-8 is a superset of ASCII. If you use UTF-8 but only write ASCII
> characters, your output file will be written as UTF-8... but it will also
> be valid ASCII. It's magical :-)

I know that, but it does not answer the question:  Is it okay for these files
to use UTF-8?

>> - record and .pth files created by the install command.
> .pth files contain directory names, which can be non-ASCII.

Agreed.

>> I agree that there is something to be fixed, but I don’t know if they can
>> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
>> there are files or directories in MANIFEST, spec, record and .pth.
> You can use non-ASCII characters for things other than filenames. E.g. in a
> description of a package :-)

See above: The description of a distribution is in UTF-8.  Note that I don’t
really understand my comment anymore; I now think that this should be fixed
in distutils with the least intrusive change possible.

>> If this is going to be fixed, write_file should not use UTF-8 unconditionally
>> but grow a keyword argument IMO, so that use cases requiring ASCII
>> continue to work.
> As written before, UTF-8 is a superset of ASCII. If you read a file using the
> UTF-8 encoding, you will be able to read ASCII files. But if you use UTF-8
> and write non-ASCII characters, old versions of distutils using ASCII or
> another encoding will not be able to read these files.

That’s what I meant: Don’t make write_file always use UTF-8 since some use
cases are restricted to ASCII.
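
Both halves of that argument can be checked with a short sketch, assuming
Python 3. The sample strings are made up for illustration:

```python
ascii_text = "setup.py"
non_ascii_text = "napoléon.py"

# Pure-ASCII text produces identical bytes under both codecs, so an
# ASCII-only reader can still read UTF-8 output that stays within ASCII.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Once a non-ASCII character is written as UTF-8, an ASCII-only reader
# can no longer decode the data.
data = non_ascii_text.encode("utf-8")
try:
    data.decode("ascii")
    old_reader_ok = True
except UnicodeDecodeError:
    old_reader_ok = False
```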

> About the keyword solution: yes, it would be a smooth way to fix this issue.

Let’s do it.  (Make sys.getdefaultencoding() its default value for compat.)
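
A rough sketch of what such a keyword could look like, assuming Python 3.
The name write_file_sketch and its exact signature are hypothetical, and this
is not the actual distutils patch; it only assumes that the contents are given
as a sequence of lines without terminators:

```python
import os
import sys
import tempfile


def write_file_sketch(filename, contents, encoding=None):
    """Write each string in contents as one line of filename.

    Hypothetical helper: encoding falls back to sys.getdefaultencoding()
    when it is not given, as proposed above for compatibility.
    """
    if encoding is None:
        encoding = sys.getdefaultencoding()
    with open(filename, "w", encoding=encoding) as f:
        for line in contents:
            f.write(line + "\n")


# A caller that needs non-ASCII entries asks for UTF-8 explicitly.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "MANIFEST")  # hypothetical path
    write_file_sketch(path, ["napoléon", "README"], encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
```

Callers restricted to ASCII could pass encoding="ascii" and would get a
UnicodeEncodeError instead of a file that their readers cannot parse.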

>> When you say “patch *all* functions reading files”, I guess you mean all
>> functions that read distutils files, i.e. MANIFEST and PKG-INFO.
> I don't know distutils well enough to answer my own question.

You patch writing files, I’ll handle reading files :)

----------
nosy: +a.badger

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9561>
_______________________________________

