Re: [Distutils] Changing the separator from - to ~ and allow all Unicode alphanumerics in package names...

10 Nov 2012

      On Fri, Nov 09, 2012 at 09:38:54PM -0500, Daniel Holth wrote:
...
> Although I think the ~ is a very ugly -, it could be useful to change the
> separator to something less commonly used than the -.
> It would be useful to be able to use the hyphen - in the version of a package
> (for semver) and elsewhere. Using it as the separator could make parsing the
> file name a bit trickier than is healthy.

Items 10 and 11 of semver are problematic.  Other people who consume
versions, for instance Linux distributions, have a history of using dashes
as a separator.  They have to deal with stripping hyphens out of versions
that make use of them.

The fact that distutils/setuptools also treats hyphens as separators is
a good thing for these audiences.
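
To make the file-name parsing concern from the quoted text concrete, here is
a minimal sketch.  The helper candidate_splits() and the sample file stems
are hypothetical and are not part of distutils or setuptools; the sketch only
counts how many ways a stem can be cut into (name, version) at a separator.

```python
def candidate_splits(stem, sep="-"):
    """Return every (name, version) pair obtainable by cutting stem at one sep."""
    parts = stem.split(sep)
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]

# Hypothetical stem with a hyphenated name and a semver-style pre-release tag.
hyphen_candidates = candidate_splits("my-package-1.0.0-beta")

# The same package with ~ as the separator, as proposed in the quoted text.
tilde_candidates = candidate_splits("my-package~1.0.0-beta", sep="~")

for name, version in hyphen_candidates:
    print(name, version)
print(tilde_candidates)
```

With - as the separator the parser has three candidate cuts and needs extra
rules to choose between them; with ~ there is only one.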

[..]
...
> If we do this, I
> would like to allow Unicode package names at the same time. safe_name(), the
> pkg_resources function that escapes package names for file names, would become
> re.sub(u"[^\w.]+", "_", u"package-name", flags=re.U)
> In other words, the rule for package names would be that they can contain any
> Unicode alphanumeric or _ or dot. Right now package names cannot practically
> contain non-ASCII because the setuptools installation will fold it all to _ and
> installation metadata will collide on the disk.

I consider the limitation of package names to ascii to be a blessing in
disguise.  In python3, unicode module names are possible but not portable
between systems.  This is because the non-ascii module names inside of a python
file are abstract text but the representation on the filesystem is whatever
the user's locale is.  The consensus on python-dev when this was brought up
seemed to be that using non-ascii in your local locale was important for
learning to use python.  But distributing non-ascii modules to other people
was a bad idea.  (If you have the attention span for long threads, 
http://mail.python.org/pipermail/python-dev/2011-January/107467.html
Note that the threading was broken several times but the subject line stayed
the same.)
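
For reference, the quoted safe_name() proposal can be tried out with the
sketch below.  Both function names are hypothetical.  proposed_safe_name()
follows the regular expression from the quoted text; ascii_only_name() is
only a stand-in for the ASCII folding that the quoted text describes, not
the actual setuptools code.

```python
import re

def proposed_safe_name(name):
    """Keep Unicode alphanumerics, _ and dot; fold every other run to _."""
    return re.sub(r"[^\w.]+", "_", name, flags=re.UNICODE)

def ascii_only_name(name):
    """Assumed stand-in for ASCII-only folding: non-ASCII runs become _."""
    return re.sub(r"[^A-Za-z0-9.]+", "_", name)

print(proposed_safe_name("package-name"))
print(proposed_safe_name("caf\u00e9"))

# Two different non-ASCII names collapse to the same string under
# ASCII-only folding, which is the collision the quoted text mentions.
print(ascii_only_name("\u65e5\u672c") == ascii_only_name("\u4e2d\u6587"))
```

The sketch shows only what the rule would accept as a name.  It says nothing
about how the resulting file name is encoded on disk, which is the problem
described below.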

Description of the non-ascii module problem for people who want a summary:

I have a python3 program that has::
  #!/usr/bin/python3 -tt
  # -*- coding: utf-8 -*-
  import café
  café.do_something()

python3 reads this file in and represents café as an abstract text type
because I wrote it using utf-8 encoding and it can therefore decode the
file's contents to its internal representation.  However it then has to find
the café module on disk.  In my environment, I have LC_ALL=en_US.utf8.
python3 finds the file café.py and uses that to satisfy the import.

However, I have a colleague who works with me.  He has access to my
program over a shared filesystem (or gets it via a git checkout, or an
sdist, etc).  His locale uses latin-1 (ISO8859-1) as its encoding (for
instance, LC_ALL=en_US.ISO8859-1).  When he runs my program, python3 is
still able to read the application file itself (due to the line in the file
that specifies it's encoded in utf-8) but when it searches for a file to
satisfy café on the disk it runs into problems because the café.py filename
is not encoded using latin-1.
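
The mismatch can be seen at the byte level with a minimal sketch.  It is a
simplification: it assumes the file name was written as utf-8 bytes and is
then interpreted as latin-1, and it does not reproduce the interpreter's
actual import machinery, whose behaviour also depends on the platform.

```python
# File name from the example above, written with an escape so that this
# snippet does not depend on its own source encoding.
module_file = "caf\u00e9.py"

# Bytes stored on disk by the author, whose locale encoding is utf-8.
bytes_on_disk = module_file.encode("utf-8")

# Bytes a latin-1 locale would use for the same name.
bytes_expected = module_file.encode("latin-1")

# What the on-disk name looks like when its bytes are read as latin-1.
name_seen = bytes_on_disk.decode("latin-1")

print(bytes_on_disk)
print(bytes_expected)
print(name_seen == module_file)
```

The two byte strings differ, so a lookup for the latin-1 form of the name
does not match the file that was created under the utf-8 locale.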

Other scenarios where the files are being shared were discussed in the
thread I mentioned but I won't go into all of them in this message...
hopefully you can generalize this example to how it will cause problems on
pypi, with pre-packaged modules on the system vs user's modules, etc.

-Toshio

Toshio Kuratomi