Changing the separator from - to ~ and allowing all Unicode alphanumerics in package names...

Although I think the ~ is a very ugly -, it could be useful to change the separator to something less commonly used than the -.
It would be useful to be able to use the hyphen - in the version of a package (for semver) and elsewhere. Using it as the separator could make parsing the file name a bit trickier than is healthy.
This change would affect PEP 376 which reads:
This distinct directory is named as follows::
name + '-' + version + '.dist-info'
so a .dist-info directory that today, with the hyphen/underscore folding re.sub('[^A-Za-z0-9.]+', '-', version), is named
python_package-1.0.0_four+seven.dist-info
could become
python-package~1.0.0-four+seven.dist-info
It would also affect pip, setuptools, and the wheel peps. If we do this, I would like to allow Unicode package names at the same time. safe_name(), the pkg_resources function that escapes package names for file names, would become
re.sub(u"[^\w.]+", "_", u"package-name", flags=re.U)
In other words, the rule for package names would be that they can contain any Unicode alphanumeric or _ or dot. Right now package names cannot practically contain non-ASCII because the setuptools installation will fold it all to _ and installation metadata will collide on the disk.
safe_version(), presently the same as safe_name() would also need to allow + for semver.
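For concreteness, a minimal sketch of the proposed rules (assuming 2012-era Python 2.7+, where re.sub grew the flags argument; the function names are mine, not pkg_resources'):

    import re

    def proposed_safe_name(name):
        # any Unicode alphanumeric, '_' or '.'; every other run folds to '_'
        return re.sub(u"[^\w.]+", u"_", name, flags=re.UNICODE)

    def proposed_safe_version(version):
        # same rule, except '+' also survives so semver build metadata works
        return re.sub(u"[^\w.+]+", u"_", version, flags=re.UNICODE)

    print(proposed_safe_name(u"python package"))       # python_package
    print(proposed_safe_version(u"1.0.0-four+seven"))  # 1.0.0_four+seven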
Does anyone have the energy to actually implement a proof-of-concept?

On Fri, Nov 09, 2012 at 09:38:54PM -0500, Daniel Holth wrote:
Although I think the ~ is a very ugly -, it could be useful to change the separator to something less commonly used than the -.
It would be useful to be able to use the hyphen - in the version of a package (for semver) and elsewhere. Using it as the separator could make parsing the file name a bit trickier than is healthy.
Items 10 and 11 of semver are problematic. Other people who consume versions, for instance Linux distributions, have a history of using dashes as separators; they have to deal with stripping hyphens out of versions that make use of them.
The fact that distutils/setuptools also treats hyphens as separators is a good thing for these audiences.
[..]
If we do this, I would like to allow Unicode package names at the same time. safe_name(), the pkg_resources function that escapes package names for file names, would become
re.sub(u"[^\w.]+", "_", u"package-name", flags=re.U)
In other words, the rule for package names would be that they can contain any Unicode alphanumeric or _ or dot. Right now package names cannot practically contain non-ASCII because the setuptools installation will fold it all to _ and installation metadata will collide on the disk.
I consider the limitation of package names to ASCII to be a blessing in disguise. In python3, unicode module names are possible but not portable between systems. This is because the non-ascii module names inside of a python file are abstract text, but the representation on the filesystem is whatever the user's locale is. The consensus on python-dev when this was brought up seemed to be that using non-ascii in your local locale was important for learning to use python, but distributing non-ascii modules to other people was a bad idea. (If you have the attention span for long threads, see http://mail.python.org/pipermail/python-dev/2011-January/107467.html -- note that the threading was broken several times but the subject line stayed the same.)
Description of the non-ascii module problem for people who want a summary:
I have a python3 program that has::

    #!/usr/bin/python3 -tt
    # -*- coding: utf-8 -*-

    import café

    café.do_something()
python3 reads this file in and represents café as an abstract text type because I wrote it using utf-8 encoding and it can therefore decode the file's contents to its internal representation. However it then has to find the café module on disk. In my environment, I have LC_ALL=en_US.utf8. python3 finds the file café.py and uses that to satisfy the import.
However, I have a colleague that works with me. He has access to my program over a shared filesystem (or it's distributed to him via a git checkout, copied via an sdist, etc). His locale uses latin-1 (ISO8859-1) as his encoding (for instance, LC_ALL=en_US.ISO8859-1). When he runs my program, python3 is still able to read the application file itself (due to the piece of the file that specifies it's encoded in utf-8), but when it searches for a file to satisfy café on the disk it runs into problems because the café.py filename is not encoded using latin-1.
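To see concretely why the lookup fails, here is a minimal illustration (assuming Python 3; the example is mine): the name café is simply different bytes under the two encodings::

    # the filename written to disk under a utf-8 locale
    print("café.py".encode("utf-8"))    # b'caf\xc3\xa9.py'
    # the bytes a latin-1 locale produces/expects for the same text
    print("café.py".encode("latin-1"))  # b'caf\xe9.py'
    # under the latin-1 locale, the utf-8 bytes on disk no longer
    # decode to 'café', so the import machinery cannot find the module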
Other scenarios where the files are being shared were discussed in the thread I mentioned but I won't go into all of them in this message... hopefully you can generalize this example to how it will cause problems on pypi, with pre-packaged modules on the system vs user's modules, etc.
-Toshio

On 11/10/2012 03:38 AM, Daniel Holth wrote:
Although I think the ~ is a very ugly -, it could be useful to change the separator to something less commonly used than the -.
It would be useful to be able to use the hyphen - in the version of a package (for semver) and elsewhere. Using it as the separator could make parsing the file name a bit trickier than is healthy.
This change would affect PEP 376 which reads:
This distinct directory is named as follows::
name + '-' + version + '.dist-info'
so a .dist-info directory that today, with the hyphen/underscore folding re.sub('[^A-Za-z0-9.]+', '-', version), is named
python_package-1.0.0_four+seven.dist-info
could become
python-package~1.0.0-four+seven.dist-info
It would also affect pip, setuptools, and the wheel peps. If we do this, I would like to allow Unicode package names at the same time. safe_name(), the pkg_resources function that escapes package names for file names, would become
re.sub(u"[^\w.]+", "_", u"package-name", flags=re.U)
In other words, the rule for package names would be that they can contain any Unicode alphanumeric or _ or dot. Right now package names cannot practically contain non-ASCII because the setuptools installation will fold it all to _ and installation metadata will collide on the disk.
safe_version(), presently the same as safe_name() would also need to allow + for semver.
How about, rather than trying to create a complicated 1:1 mapping between metadata and filename, just use a hash of the metadata?
There should still be a "user-readable" part of the filename, but it now only has to be a one-way function that is allowed to collide. So if you do, for some reason, have a collision in the user-friendly part, the hash still saves you:
python-package-1.0.0-fa43534fbab3434534aba
python-package-1.0.0-3423432534abbcaba3423
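Something like this sketch of the idea (the helper name and hash choice are mine, purely illustrative)::

    import hashlib

    def dist_info_name(name, version, metadata_bytes):
        # a user-readable prefix plus a one-way hash of the metadata;
        # a collision in the prefix is harmless, the digest disambiguates
        digest = hashlib.sha1(metadata_bytes).hexdigest()[:20]
        return "%s-%s-%s.dist-info" % (name, version, digest)

    print(dist_info_name("python-package", "1.0.0", b"Metadata-Version: 1.2"))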
Dag Sverre

I consider the limitation of package names to ASCII to be a blessing in disguise. In python3, unicode module names are possible but not portable between systems. This is because the non-ascii module names inside of a python file are abstract text, but the representation on the filesystem is whatever the user's locale is. The consensus on python-dev when this was brought up seemed to be that using non-ascii in your local locale was important for learning to use python, but distributing non-ascii modules to other people was a bad idea. (If you have the attention span for long threads, see http://mail.python.org/pipermail/python-dev/2011-January/107467.html -- note that the threading was broken several times but the subject line stayed the same.)
Description of the non-ascii module problem for people who want a summary:
I have a python3 program that has::

    #!/usr/bin/python3 -tt
    # -*- coding: utf-8 -*-

    import café

    café.do_something()
python3 reads this file in and represents café as an abstract text type because I wrote it using utf-8 encoding and it can therefore decode the file's contents to its internal representation. However it then has to find the café module on disk. In my environment, I have LC_ALL=en_US.utf8. python3 finds the file café.py and uses that to satisfy the import.
However, I have a colleague that works with me. He has access to my program over a shared filesystem (or it's distributed to him via a git checkout, copied via an sdist, etc). His locale uses latin-1 (ISO8859-1) as his encoding (for instance, LC_ALL=en_US.ISO8859-1). When he runs my program, python3 is still able to read the application file itself (due to the piece of the file that specifies it's encoded in utf-8), but when it searches for a file to satisfy café on the disk it runs into problems because the café.py filename is not encoded using latin-1.
Horrifying. All codecs that are not utf-8 should be banned, except on Windows. Or at least warn("Your Unicode is broken"); in fact, just put that in site.py unconditionally.
However remember that a non-ASCII pypi name ☃ could still be just "import snowman". Only the .dist-info directory ☃-1.0.0.dist-info would necessarily contain the higher Unicode characters.
I will keep the - and document the - to _ folding convention. - turns into _ when going into a filename, and _ turns back into - when parsed out of a filename.
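In other words (a quick sketch; the function names are illustrative, not an existing API)::

    import re

    def escape(name):
        # runs of '-' (or any other separator) become '_' going into a filename
        return re.sub(r"[^A-Za-z0-9.]+", "_", name)

    def unescape(component):
        # '_' turns back into '-' when parsed out of a filename
        return component.replace("_", "-")

    assert unescape(escape("python-package")) == "python-package"
    # note the folding is lossy: a literal '_' in a name also comes back as '-'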
The alternative to putting the metadata in the filename (which, btw, isn't that big of a problem) is to have indexed metadata. IIUC apt-get and yum work this way and the filename does not matter at all. The tradeoff is, of course, that you have to generate the index. The simple index is a significant convenience of easy_install-derived systems.
Daniel Holth

On Mon, Nov 12, 2012 at 02:34:14PM -0500, Daniel Holth wrote:
Horrifying. All codecs that are not utf-8 should be banned, except on Windows.
<nod> I made that argument on python-dev but it didn't win the necessary people over. I don't recall why so you'd have to look at the thread to see what's already been argued.
(This is assuming that we aren't talking about locale settings in general but only reading the module filenames off of the filesystem. Banning non-utf8 locales isn't a good idea since there are areas of the world where utf-8 isn't going to be adopted anytime soon.)
Or at least warn("Your Unicode is broken"); in fact, just put that in site.py unconditionally.
If python itself adds that to site.py, that would be great. But individual sites adding things to site.py only makes python code written at one site non-portable.
However remember that a non-ASCII pypi name ☃ could still be just "import snowman". Only the .dist-info directory ☃-1.0.0.dist-info would necessarily contain the higher Unicode characters.
<nod> I wasn't thinking about that. If you specify that the metadata directories (if they contain the unicode characters) must be encoded in utf-8 (or at least, must be in a specific encoding on a specific platform), then that would work. Be sure to specify the encoding and use it explicitly when decoding filenames, rather than the implicit decoding which relies on the locale, though. (I advise having unittests where the locale is set to something non-utf-8 (the C locale works well) to test this, or someone who doesn't remember this conversation will make a mistake someday.) If you rely on the implicit conversion with locale, you'll eventually end up back in the mess of having bytes that you don't know what to do with.
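A sketch of what "explicitly" means here (assuming Python 3.2+ for os.fsencode; the helper is hypothetical)::

    import os

    def dist_info_dirs(path):
        # ask the filesystem for bytes so the locale never gets a say...
        for entry in os.listdir(os.fsencode(path)):
            if entry.endswith(b".dist-info"):
                # ...then decode with the encoding the spec mandates
                yield entry.decode("utf-8")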
I will keep the - and document the - to _ folding convention. - turns into _ when going into a filename, and _ turns back into - when parsed out of a filename.
Cool. Thanks.
The alternative to putting the metadata in the filename which btw isn't that big of a problem, is to have indexed metadata. IIUC apt-get and yum work this way and the filename does not matter at all. The tradeoff is of course that you have to generate the index. The simple index is a significant convenience of easy_install derived systems.
<nod>. I've liked the idea of putting metadata about all installed modules into a separate index. It makes possible writing a new import mechanism that uses the index to load modules more efficiently on systems with large sys.path's, and it makes multiple versions of a module on a system easier to implement.
However, there are some things to consider:
* The python module case will be a bit more complex than yum and apt because you'll need to keep per-user databases and per-system databases (so that there's a place for users to keep the metadata for modules that they install into user-writable directories).
* Users will need to run commands to install, update, and remove the metadata from those indexes.
* yum also needs to deal with non-utf-8 data, but some of those issues are due to legacy concerns and others are due to filenames.
  - Legacy: package names, package descriptions, etc, in those worlds can contain non-utf8 data because the underlying systems (rpm and dpkg) predate unicode. For package descriptions, I know that yum continues to store pure bytes and translates them to a "sensible" representation when it loads. For package names I'm unsure. The major distributions that yum works for specify that package names must be utf-8, so yum may specify utf-8. OTOH, yum is distro agnostic and $random_rpm_from_the_internet can still use random bytes in its package name, so yum may still have to deal with bytes here.
  - Filenames: those are still bytes because there's nothing that enforces utf-8. If you're keeping a list of filenames in the metadata, you still have to deal with those bytes somehow. So yum and python packaging tools would still have to make decisions about what to do with those. For yum, it stores the bytes and has to operate on bytes, converting to unicode (as best it can) when displaying data. Python packaging tools can take a different path, but they will need to make explicit assertions about their treatment of encodings to do so.
    + For instance, they could assert that all filenames must be utf-8 -- anything else is an error and cannot be packaged (a sketch of this follows the list).
    + A more complex example would be to store utf-8 in internal package metadata but have the capability to translate from the user's locale settings when reading off the filesystem, then create utf-8 filenames when writing out. This gets a bit dodgy since the user can create the package, then install it on their system, and the installed package would fail to find modules because they're no longer in the user's locale.
    + A third example, which I currently view as un-workable, is to read from the filesystem with the user's locale, store utf-8 in the metadata, and then translate to the user's locale on output. This doesn't work for multi-user systems and will fail when a module has characters that aren't available in the user's locale.
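The sketch promised above, for the "must be utf-8" option (assuming Python 3; the function name is illustrative)::

    def assert_packageable(filename_bytes):
        # refuse to package anything whose filename is not valid utf-8
        try:
            return filename_bytes.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError("non-utf-8 filename, cannot package: %r"
                             % (filename_bytes,))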
-Toshio

On Mon, Nov 12, 2012 at 3:20 PM, Toshio Kuratomi a.badger@gmail.com wrote:
Or at least warn("Your Unicode is broken"); in fact, just put that in site.py unconditionally.
If python itself adds that to site.py, that would be great. But individual sites adding things to site.py only makes python code written at one site non-portable.
It is a joke. Python would just print "Your Unicode is broken" on startup, just to let you know, regardless of your platform or LOCALE.
However remember that a non-ASCII pypi name ☃ could still be just "import snowman". Only the .dist-info directory ☃-1.0.0.dist-info would necessarily contain the higher Unicode characters.
<nod> I wasn't thinking about that. If you specify that the metadata directories (if they contain the unicode characters) must be encoded in utf-8 (or at least, must be in a specific encoding on a specific platform), then that would work. Be sure to specify the encoding and use it explicitly when decoding filenames, rather than the implicit decoding which relies on the locale, though. (I advise having unittests where the locale is set to something non-utf-8 (the C locale works well) to test this, or someone who doesn't remember this conversation will make a mistake someday.) If you rely on the implicit conversion with locale, you'll eventually end up back in the mess of having bytes that you don't know what to do with.
I will keep the - and document the - to _ folding convention. - turns into _ when going into a filename, and _ turns back into - when parsed out of a filename.
Cool. Thanks.
The alternative to putting the metadata in the filename (which, btw, isn't that big of a problem) is to have indexed metadata. IIUC apt-get and yum work this way and the filename does not matter at all. The tradeoff is, of course, that you have to generate the index. The simple index is a significant convenience of easy_install-derived systems.
<nod>. I've liked the idea of putting metadata about all installed modules into a separate index. It makes possible writing a new import mechanism that uses the index to load modules more efficiently on systems with large sys.path's, and it makes multiple versions of a module on a system easier to implement.
However, there are some things to consider:
I was actually thinking of the server (pypi) side.
It would also be worthwhile to define an install-side hook, minimally "packaging.reindex()", or "reindex(list of changed packages)". By default it would do nothing because the default implementation would look at all the .dist-info directories every time, but you could plug in a more complicated implementation. It would be, by design, less flexible than the current "anything that has an info directory on the path is installed automatically" system.
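A sketch of that default (packaging.reindex() is the hook name floated above; the body is my illustrative guess, not an existing function)::

    import os
    import sys

    def reindex(changed=None):
        # default behaviour: ignore 'changed' and rescan every .dist-info
        # directory on sys.path; a real installer could plug in an
        # implementation that maintains a persistent index instead
        found = []
        for entry in sys.path:
            if os.path.isdir(entry):
                found.extend(d for d in os.listdir(entry)
                             if d.endswith(".dist-info"))
        return found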