[Distutils] Changing the separator from - to ~ and allow all Unicode alphanumerics in package names...

Mon Nov 12 21:20:26 CET 2012

On Mon, Nov 12, 2012 at 02:34:14PM -0500, Daniel Holth wrote:
> 
> 
> Horrifying. All codecs that are not utf-8 should be banned, except on Windows.
>
<nod>  I made that argument on python-dev but it didn't win the necessary
people over.  I don't recall why so you'd have to look at the thread to see
what's already been argued.

(This is assuming that we aren't talking about locale settings in general
but only reading the module filenames off of the filesystem.  Banning non-utf8
locales isn't a good idea since there are areas of the world where utf-8
isn't going to be adopted anytime soon.)

> Or at least warn("Your Unicode is broken"); in fact, just put that in site.py
> unconditionally.
> 
If python itself adds that to site.py, that would be great.  But individual
sites adding things to site.py only makes python code written at one site
non-portable.

> However remember that a non-ASCII pypi name ☃ could still be just "import
> snowman". Only the .dist-info directory ☃-1.0.0.dist-info would necessarily
> contain the higher Unicode characters.
>
<nod>  I wasn't thinking about that.  If you specify that the metadata
directories (if they contain the unicode characters) must be encoded in
utf-8 (or at least, must be in a specific encoding on a specific platform),
then that would work.  Be sure to specify the encoding and use it
explicitly when decoding filenames, rather than the implicit decoding which
relies on the locale, though.  (I advise having unittests where the locale is
set to something non-utf-8 (the C locale works well) to test this, or someone
who doesn't remember this conversation will make a mistake someday.)  If you
rely on the implicit conversion with locale, you'll eventually end up back
in the mess of having bytes that you don't know what to do with.
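
A minimal sketch of the explicit approach, assuming the metadata
directories are required to be utf-8 and end in ".dist-info" as in the
example above.  The helper name list_dist_info is hypothetical.  Passing
a bytes path to os.listdir() returns bytes names, so no locale-dependent
decoding of the entries happens behind your back:

```python
import os


def list_dist_info(site_dir):
    """Return .dist-info directory names, decoded explicitly as utf-8.

    Hypothetical helper: the utf-8 requirement is an assumption taken
    from the discussion, not an existing specification.
    """
    names = []
    # os.fsencode() leaves bytes untouched; with a bytes argument,
    # os.listdir() returns the raw bytes names of the entries.
    for raw in os.listdir(os.fsencode(site_dir)):
        if not raw.endswith(b".dist-info"):
            continue
        try:
            names.append(raw.decode("utf-8"))
        except UnicodeDecodeError as exc:
            # Report the problem instead of guessing another encoding.
            raise ValueError(
                "metadata directory name is not utf-8: %r" % (raw,)
            ) from exc
    return sorted(names)
```

Running the unittests for something like this with LC_ALL=C is one way to
check that the result doesn't depend on the locale.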

> I will keep the - and document the - to _ folding convention. - turns into _
> when going into a filename, and _ turns back into - when parsed out of a
> filename.
> 
Cool.  Thanks.
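
For the record, here is how I read that folding convention, as a rough
sketch.  The function names are hypothetical, and the
"name-version.dist-info" layout is assumed from the snowman example
above:

```python
def dist_info_dirname(name, version):
    """Build a directory name, folding '-' in the name to '_'."""
    # After folding, the only '-' left before the version is the separator.
    return "%s-%s.dist-info" % (name.replace("-", "_"), version)


def parse_dist_info_dirname(dirname):
    """Split a directory name back into (name, version)."""
    suffix = ".dist-info"
    if not dirname.endswith(suffix):
        raise ValueError("not a .dist-info directory: %r" % (dirname,))
    stem = dirname[:-len(suffix)]
    folded_name, sep, version = stem.partition("-")
    if not sep:
        raise ValueError("no '-' separator in %r" % (dirname,))
    # '_' turns back into '-' when parsed out of the filename.
    return folded_name.replace("_", "-"), version
```

Note that a name which originally contained "_" comes back with "-"
instead, so the two spellings have to be treated as the same name.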

> The alternative to putting the metadata in the filename which btw isn't that
> big of a problem, is to have indexed metadata. IIUC apt-get and yum work this
> way and the filename does not matter at all. The tradeoff is of course that you
> have to generate the index. The simple index is a significant convenience of
> easy_install derived systems.
> 
<nod>.  I've liked the idea of putting metadata about all installed modules
into a separate index.  It makes it possible to write a new import mechanism
that uses the index to load modules more efficiently on systems with
large sys.path's, and it makes multiple versions of a module on a system
easier to implement.
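
As a rough illustration of the lookup side (the index layout and the
locate() name are made up for this sketch, and the version comparison
naively assumes purely numeric, dot-separated versions):

```python
def _version_key(version):
    # Naive assumption: versions look like "1.10.0", digits and dots only.
    return tuple(int(part) for part in version.split("."))


def locate(index, module, version=None):
    """Look up where a module lives without scanning every sys.path entry.

    `index` is a hypothetical mapping of
    {module name: {version string: install path}}.
    """
    versions = index.get(module)
    if not versions:
        return None
    if version is None:
        # No version requested: pick the newest one that is installed.
        version = max(versions, key=_version_key)
    return versions.get(version)
```

With an index like {"snowman": {"1.0.0": "/a", "1.10.0": "/b"}}, asking
for "snowman" without a version picks the 1.10.0 entry, and asking for
"1.0.0" explicitly picks the older one.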

However, there are some things to consider:

* The python module case will be a bit more complex than yum and apt because
  you'll need to keep per-user databases and per-system databases (so that
  there's a place for users to keep the metadata for modules that they
  install into user-writable directories).
* Users will need to run commands to install, update, and remove the
  metadata from those indexes.
* yum also needs to deal with non-utf-8 data.  Some of that is due to
  legacy concerns and some is due to filenames.
  - Legacy: package names, package descriptions, etc, in those worlds can
    contain non-utf8 data because the underlying systems (rpm and dpkg)
    predate unicode.  For package descriptions, I know that yum continues to
    store pure bytes and translate it to a "sensible" representation when it
    loads.  For package names I'm unsure.  The major distributions that yum
    works for specify that package names must be utf-8 so yum may specify
    utf-8.  OTOH, yum is distro agnostic and $random_rpm_from_the_internet
    can still use random bytes in its package name so yum may still have to
    deal with bytes here.
  - filenames: those are still bytes because there's nothing that enforces
    utf-8.  If you're keeping a list of filenames in the metadata, you still
    have to deal with those bytes somehow.  So yum and python packaging
    tools would still have to make decisions about what to do with those.
    For yum, it stores the bytes and has to operate on bytes and convert to
    unicode (as best it can) when displaying data.  python packaging tools
    can take a different path but they will need to make explicit assertions
    about their treatment of encodings to do so.
    + For instance, they could assert that all filenames must be utf-8 --
      anything else is an error and cannot be packaged.
    + A more complex example would be to store utf-8 in internal package
      metadata but have the capability to translate from the user's locale
      settings when reading off the filesystem.  Then create utf-8 filenames
      when writing out.  This gets a bit dodgy since the user can create the
      package, then install it on their system and the installed package
      would fail to find modules because they're no longer in the user's
      locale.
    + A third example which I currently view as un-workable is to read from
      the filesystem with the user's locale.  Store utf-8 in the metadata,
      and then translate to the user's locale on output.  This doesn't work
      for multi-user systems and will fail when a module has characters that
      aren't available in the user's locale.
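
To make the first option concrete, here is a minimal sketch, assuming the
tools see filenames as bytes.  Both function names are hypothetical.  The
second function shows the yum-style fallback of keeping the bytes and
only converting for display:

```python
def check_packageable(raw_filenames):
    """Decode filenames as utf-8; anything else is an error."""
    decoded = []
    for raw in raw_filenames:
        try:
            decoded.append(raw.decode("utf-8"))
        except UnicodeDecodeError as exc:
            raise ValueError(
                "cannot package %r: filename is not utf-8" % (raw,)
            ) from exc
    return decoded


def display_name(raw):
    """Best-effort text for showing a bytes filename to the user.

    Undecodable bytes become U+FFFD, so this is for display only and
    must not be used to open the file again.
    """
    return raw.decode("utf-8", errors="replace")
```

The strict check belongs at packaging time, so that a bad filename is
rejected before it ever reaches the metadata.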

-Toshio