On Mon, Nov 12, 2012 at 02:34:14PM -0500, Daniel Holth wrote:
> Horrifying. All codecs that are not utf-8 should be banned, except on Windows.
<nod> I made that argument on python-dev but it didn't win the necessary people over. I don't recall why, so you'd have to look at the thread to see what's already been argued. (This is assuming that we aren't talking about locale settings in general but only about reading the module filenames off of the filesystem. Banning non-utf-8 locales isn't a good idea since there are areas of the world where utf-8 isn't going to be adopted anytime soon.)
> Or at least warn("Your Unicode is broken"); in fact, just put that in site.py unconditionally.
If python itself adds that to site.py, that would be great. But individual sites adding things to site.py only makes python code written at one site non-portable.
> However, remember that a non-ASCII pypi name ☃ could still be just "import snowman". Only the .dist-info directory ☃-1.0.0.dist-info would necessarily contain the higher Unicode characters.
<nod> I wasn't thinking about that. If you specify that the metadata directories (if they contain unicode characters) must be encoded in utf-8 (or at least, must be in a specific encoding on a specific platform), then that would work. Be sure to specify the encoding and use it explicitly when decoding filenames, though, rather than relying on the implicit decoding that depends on the locale. (I advise having unittests where the locale is set to something non-utf-8 to test this; the C locale works well. Otherwise someone who doesn't remember this conversation will make a mistake someday.) If you rely on the implicit conversion with the locale, you'll eventually end up back in the mess of having bytes that you don't know what to do with.
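To make that concrete, here's a minimal sketch (python 3; the helper name and the name/version split are my illustration, not anyone's actual API) of decoding metadata directory names explicitly:

    import os

    def find_dist_info(path):
        """Yield (name, version) from *.dist-info directory names under path,
        decoding the raw filename bytes explicitly as utf-8 instead of
        letting the locale pick the codec."""
        for raw in os.listdir(os.fsencode(path)):   # bytes in, bytes out
            try:
                entry = raw.decode('utf-8')         # explicit, locale-independent
            except UnicodeDecodeError:
                continue                            # not valid utf-8: not our metadata
            if entry.endswith('.dist-info'):
                # e.g. '☃-1.0.0.dist-info' -> ('☃', '1.0.0'), even though
                # the importable module inside may still be plain "snowman"
                name, _, version = entry[:-len('.dist-info')].partition('-')
                yield name, version

Running tests for something like this under LANG=C, as suggested above, catches code paths that accidentally fall back to the locale's implicit decoding.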
> I will keep the '-' and document the '-' to '_' folding convention: '-' turns into '_' when going into a filename, and '_' turns back into '-' when parsed out of a filename.
Cool. Thanks.
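For illustration, a minimal sketch of that folding (the function names are mine):

    def escape_name(name):
        # '-' separates fields in the filename, so fold it to '_' on the way in
        return name.replace('-', '_')

    def unescape_name(part):
        # fold '_' back to '-' when parsing the name out of a filename
        return part.replace('_', '-')

    assert escape_name('my-dist') == 'my_dist'
    assert unescape_name('my_dist') == 'my-dist'

One consequence of the rule as stated: a '_' that was in the original name also comes back as '-', so the round trip normalizes underscores rather than preserving them.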
> The alternative to putting the metadata in the filename (which, btw, isn't that big of a problem) is to have indexed metadata. IIUC apt-get and yum work this way and the filename does not matter at all. The tradeoff is of course that you have to generate the index. The simple index is a significant convenience of easy_install-derived systems.
<nod>. I've liked the idea of putting metadata about all installed modules into a separate index. It makes it possible to write a new import mechanism that uses the index to load modules more efficiently on systems with a large sys.path, and makes multiple versions of a module on a system easier to implement. However, there are some things to consider:

* The python module case will be a bit more complex than yum and apt because you'll need to keep per-user databases as well as per-system databases (so that there's a place for users to keep the metadata for modules that they install into user-writable directories).

* Users will need to run commands to install, update, and remove the metadata from those indexes.

* yum also needs to deal with non-utf-8 data. But some of those problems are due to legacy concerns and others are due to filenames.

  - Legacy: package names, package descriptions, etc, in those worlds can contain non-utf-8 data because the underlying systems (rpm and dpkg) predate unicode. For package descriptions, I know that yum continues to store pure bytes and translates them to a "sensible" representation when it loads. For package names I'm unsure. The major distributions that yum works for specify that package names must be utf-8, so yum may specify utf-8. OTOH, yum is distro agnostic and $random_rpm_from_the_internet can still use random bytes in its package name, so yum may still have to deal with bytes here.

  - Filenames: those are still bytes because there's nothing that enforces utf-8. If you're keeping a list of filenames in the metadata, you still have to deal with those bytes somehow. So yum and python packaging tools would still have to make decisions about what to do with them. yum stores the bytes, operates on bytes, and converts to unicode (as best it can) when displaying data. python packaging tools can take a different path, but they will need to make explicit assertions about their treatment of encodings to do so:

    + For instance, they could assert that all filenames must be utf-8; anything else is an error and cannot be packaged (see the sketch after this list).

    + A more complex example would be to store utf-8 in the internal package metadata but have the capability to translate from the user's locale settings when reading off the filesystem, then create utf-8 filenames when writing out. This gets a bit dodgy since the user could create the package, then install it on their own system, and the installed package would fail to find modules because they're no longer in the user's locale.

    + A third example, which I currently view as unworkable, is to read from the filesystem with the user's locale, store utf-8 in the metadata, and then translate to the user's locale on output. This doesn't work for multi-user systems and will fail when a module has characters that aren't available in the user's locale.
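As a sketch of that first option (the function name is illustrative): the tool takes filenames as bytes straight off the filesystem and treats anything that isn't valid utf-8 as a hard error:

    def decode_filenames_strictly(raw_names):
        """Decode filesystem bytes as utf-8, refusing to package anything
        that doesn't decode cleanly."""
        decoded = []
        for raw in raw_names:
            try:
                decoded.append(raw.decode('utf-8'))
            except UnicodeDecodeError:
                raise ValueError('filename is not utf-8, cannot package: %r'
                                 % (raw,))
        return decoded

-Toshio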