Re: [Python-Dev] Some thoughts on the codecs...

I would suggest adding the DOS, Windows and Macintosh standard 8-bit charsets (their equivalents of latin-1) too, as documents in these encodings are pretty ubiquitous. But maybe these should only be added on the respective platforms.
-- Jack Jansen

Jack Jansen wrote:
Good idea. What code pages would that be?
-- Marc-Andre Lemburg

[Jack Jansen]
Good idea. What code pages would that be ?
I'm not clear on what's being suggested; e.g., Windows supports *many* different "code pages". CP 1252 is default in the U.S., and is an extension of Latin-1. See e.g. which appears to be up-to-date (has 0x80 as the euro symbol, Unicode U+20AC -- although whether your version of U.S. Windows actually has this depends on whether you installed the service pack that added it!). See for the closest DOS got.
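The relationship between CP 1252 and Latin-1 described above can be checked directly. This is a small sketch that assumes the `cp1252` and `latin-1` codecs shipped with a present-day Python, which follow the euro-updated table; it is not something that existed when this message was written.

```python
# Byte 0x80 in CP 1252 is the euro sign (U+20AC) in the updated table.
euro = b"\x80".decode("cp1252")

# Latin-1 maps the same byte to a C1 control character instead.
latin1_0x80 = b"\x80".decode("latin-1")

# In the range 0xA0-0xFF the two charsets agree, which is why CP 1252
# can be described as an extension of Latin-1.
high_range = bytes(range(0xA0, 0x100))
same_high_range = high_range.decode("cp1252") == high_range.decode("latin-1")

print(repr(euro), repr(latin1_0x80), same_high_range)
```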

Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and
[Andy writes:] there [Then Marc replies:]
2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode
[Jack chimes in with:]
[And the conversation twisted around to Greg noting:]
This is leading me to conclude that our "codec registry" should be the file system, and Python modules.

Would it be possible to define a "standard package" called "encodings", and when we need an encoding, we simply attempt to load a module from that package? The key benefits I see are:

* No need to load modules simply to register a codec (which would make the number of open calls even higher, and the startup time even slower). This makes it truly demand-loading of the codecs, rather than explicit load-and-register.
* Making language-specific distributions becomes simple - simply select a different set of modules from the "encodings" directory. The Python source distribution has them all, but (say) the Windows binary installer selects only a few. The Japanese binary installer for Windows installs a few more.
* Installing new codecs becomes trivial - no need to hack etc - simply copy the new "codec module" to the encodings directory and you are done.
* No serious problem for GMcM's installer nor for freeze.

We would probably need to assume that certain codecs exist for _all_ platforms and languages - but this is no different to assuming that "" also exists for all platforms.

Is this worthy of consideration?

Mark.
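A minimal sketch of the "file system as registry" idea, under stated assumptions: the package name `demo_encodings`, the codec module `plain_ascii`, and the helpers `install_codec` and `get_codec_module` are all hypothetical and exist only for this illustration. It shows that installing a codec amounts to copying one module file into the package directory, and that nothing is imported until the codec is requested.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name; the thread proposes calling it "encodings".
PACKAGE = "demo_encodings"

# Hypothetical codec module exposing well-known names "encode" and "decode".
CODEC_SOURCE = '''
def encode(text):
    return text.encode("ascii")

def decode(data):
    return data.decode("ascii")
'''

def install_codec(package_dir, name, source):
    """Installing a codec is just dropping one module file into the package."""
    (package_dir / (name + ".py")).write_text(source)
    importlib.invalidate_caches()

def get_codec_module(name):
    """Load the codec module on demand; no up-front registration step."""
    return importlib.import_module(PACKAGE + "." + name)

# Build a throwaway package directory and make it importable.
root = Path(tempfile.mkdtemp())
package_dir = root / PACKAGE
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
sys.path.insert(0, str(root))

install_codec(package_dir, "plain_ascii", CODEC_SOURCE)
loaded_before = (PACKAGE + ".plain_ascii") in sys.modules  # not imported yet
codec = get_codec_module("plain_ascii")
loaded_after = (PACKAGE + ".plain_ascii") in sys.modules   # imported on first use
```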

On Wed, 17 Nov 1999 08:54:15 +1100, you wrote:
Exactly what I am aiming for. The real icing on the cake would be a small state machine or some helper functions in C which made it possible to write fast codecs in pure Python, but that can come a bit later when we have examples up and running. - Andy

On Wed, 17 Nov 1999, Mark Hammond wrote:
Absolutely!

You will need to provide a way for a module (in the "codec" package) to state *beforehand* that it should be loaded for the X, Y, and Z encodings. This might be in terms of little "info" files that get dropped into the package. The module scans the directory for the info files and loads them to build an encoding => module-name mapping.

The alternative would be to have stub modules like:

    import unicodec

    def encode_1(...):
        ...
    def encode_2(...):
        ...
    ...

    unicodec.register('iso-8859-1', encode_1, decode_1)
    unicodec.register('iso-8859-2', encode_2, decode_2)
    ...

    import iso-8859-1

I believe that encoding names are legitimate file names, but they aren't necessarily Python identifiers. That kind of bungs up "import codec.iso-8859-1". The codec package would need to programmatically import the modules. Clients should not be directly importing the modules, so I don't see a difficulty here.

[ if we do decide to allow clients access to the modules, then maybe they have to arrive through a "helper" module that has a nice name, or the codec package provides a "module = codec.load('iso-8859-1')" idiom. ]

Cheers,
-g

-- Greg Stein
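The encoding => module-name mapping and the programmatic import might look roughly like the sketch below. The table `ENCODING_TO_MODULE` and the function `load` are hypothetical names standing in for whatever the "info" files would build. So that the sketch runs, it points at modules of the `encodings` package found in a present-day Python standard library, which is an assumption of this example and not something the message describes.

```python
import importlib

# Hypothetical table of the kind the "info" files could be scanned into.
# Keys are encoding names (not valid identifiers); values are module names.
ENCODING_TO_MODULE = {
    "iso-8859-1": "latin_1",
    "iso-8859-2": "iso8859_2",
}

def load(encoding, package="encodings"):
    """Import the codec module for an encoding name programmatically."""
    module_name = ENCODING_TO_MODULE[encoding.lower()]
    return importlib.import_module(package + "." + module_name)

# "import codec.iso-8859-1" is a syntax error, but a string lookup is fine.
module = load("ISO-8859-2")
print(module.__name__)
```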

The alternative would be to have stub modules like:
Actually, I was thinking even more radically - drop the codec registry altogether, and use modules with "well-known" names (a slight precedent, but Python isn't averse to well-known names in general), e.g.:

    import unicodec
    def encode(...):
        ...
    def decode(...):
        ...

    from iso-8859-1 import *

The codec registry then is trivial, and effectively does not exist (can't get much more trivial than something that doesn't exist :-):

    def getencoder(encoding):
        mod = __import__( "encodings." + encoding )
        return getattr(mod, "encode")

Agreed - clients should never need to import them, and codecs that wish to import other codecs could use "__import__".

Of course, I am not averse to the idea of a registry as well and having the modules manually register themselves - but it doesn't seem to buy much, and the logic for getting a codec becomes more complex - i.e., it needs to determine the module to import, then look in the registry - if it needs to determine the module anyway, why not just get it from the module and be done with it?

Mark.
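One detail to watch in the `getencoder` idea: `__import__` called with a dotted name returns the top-level package, not the submodule, so the attribute lookup would land on the package. The sketch below uses `importlib.import_module` instead. It assumes the `encodings` package of a present-day Python standard library, whose modules expose a `getregentry()` function and not the module-level `encode` the message imagines, so the lookup is adapted accordingly.

```python
import importlib

# __import__ with a dotted name hands back the top-level package.
top = __import__("encodings.latin_1")

def getencoder(encoding):
    """Return the encode function of the codec module named `encoding`.

    `encoding` must already be a valid module name (e.g. "latin_1").
    """
    mod = importlib.import_module("encodings." + encoding)
    # Assumption: today's stdlib modules publish their functions through
    # getregentry(); the thread's proposal used a module-level "encode".
    return mod.getregentry().encode

encode = getencoder("latin_1")
result = encode("\xe9")  # a (bytes, characters consumed) pair
print(top.__name__, result)
```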

Mark Hammond wrote:
Why not... using the new registry scheme I proposed in the thread "Codecs and StreamCodecs" you could implement this via factory_functions and lazy imports (with the encoding name folded to make up a proper Python identifier, e.g. hyphens get converted to '' and spaces to '_').

I'd suggest grouping encodings:

    [encodings]
        [iso]
            [iso88591]
            [iso88592]
        [jis]
            ...
        [cyrillic]
            ...
        [misc]

The unicodec registry could then query encodings.get(encoding,action) and the package would take care of the rest.

Note that the "walk-me-up-scotty" import patch would probably be nice in this situation too, e.g. to reach the modules in [misc] or in higher levels such as the ones in [iso] from [iso88591].

-- Marc-Andre Lemburg
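The name folding described above is easy to sketch. The function name `fold_encoding_name` is hypothetical, and lowercasing the name is an extra assumption of this example; the message only specifies that hyphens are dropped and spaces become underscores.

```python
def fold_encoding_name(encoding):
    """Fold an encoding name into something usable as a module name."""
    # Lowercasing is an assumption of this sketch, not part of the proposal.
    folded = encoding.lower()
    # Hyphens are removed and spaces become underscores, as in the message.
    return folded.replace("-", "").replace(" ", "_")

names = ["ISO-8859-1", "iso-8859-2", "Shift JIS"]
folded_names = [fold_encoding_name(name) for name in names]
print(folded_names)
```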

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
WHY?!?! This is taking a simple solution and making it complicated. I see no benefit to creating yet-another-level-of-hierarchy. Why should they be grouped?

Leave the modules just under "encodings" and be done with it.

Cheers,
-g

-- Greg Stein

Agreed. Tim Peters once remarked that Python likes shallow encodings (or perhaps that *I* like them :-). This is one such case where I would strongly urge for the simplicity of a shallow hierarchy.
--Guido van Rossum

Greg Stein wrote:
Never mind, was just an idea...
-- Marc-Andre Lemburg

participants (7)
Greg Stein
Guido van Rossum
Jack Jansen
M.-A. Lemburg
Mark Hammond
Tim Peters