Re: [Python-Dev] Some thoughts on the codecs...

I would suggest adding the DOS, Windows and Macintosh standard 8-bit charsets (their equivalents of latin-1) too, as documents in these encodings are pretty ubiquitous. But maybe these should only be added on the respective platforms.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

Jack Jansen wrote:
Good idea. What code pages would that be?
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[Jack Jansen]
[MAL]
> Good idea. What code pages would that be?
I'm not clear on what's being suggested; e.g., Windows supports *many* different "code pages". CP 1252 is default in the U.S., and is an extension of Latin-1. See e.g. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT which appears to be up-to-date (has 0x80 as the euro symbol, Unicode U+20AC -- although whether your version of U.S. Windows actually has this depends on whether you installed the service pack that added it!). See ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT for the closest DOS got.
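The relationship between these code pages can be checked directly. The following is a minimal sketch, assuming a present-day Python 3 interpreter, whose standard library ships codecs under the names "cp1252", "cp850" and "latin-1" (none of which existed when this thread was written).

```python
# Byte 0x80 in CP 1252 is the euro sign, as the mapping table above states.
euro = b"\x80".decode("cp1252")

# CP 1252 agrees with Latin-1 on the upper range 0xA0-0xFF; it differs
# only in 0x80-0x9F, where Latin-1 has control characters.
high = bytes(range(0xA0, 0x100))
same_as_latin1 = high.decode("cp1252") == high.decode("latin-1")

# CP 850 covers much of the same repertoire but at different byte
# positions, so it is not a Latin-1 extension: e-acute is 0x82, not 0xE9.
e_acute_cp850 = "\u00e9".encode("cp850")
e_acute_cp1252 = "\u00e9".encode("cp1252")

print(hex(ord(euro)), same_as_latin1, e_acute_cp850, e_acute_cp1252)
```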

Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and
[Andy writes:] there [Then Marc replies:]
2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode
[Jack chimes in with:]
[And the conversation twisted around to Greg noting:]
This is leading me to conclude that our "codec registry" should be the file system, and Python modules.

Would it be possible to define a "standard package" called "encodings", and when we need an encoding, we simply attempt to load a module from that package? The key benefits I see are:

* No need to load modules simply to register a codec (which would make the number of open calls even higher, and the startup time even slower.) This makes it truly demand-loading of the codecs, rather than explicit load-and-register.
* Making language specific distributions becomes simple - simply select a different set of modules from the "encodings" directory. The Python source distribution has them all, but (say) the Windows binary installer selects only a few. The Japanese binary installer for Windows installs a few more.
* Installing new codecs becomes trivial - no need to hack site.py etc - simply copy the new "codec module" to the encodings directory and you are done.
* No serious problem for GMcM's installer nor for freeze

We would probably need to assume that certain codecs exist for _all_ platforms and languages - but this is no different to assuming that "exceptions.py" also exists for all platforms.

Is this worthy of consideration?

Mark.
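The "file system as registry" idea can be tried out in a few lines. This is only a sketch under stated assumptions: it uses the modern importlib module, a throwaway package called "demo_encodings" (a made-up name, chosen to avoid clashing with any real package), and a toy "upper" codec that exists purely for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the proposed "encodings" directory (hypothetical name).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_encodings"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
sys.path.insert(0, str(root))


def find_codec(name):
    """Import the codec module on first use; return None if it is not installed."""
    try:
        return importlib.import_module("demo_encodings." + name)
    except ImportError:
        return None


# Nothing is registered up front, so an unknown codec is simply absent.
before = find_codec("upper")

# "Installing" a codec is just dropping one file into the directory.
(pkg / "upper.py").write_text(
    "def encode(text):\n"
    "    return text.upper().encode('ascii')\n"
    "def decode(data):\n"
    "    return data.decode('ascii').lower()\n"
)
importlib.invalidate_caches()  # let the import system notice the new file

after = find_codec("upper")
print(before, after.encode("abc"))
```

No module is loaded until a caller asks for it, which is the demand-loading property described in the first bullet.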

On Wed, 17 Nov 1999 08:54:15 +1100, you wrote:
Exactly what I am aiming for. The real icing on the cake would be a small state machine or some helper functions in C which made it possible to write fast codecs in pure Python, but that can come a bit later when we have examples up and running.

- Andy

On Wed, 17 Nov 1999, Mark Hammond wrote:
Absolutely!

You will need to provide a way for a module (in the "codec" package) to state *beforehand* that it should be loaded for the X, Y, and Z encodings. This might be in terms of little "info" files that get dropped into the package. The __init__.py module scans the directory for the info files and loads them to build an encoding => module-name mapping.

The alternative would be to have stub modules like:

iso-8859-1.py:

  import unicodec

  def encode_1(...):
    ...
  def encode_2(...):
    ...
  ...

  unicodec.register('iso-8859-1', encode_1, decode_1)
  unicodec.register('iso-8859-2', encode_2, decode_2)
  ...

iso-8859-2.py:

  import iso-8859-1

I believe that encoding names are legitimate file names, but they aren't necessarily Python identifiers. That kind of bungs up "import codec.iso-8859-1". The codec package would need to programmatically import the modules. Clients should not be directly importing the modules, so I don't see a difficulty here.

[ if we do decide to allow clients access to the modules, then maybe they have to arrive through a "helper" module that has a nice name, or the codec package provides a "module = codec.load('iso-8859-1')" idiom. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
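The "info file" scan could look roughly like this. The thread does not specify a file format, so everything concrete here is an assumption made for illustration: the ".info" suffix, one encoding name per line, and the file's base name standing for the module that implements those encodings.

```python
import tempfile
from pathlib import Path


def build_encoding_map(package_dir):
    """Scan *.info files and return an {encoding name: module name} dict."""
    mapping = {}
    for info in sorted(Path(package_dir).glob("*.info")):
        module_name = info.stem  # "iso8859.info" describes module "iso8859"
        for line in info.read_text().splitlines():
            encoding = line.strip()
            if encoding:  # skip blank lines
                mapping[encoding] = module_name
    return mapping


# Hypothetical package contents: one module serving several encodings.
package_dir = Path(tempfile.mkdtemp())
(package_dir / "iso8859.info").write_text("iso-8859-1\niso-8859-2\n")
(package_dir / "japanese.info").write_text("shift-jis\neuc-jp\n")

encoding_map = build_encoding_map(package_dir)
print(encoding_map)
```

The scan reads only the small info files, so the codec modules themselves stay unloaded until an encoding is actually requested.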

> The alternative would be to have stub modules like:

Actually, I was thinking even more radically - drop the codec registry altogether, and use modules with "well-known" names (a slight precedent, but Python isn't averse to well-known names in general) eg:

iso-8859-1.py:

  import unicodec
  def encode(...):
    ...
  def decode(...):
    ...

iso-8859-2.py:

  from iso-8859-1 import *

The codec registry then is trivial, and effectively does not exist (can't get much more trivial than something that doesn't exist :-):

  def getencoder(encoding):
    # A non-empty fromlist makes __import__ return the submodule itself
    # rather than the top-level "encodings" package.
    mod = __import__( "encodings." + encoding, globals(), locals(), ["encode"] )
    return getattr(mod, "encode")

Agreed - clients should never need to import them, and codecs that wish to import other codecs could use "__import__".

Of course, I am not averse to the idea of a registry as well and having the modules manually register themselves - but it doesn't seem to buy much, and the logic for getting a codec becomes more complex - ie, it needs to determine the module to import, then look in the registry - if it needs to determine the module anyway, why not just get it from the module and be done with it?

Mark.
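A runnable take on the registry-free lookup, as a sketch only. It assumes the modern importlib module and a made-up package name, "wellknown_encodings"; the codec bodies simply delegate to Python's built-in latin-1 support. It also shows two details that matter for this design: a module file whose name contains hyphens can be imported programmatically even though an "import" statement cannot name it, and a bare __import__ of a dotted name returns the top-level package rather than the submodule.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package with one "well-known name" codec module.
root = Path(tempfile.mkdtemp())
pkg = root / "wellknown_encodings"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "iso-8859-1.py").write_text(
    "def encode(text):\n"
    "    return text.encode('latin-1')\n"
    "def decode(data):\n"
    "    return data.decode('latin-1')\n"
)
sys.path.insert(0, str(root))


def getencoder(encoding):
    # import_module returns the submodule itself, hyphens and all.
    mod = importlib.import_module("wellknown_encodings." + encoding)
    return getattr(mod, "encode")


def getdecoder(encoding):
    mod = importlib.import_module("wellknown_encodings." + encoding)
    return getattr(mod, "decode")


encoded = getencoder("iso-8859-1")("caf\u00e9")
decoded = getdecoder("iso-8859-1")(encoded)

# Pitfall: without a fromlist, __import__ hands back the top-level package.
top = __import__("wellknown_encodings.iso-8859-1")
print(encoded, decoded, top.__name__)
```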

Mark Hammond wrote:
Why not... using the new registry scheme I proposed in the thread "Codecs and StreamCodecs" you could implement this via factory_functions and lazy imports (with the encoding name folded to make up a proper Python identifier, e.g. hyphens get converted to '' and spaces to '_').

I'd suggest grouping encodings:

[encodings]
    [iso]
        [iso88591]
        [iso88592]
    [jis]
        ...
    [cyrillic]
        ...
    [misc]

The unicodec registry could then query encodings.get(encoding,action) and the package would take care of the rest.

Note that the "walk-me-up-scotty" import patch would probably be nice in this situation too, e.g. to reach the modules in [misc] or in higher levels such as the ones in [iso] from [iso88591].
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 44 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
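The name folding described above is small enough to sketch. The function name is hypothetical, and lowercasing is an added assumption not stated in the message (the examples given there are already lowercase).

```python
def fold_encoding_name(name):
    """Fold an encoding name into a Python identifier:
    hyphens are dropped and spaces become underscores."""
    # Lowercasing is an assumption added here, not part of the proposal.
    return name.lower().replace("-", "").replace(" ", "_")


folded = [fold_encoding_name(n) for n in ("iso-8859-1", "ISO-8859-2", "Shift JIS")]
print(folded)
```

One consequence of dropping hyphens outright is that distinct spellings such as "iso-8859-1" and "iso8859-1" fold to the same module name, which is convenient for aliases but means two genuinely different encodings must not differ only in hyphen placement.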

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
WHY?!?! This is taking a simple solution and making it complicated. I see no benefit to creating yet-another-level-of-hierarchy. Why should they be grouped?

Leave the modules just under "encodings" and be done with it.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Agreed. Tim Peters once remarked that Python likes shallow encodings (or perhaps that *I* like them :-). This is one such case where I would strongly urge for the simplicity of a shallow hierarchy.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Stein wrote:
Never mind, was just an idea...
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

participants (7):
- andy@robanal.demon.co.uk
- Greg Stein
- Guido van Rossum
- Jack Jansen
- M.-A. Lemburg
- Mark Hammond
- Tim Peters