Some thoughts on the codecs...

1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory.

This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.

What's a good API for real stream conversion? Just Codec.encodeStream(infile, outfile)? Or is it more useful to feed the codec with data a chunk at a time?

2. Data driven codecs

I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?

First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.

Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.

Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0xFE32". In these cases, a script can build a family of related codecs in an auditable manner.

3. What encodings to distribute?

The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?

Thanks,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.
Andy Robinson wrote:
Some thoughts on the codecs...
1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.
What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time?
The idea was to use Unicode as intermediate for all encoding conversions. What you envision here are stream recoders. They can easily be implemented as a useful addition to the Codec subclasses, but I don't think that these have to go into the core.
2. Data driven codecs

I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?
First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.
The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table. Static C data has the advantage of being shareable at the OS level. You can of course implement Python based lookup tables, but these would probably be too large...
Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.
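A minimal sketch of such a state machine; the shift bytes and mapping tables here are invented for illustration, not taken from any real encoding:

```python
# Hypothetical modal decoder: SI/SO bytes switch between a single-byte
# table and a double-byte table. Shift codes and tables are invented.
SI, SO = 0x0F, 0x0E  # shift into single-byte / double-byte mode

def decode_modal(data, single_table, double_table):
    """Decode bytes using a two-mode state machine driven by lookup tables."""
    out = []
    mode = 1              # start in single-byte mode
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            mode = 1
            i += 1
        elif b == SO:
            mode = 2
            i += 1
        elif mode == 1:
            out.append(chr(single_table[b]))
            i += 1
        else:
            # double-byte mode: consume two bytes per character
            out.append(chr(double_table[(data[i] << 8) | data[i + 1]]))
            i += 2
    return ''.join(out)
```

The single-byte encodings fall out as the special case where no shift byte ever occurs and only `single_table` is consulted.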
Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner.
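The range notation might be as simple as this (format and names invented for illustration):

```python
# Hypothetical range-compressed table: (first_byte, last_byte, first_codepoint)
# triples expand into a full 256-entry byte -> code point mapping.
def expand_ranges(ranges, default=0xFFFD):
    """Expand contiguous-range triples into a 256-entry lookup table."""
    table = [default] * 256
    for first, last, start in ranges:
        for b in range(first, last + 1):
            table[b] = start + (b - first)
    return table

# Latin-1 is a single contiguous range onto U+0000..U+00FF...
latin1 = expand_ranges([(0x00, 0xFF, 0x0000)])

# ...and a derived code page is the same table with one patch applied,
# in the spirit of the "extra smiley" example (byte and code point invented).
cp_derived = latin1[:]
cp_derived[0xFE] = 0x263A
```

A build script applying such patches is exactly the auditable family-of-codecs construction described above.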
These are all great ideas, but I think they unnecessarily complicate the proposal.
3. What encodings to distribute?

The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?
Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds.

I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:

'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'latin-1': Latin-1 codepage
'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Python Lib
'jis' (a popular version XXX): Japanese character encoding
'unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package.

-- Marc-Andre Lemburg
______________________________________________________________________
Y2000: 46 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Andy Robinson wrote:
Some thoughts on the codecs...
1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.
What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time?
M.-A. Lemburg responds:
The idea was to use Unicode as intermediate for all encoding conversions.
What you envision here are stream recoders. They can easily be implemented as a useful addition to the Codec subclasses, but I don't think that these have to go into the core.
What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to efficiently handle shift states. This is not exactly what Andy shows, but it's not what Marc's current spec has either.

I had thought something more like what Java does: an output stream codec's constructor takes a writable file object, and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequences. Similar for input stream codecs.

Andy's file translation example could then be written as follows:

# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)

g1.close()
f1.close()

Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.)
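As a sketch, the buffered writer described above might look like this; the class shape and buffer size are guesses, and the encoding step is plain str.encode rather than a registered codec tracking shift state:

```python
# Sketch of a Java-style buffering stream writer. A real stateful codec
# would carry shift state across write() calls; here the buffering is
# what lets it emit the minimal number of shift sequences.
class StreamWriter:
    def __init__(self, file, encoding):
        self.file = file          # underlying binary file object
        self.encoding = encoding
        self.buffer = ''          # pending unencoded text

    def write(self, text):
        # Buffer text so adjacent writes can share shift sequences.
        self.buffer += text
        if len(self.buffer) >= 4096:
            self.flush()

    def flush(self):
        self.file.write(self.buffer.encode(self.encoding))
        self.buffer = ''

    def close(self):
        self.flush()
```

A matching StreamReader would buffer raw bytes instead, so that a multi-byte sequence split across read boundaries is never truncated.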
2. Data driven codecs

I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?
First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.
The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table.
Static C data has the advantage of being shareable at the OS level.
Don't worry about it. 128K is too small to care, I think...
You can of course implement Python based lookup tables, but these would probably be too large...
Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.
Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner.
These are all great ideas, but I think they unnecessarily complicate the proposal.
Agreed, let's leave the *implementation* of codecs out of the current efforts. However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvalds' philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.)
3. What encodings to distribute?

The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?
Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds.
(Do you think you'll be able to extort some money from HP for these? :-)
I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:
'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'latin-1': Latin-1 codepage
'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Python Lib
'jis' (a popular version XXX): Japanese character encoding
'unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python
Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package.
I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation.

And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
|
| u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation? Since encoding names are not always legal Python identifiers (most contain hyphens), I don't understand what you really meant here. Do you mean to say that it has to be a keyword argument? I would disagree; and then I would have expected the notation [,encoding=<default encoding>].

| With the 'unicode-escape' encoding being defined as:
|
| u = u'<unicode-escape encoded Python string>'
|
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
|
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-). Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree.

While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named François might write a file containing this statement:

print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage.

Using the new Unicode strings, François could change his program as follows:

print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again see garbage. If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis".

What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type

print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)
On Mon, 15 Nov 1999 16:37:28 -0500, you wrote:
# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)

g1.close()
f1.close()
Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.)

Perfect. I'd keep the string ones - easy to implement but a big convenience.
The proposal also says:
For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object):
import unicodec
file = open('mytext.txt','rb')
ufile = unicodec.stream(file,'utf-16')
u = ufile.read()
...
ufile.close()
It seems to me that if we go for stream_reader, it replaces this bit of the proposal too - no need for unicodec to provide anything. If you want to have a convenience function there to save a line or two, you could have unicodec.open(filename, mode, encoding) which returned a stream_reader. - Andy
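Such a convenience function need not be much more than the following sketch; the registry layout and the stream_reader/stream_writer names are assumptions standing in for whatever unicodec actually provides:

```python
import io

# Hypothetical stand-in for the unicodec codec registry: each entry
# supplies stream_reader/stream_writer factories taking a binary file.
codecs_registry = {
    'latin-1': {
        'stream_reader': lambda f: io.TextIOWrapper(f, encoding='latin-1'),
        'stream_writer': lambda f: io.TextIOWrapper(f, encoding='latin-1'),
    },
}

def uopen(filename, mode, encoding):
    """One-line opener: returns a stream reader or writer for the encoding."""
    if 'b' not in mode:
        mode += 'b'                      # the raw file must be binary
    f = open(filename, mode)
    entry = codecs_registry[encoding]
    if 'r' in mode:
        return entry['stream_reader'](f)
    return entry['stream_writer'](f)
```

The 'b'-forcing mirrors the XXX note in the proposal about making sure <mode> contains the 'b' character when needed.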
[I'll get back on this tomorrow, just some quick notes here...] Guido van Rossum wrote:
Andy Robinson wrote:
Some thoughts on the codecs...
1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.
What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time?
M.-A. Lemburg responds:
The idea was to use Unicode as intermediate for all encoding conversions.
What you envision here are stream recoders. They can easily be implemented as a useful addition to the Codec subclasses, but I don't think that these have to go into the core.
What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to efficiently handle shift states. This is not exactly what Andy shows, but it's not what Marc's current spec has either.
I had thought something more like what Java does: an output stream codec's constructor takes a writable file object, and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequences. Similar for input stream codecs.
The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets. The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work.
From the proposal:

"""
For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object):

import unicodec
file = open('mytext.txt','rb')
ufile = unicodec.stream(file,'utf-16')
u = ufile.read()
...
ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)... Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here.
"""
Andy's file translation example could then be written as follows:
# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)

g1.close()
f1.close()
Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.)
You wouldn't want to go via cStringIO for *every* encoding translation. The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects).
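The two pairs might relate as in this sketch, using Latin-1 as the stateless example; the method names are illustrative, not the proposal's spec:

```python
# Sketch of the two method pairs: internal (string <-> unicode) and
# external (stream <-> unicode). A stateless codec can build the external
# pair on the internal one; a stateful codec would implement the external
# pair directly against the stream to preserve shift state.
class LatinOneCodec:
    # internal pair
    def encode(self, u):
        return u.encode('latin-1')

    def decode(self, s):
        return s.decode('latin-1')

    # external pair
    def write(self, stream, u):
        stream.write(self.encode(u))

    def read(self, stream, size=-1):
        return self.decode(stream.read(size))
```

This is why the string pair need not route through cStringIO: for stateless encodings the direct conversion is one call.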
2. Data driven codecs

I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?
First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.
The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table.
Static C data has the advantage of being shareable at the OS level.
Don't worry about it. 128K is too small to care, I think...
Huh ? 128K for every process using Python ? That quickly sums up to lots of megabytes lying around pretty much unused.
You can of course implement Python based lookup tables, but these would probably be too large...
Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.
Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner.
These are all great ideas, but I think they unnecessarily complicate the proposal.
Agreed, let's leave the *implementation* of codecs out of the current efforts.
However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvalds' philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.)
3. What encodings to distribute?

The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?
Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds.
(Do you think you'll be able to extort some money from HP for these? :-)
Don't know, it depends on what their specs look like. I use mxTextTools for fast HTML file processing. It uses a small Turing machine with some extra magic and is programmable via Python tuples.
I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:
'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'latin-1': Latin-1 codepage
'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Python Lib
'jis' (a popular version XXX): Japanese character encoding
'unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python
Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package.
I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?)
Ok.
For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation.
And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here:
| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
|
| u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I meant this as an optional second argument defaulting to whatever we define <default encoding> to mean, e.g. 'utf-8':

u = unicode("string","utf-8") == unicode("string")

The <encoding name> argument must be a string identifying one of the registered codecs.
| With the 'unicode-escape' encoding being defined as:
|
| u = u'<unicode-escape encoded Python string>'
|
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
|
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
|   instead, e.g. \u03C0 to represent the character Pi.
I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-).
The first bullet covers the normal Python string characters and escapes, e.g. \n and \267 (the center dot ;-), while the second explains how \uXXXX is interpreted.
Can you give some examples?
Is u'\u0020' different from u'\x20' (a space)?
No, they both map to the same Unicode ordinal.
Does '\u0020' (no u prefix) have a meaning?
No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding:

u = u'\u0020' == unicode(r'\u0020','unicode-escape')

Note that writing \uXX is an error, e.g. u"\u12 " will cause a syntax error.

Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended?
Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree.
This can be had via unicode():

u = unicode(r'\a\b\c\u0020','unicode-escape')

If that's too long, define a ur() function which wraps up the above line in a function.
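In today's Python the same helper can be sketched with the unicode-escape codec directly; the name ur() just follows the suggestion above:

```python
import codecs

def ur(s):
    # Hypothetical ur() helper: apply \uXXXX (and other backslash)
    # escapes to a string that was written in raw form.
    return codecs.decode(s, 'unicode_escape')

# The raw literal keeps the backslashes; ur() interprets them.
pi = ur(r'\u03C0')
```

Whether ur"..." deserves literal syntax rather than a one-line function is exactly the open question here.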
While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named François might write a file containing this statement:
print "Written by François." # (There's a cedilla in there!)
(He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.)
If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage.
Using the new Unicode strings, François could change his program as follows:
print unicode("Written by François.", "latin-1")
Assuming that François sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
But when the Japanese user views François' source file, he will again see garbage. If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis".
What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type
print u"Written by Fran\u00E7ois."
but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either).
I think best is to leave it undefined... as with all files, only the programmer knows what format and encoding it contains, e.g. a Japanese programmer might want to use a shift-JIS editor to enter strings directly in shift-JIS via

u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but Python will continue to produce the intended string. NLS strings don't belong in program text anyway: i18n usually takes the gettext() approach to handle these issues.

-- Marc-Andre Lemburg
On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
[I'll get back on this tomorrow, just some quick notes here...] The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets.
The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work.
XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.
The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects).
That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc.

So the codec should provide proper stream reading and writing services. Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility.

- Andy
Andy Robinson wrote:
On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
[I'll get back on this tomorrow, just some quick notes here...] The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets.
The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work.
XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.
The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects).
That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc.
So the codec should provide proper stream reading and writing services.
I guess I'll have to rethink the Codec specs. Some leads:

1. introduce a new StreamCodec class which is designed for handling stream encoding and decoding (and supports state)

2. give more information to the unicodec registry: one could register classes instead of instances, which the Unicode implementation would then instantiate whenever it needs to apply the conversion; since this is only needed for encodings maintaining state, the registry would only have to do the instantiation for these codecs and could use cached instances for stateless codecs.
Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility.
-- Marc-Andre Lemburg
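The need for per-stream state behind the StreamCodec idea is easy to demonstrate with the modern codecs module (whose incremental decoders are, in effect, that design): a multi-byte sequence split across two chunks decodes correctly only because the decoder buffers the partial character between calls.

```python
import codecs

# A multi-byte character split across chunk boundaries: a stateless
# per-chunk decode would choke on the dangling lead byte, but an
# incremental (stateful) decoder carries it over to the next call.
dec = codecs.getincrementaldecoder("utf-8")()

data = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
first, second = data[:4], data[4:]   # split inside the 2-byte e-acute

out = dec.decode(first) + dec.decode(second, final=True)
assert out == "caf\u00e9"
```

A modal encoding with shift sequences needs even more state than this (the current shift mode), which is exactly why slicing a stream into independently decoded strings cannot work in general.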
[Guido]
Does '\u0020' (no u prefix) have a meaning?
[MAL]
No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding:
I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals.
Note that writing \uXX is an error, e.g. u"\u12 " will cause a syntax error.
Although I believe your intent <wink> is that, just as today, '\u12' is not an error.
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ?
Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two.
This [raw Unicode strings] can be had via unicode():
u = unicode(r'\a\b\c\u0020','unicode-escaped')
If that's too long, define a ur() function which wraps up the above line in a function.
As before, I think that's fine for now, but won't stand forever.
Tim Peters wrote:
[Guido]
Does '\u0020' (no u prefix) have a meaning?
[MAL]
No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding:
I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals.
Right.
Note that writing \uXX is an error, e.g. u"\u12 " will cause a syntax error.
Although I believe your intent <wink> is that, just as today, '\u12' is not an error.
Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an exception.
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ?
Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two.
Strange definition...
This [raw Unicode strings] can be had via unicode():
u = unicode(r'\a\b\c\u0020','unicode-escaped')
If that's too long, define a ur() function which wraps up the above line in a function.
As before, I think that's fine for now, but won't stand forever.
If Guido agrees to ur"", I can put that into the proposal too -- it's just that things are starting to get a little crowded for a strawman proposal ;-)

-- Marc-Andre Lemburg
[Guido]
... While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. ... What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type
print u"Written by Fran\u00E7ois."
but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either).
So long as Python opens source files using libc text mode, it can't guarantee more than C does: the presence of any character other than tab, newline, and ASCII 32-126 inclusive renders the file contents undefined. Go beyond that, and you've got the same problem as mailers and browsers, and so also the same solution: open source files in binary mode, and add a pragma specifying the intended charset. As a practical matter, declare that Python source is Latin-1 for now, and declare any *system* that doesn't support that non-conforming <wink>. python-is-the-measure-of-all-things-ly y'rs - tim
Guido van Rossum
I had thought something more like what Java does: an output stream codec's constructor takes a writable file object and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequences. Similar for input stream codecs.
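Guido's description maps onto a small class. This is only a sketch built on the modern codecs incremental-encoder API (EncodedWriter is an illustrative name, not a proposed one); the iso-2022-jp example shows the shift sequences being deferred until flush, which is the "minimal number of shift sequences" point:

```python
import codecs
import io

class EncodedWriter:
    """Sketch of a Java-style output stream codec: wraps a writable
    binary file object and encodes incrementally, so a modal encoding
    emits the minimal number of shift sequences."""
    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoder = codecs.getincrementalencoder(encoding)()
    def write(self, text):
        self.stream.write(self.encoder.encode(text))
    def flush(self):
        # Finalize any pending shift state (a real implementation
        # would also reset the encoder so writing can continue).
        self.stream.write(self.encoder.encode("", final=True))
        self.stream.flush()
    def close(self):
        self.flush()
        self.stream.close()

buf = io.BytesIO()
w = EncodedWriter(buf, "iso-2022-jp")
w.write("\u3042")   # HIRAGANA LETTER A: needs shift-in/shift-out escapes
w.flush()
assert buf.getvalue().decode("iso-2022-jp") == "\u3042"
```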
Note that the html/sgml/xml parsers generally support the feed/close protocol. To be able to use these codecs in that context, we need

1) codecs written according to the "data consumer model", instead of the "stream" model:

    class myDecoder:
        def __init__(self, target):
            self.target = target
            self.state = ...
        def feed(self, data):
            ... extract as much data as possible ...
            self.target.feed(extracted data)
        def close(self):
            ... extract what's left ...
            self.target.feed(additional data)
            self.target.close()

or 2) make threads mandatory, just like in Java.

or 3) add light-weight threads (a la stackless python) to the interpreter...

(I vote for alternative 3, but that's another story ;-)

</F>
On Tue, 16 Nov 1999 09:39:20 +0100, you wrote:
1) codecs written according to the "data consumer model", instead of the "stream" model.
    class myDecoder:
        def __init__(self, target):
            self.target = target
            self.state = ...
        def feed(self, data):
            ... extract as much data as possible ...
            self.target.feed(extracted data)
        def close(self):
            ... extract what's left ...
            self.target.feed(additional data)
            self.target.close()
Apart from feed() instead of write(), how is that different from a Java-like Stream writer as Guido suggested? He said:
Andy's file translation example could then be written as follows:
    # assuming variables input_file, input_encoding, output_file,
    # output_encoding, and constant BUFFER_SIZE

    f = open(input_file, "rb")
    f1 = unicodec.codecs[input_encoding].stream_reader(f)
    g = open(output_file, "wb")
    g1 = unicodec.codecs[output_encoding].stream_writer(g)

    while 1:
        buffer = f1.read(BUFFER_SIZE)
        if not buffer:
            break
        g1.write(buffer)

    g1.close()
    f1.close()
Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.)
- Andy
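A sketch of Guido's last point above, using today's spellings (io.BytesIO where the message says cStringIO, codecs.getreader/getwriter standing in for the hypothetical stream_reader/stream_writer): given only stream codecs, the in-memory string <-> unicode conversions fall out of the stream API for free.

```python
import codecs
import io

def decode_via_stream(raw_bytes, encoding):
    # Wrap a memory "file" with the codec's stream reader.
    reader = codecs.getreader(encoding)(io.BytesIO(raw_bytes))
    return reader.read()

def encode_via_stream(text, encoding):
    # Same trick in the other direction.
    buf = io.BytesIO()
    codecs.getwriter(encoding)(buf).write(text)
    return buf.getvalue()
```

This is why the stream pair could in principle be the only required codec API, with the string conversions provided generically on top.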
On Mon, 15 Nov 1999, Guido van Rossum wrote:
...
The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table.
Static C data has the advantage of being shareable at the OS level.
Don't worry about it. 128K is too small to care, I think...
This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in.

Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ]

Cheers,
-g

-- Greg Stein, http://www.lyra.org/
On 16 November 1999, Greg Stein said:
This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in.
Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ]
I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter:

    $ truss perl -e 1 2> perl.log
    $ truss python -c 1 2> python.log

(This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.) The results are interesting, and useful despite the platform and version disparities.

(For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.)

Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them.

First, the size of log files (in lines), i.e. number of system calls:

               Solaris   Linux   IRIX[1]
    Perl          88       85      70
    Python       425      316     257

    [1] after chopping off the summary counts from the "par" output --
        ie. these really are the number of system calls, not the
        number of lines in the log files

Next, the number of "open" calls:

               Solaris   Linux   IRIX
    Perl          16       10       9
    Python       107       71      48

(It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.)

And the number of unsuccessful "open" calls:

               Solaris   Linux   IRIX
    Perl           6        1       3
    Python        77       49      32

Number of "mmap" calls:

               Solaris   Linux   IRIX
    Perl          25       25       1
    Python        36       24       1

...nope, guess we can't blame mmap for any Perl/Python startup disparity.
How about "brk":

               Solaris   Linux   IRIX
    Perl           6       11      12
    Python        47       39      25

...ok, looks like Greg's gripe about memory holds some water.

Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces the startup overhead as measured by "number of system calls". Some quick timing experiments show a drastic speedup (in wall-clock time) by adding "-S": about 37% faster under Solaris, 56% faster under Linux, and 35% under IRIX. These figures should be taken with a large grain of salt, as the Linux and IRIX systems were fairly well loaded at the time, and the wall-clock results I measured had huge variance. Still, it gets the point across.

Oh, also for the record, all timings were done like:

    perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }'

because I wanted to guarantee no shell was involved in the Python startup.

Greg

-- Greg Ward - software developer      gward@cnri.reston.va.us
Corporation for National Research Initiatives
1895 Preston White Drive               voice: +1-703-620-8990
Reston, Virginia, USA  20191-5434      fax: +1-703-620-0913
Greg Ward writes:
Next, the number of "open" calls:

               Solaris   Linux   IRIX
    Perl          16       10       9
    Python       107       71      48
Running 'python -v' explains this:

    amarok akuchlin>python -v
    # /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py
    import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc
    # /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py
    import site # precompiled from /usr/local/lib/python1.5/site.pyc
    # /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py
    import os # precompiled from /usr/local/lib/python1.5/os.pyc
    import posix # builtin
    # /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py
    import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc
    # /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py
    import stat # precompiled from /usr/local/lib/python1.5/stat.pyc
    # /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py
    import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc
    Python 1.5.2 (#80, May 25 1999, 18:06:07)  [GCC 2.8.1] on sunos5
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so

And each import tries several different forms of the module name:

    stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0)         Err#2 ENOENT
    open("/usr/local/lib/python1.5/os.so", O_RDONLY)        Err#2 ENOENT
    open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY)  Err#2 ENOENT
    open("/usr/local/lib/python1.5/os.py", O_RDONLY)        = 4

I don't see how this is fixable, unless we strip down site.py, which drags in os, which drags in os.path and stat and UserDict.

-- A.M. Kuchling    http://starship.python.net/crew/amk/
I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but otherwise I'm just peachy. -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks"
"AMK" == Andrew M Kuchling
writes:
AMK> I don't see how this is fixable, unless we strip down
AMK> site.py, which drags in os, which drags in os.path and stat
AMK> and UserDict.

One approach might be to support loading modules out of jar files (or whatever) using Greg's imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S, might make Python start up much faster, at the small cost of some flexibility (which could be regained with a command-line switch or other mechanism to bypass modules.jar).

-Barry
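Python did eventually grow exactly this mechanism (zipimport). As a sketch of the idea with current tools - the archive name and module contents are made up for the example:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny "modules.jar": a zip archive holding a bootstrap module.
archive = os.path.join(tempfile.mkdtemp(), "modules.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("bootmod.py", "GREETING = 'hello from the archive'\n")

# Teach the import system to look in the archive first.
sys.path.insert(0, archive)
import bootmod

print(bootmod.GREETING)
```

One open() on the archive replaces the per-module probing, which is where the startup win comes from.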
"AMK" == Andrew M Kuchling
writes:

AMK> I don't see how this is fixable, unless we strip down
AMK> site.py, which drags in os, which drags in os.path and stat
AMK> and UserDict.
One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar).
A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference. The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users... But if those can be solved, this could greatly reduce the number of stats and unsuccessful opens, without having to resort to jar files. --Guido van Rossum (home page: http://www.python.org/~guido/)
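A minimal sketch of the dbm idea (the cache location, probe order, and function name are illustrative; the naive staleness check is exactly the hard part Guido warns about):

```python
import dbm
import os
import tempfile

CACHE_PATH = os.path.join(tempfile.gettempdir(), "modcache")

def find_module(name, search_path):
    """Map a module name to a file name, consulting a dbm cache first
    so repeated startups skip the expensive directory probing."""
    with dbm.open(CACHE_PATH, "c") as cache:
        key = name.encode()
        if key in cache:
            cached = cache[key].decode()
            if os.path.exists(cached):   # naive staleness guard
                return cached
        for directory in search_path:    # the expensive probe
            for suffix in (".py", ".pyc"):
                candidate = os.path.join(directory, name + suffix)
                if os.path.exists(candidate):
                    cache[key] = candidate.encode()
                    return candidate
    return None
```

A real version would also have to cope with concurrent writers and cache invalidation when files move, which is where the scheme gets hard.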
On 16 November 1999, Guido van Rossum said:
A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference.
Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening?
The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users...
Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm... Greg
Greg Ward [gward@cnri.reston.va.us] wrote:
On 16 November 1999, Guido van Rossum said:
A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference.
Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening?
Wow, that's a huge start up that I've personally never seen. I can't imagine... even loading the Oracle libraries dynamically, which are HUGE (2Mb or so), it's only a couple seconds.
The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users...
Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm...
This doesn't resolve a lot of other uses of Python, however... and Zope would always benefit, especially when you're running multiple instances on the same machine... would perhaps share more code.

Chris

-- 
| Christopher Petrilli
| petrilli@amber.org
Barry A. Warsaw writes:
One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar).
Couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix, (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over. but-since-its-WIndows-it-must-be-tainted-ly y'rs - Gordon
On Tue, 16 Nov 1999, Gordon McMillan wrote:
Barry A. Warsaw writes:
One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar).
Couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix, (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over.
I've got a patch from Jim Ahlstrom to provide a "standardized" library file. I've got to review and fold that thing in (I'll post here when that is done). As Gordon states: yes, the startup time is considerably improved.

The DBM approach is interesting. That could definitely be used thru an imputils Importer; it would be quite interesting to try that out. (Note that the library style approach would be even harder to deal with updates, relative to what Sjoerd saw with the DBM approach; I would guess that the "right" approach is to rebuild the library from scratch and atomically replace the thing (but that would bust people with open references...))

Certainly something to look at.

Cheers,
-g

p.s. I also want to try mmap'ing a library and creating code objects that use PyBufferObjects (rather than PyStringObjects) that refer to portions of the mmap. Presuming the mmap is shared, there "should" be a large reduction in heap usage. Question is that I don't know the proportion of code bytes to other heap usage caused by loading a .pyc.

p.p.s. I also want to try the buffer approach for frozen code.

-- Greg Stein, http://www.lyra.org/
[Gordon McMillan]
... Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over.
That's truly remarkable (he says while waiting for the Inbox Repair Tool to finish repairing his 50Mb Outlook mail file ...)!
but-since-its-WIndows-it-must-be-tainted-ly y'rs
Indeed -- if it runs on Windows, it's a worthless piece o' crap <wink>.
Greg Ward wrote:
Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ]
I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter:
    $ truss perl -e 1 2> perl.log
    $ truss python -c 1 2> python.log
(This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.) The results are interesting, and useful despite the platform and version disparities.
(For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.)
Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them.
For kicks I've done a similar test with cgipython, the one file version of Python 1.5.2:
First, the size of log files (in lines), i.e. number of system calls:
               Solaris   Linux   IRIX[1]
    Perl          88       85      70
    Python       425      316     257
    cgipython    182
[1] after chopping off the summary counts from the "par" output -- ie. these really are the number of system calls, not the number of lines in the log files
Next, the number of "open" calls:
               Solaris   Linux   IRIX
    Perl          16       10       9
    Python       107       71      48
    cgipython     33
(It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.)
And the number of unsuccessful "open" calls:
               Solaris   Linux   IRIX
    Perl           6        1       3
    Python        77       49      32
    cgipython     28

Note that cgipython does search for sitecustomize.py.
Number of "mmap" calls:
               Solaris   Linux   IRIX
    Perl          25       25       1
    Python        36       24       1
    cgipython     13
...nope, guess we can't blame mmap for any Perl/Python startup disparity.
How about "brk":
               Solaris   Linux   IRIX
    Perl           6       11      12
    Python        47       39      25

    cgipython     41 (?)

So at least in theory, using cgipython for the intended purpose should gain some performance.

-- Marc-Andre Lemburg
On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:
These are all great ideas, but I think they unnecessarily complicate the proposal.
However, to claim that Python is properly internationalized, we will need a large number of multi-byte encodings to be available. It's a large amount of work, it must be provably correct, and someone's going to have to do it - so if anyone with more C expertise than me - not hard :-) - is interested, talk to me.

I'm not suggesting putting my points in the Unicode proposal - in fact, I'm very happy we have a proposal which allows for extension, and lets us work on the encodings separately (and later).
Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds.

Exactly my thoughts, although I was thinking of a more slimmed-down and specialized one. The right tool might be usable for things like compression algorithms too. Separate project to the Unicode stuff, but if anyone is interested, talk to me.
I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:
    'utf-8':           8-bit variable length encoding
    'utf-16':          16-bit variable length encoding (little/big endian)
    'utf-16-le':       utf-16 but explicitly little endian
    'utf-16-be':       utf-16 but explicitly big endian
    'ascii':           7-bit ASCII codepage
    'latin-1':         Latin-1 codepage
    'html-entities':   Latin-1 + HTML entities; see htmlentitydefs.py
                       from the standard Python Lib
    'jis' (a popular version XXX):
                       Japanese character encoding
    'unicode-escape':  See Unicode Constructors for a definition
    'native':          Dump of the Internal Format used by Python
Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core. Ditto with HTML, and maybe even escaped-unicode too. In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully. - Andy
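"Algorithmic and can be small and fast" is easy to see for UTF-8: it needs no mapping tables at all, just bit-shuffling. A minimal encoder for code points up to U+FFFF (ignoring surrogates and error handling) is only a sketch, but it makes the point:

```python
def utf8_encode(codepoints):
    # Minimal UTF-8 encoder for code points up to U+FFFF:
    # 1 byte for ASCII, 2 bytes below U+0800, 3 bytes otherwise.
    out = []
    for cp in codepoints:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)

# 'A', e-acute, HIRAGANA A: the 1-, 2- and 3-byte cases.
assert utf8_encode([0x41, 0xE9, 0x3042]) == "A\u00e9\u3042".encode("utf-8")
```

Compare that with Shift-JIS or EUC-JP, which need large lookup tables and, in the modal JIS case, escape-sequence state on top.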
In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully.
Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture! Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.) --Guido van Rossum (home page: http://www.python.org/~guido/)
On Mon, 15 Nov 1999 16:49:26 -0500, you wrote:
In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully.
Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture!
Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.)
No, I don't think it is good enough. We need a stream codec, and as you said the string and file interfaces can be built out of that. You guys will know better than me what the best patterns for that are... - Andy
Andy Robinson wrote:
Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core.
Ditto with HTML, and maybe even escaped-unicode too.
So I can drop JIS? [I won't be able to drop the escaped unicode codec because this is needed for u"" and ur"".]

-- Marc-Andre Lemburg
I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:
    'utf-8':           8-bit variable length encoding
    'utf-16':          16-bit variable length encoding (little/big endian)
    'utf-16-le':       utf-16 but explicitly little endian
    'utf-16-be':       utf-16 but explicitly big endian
    'ascii':           7-bit ASCII codepage
    'latin-1':         Latin-1 codepage
    'html-entities':   Latin-1 + HTML entities; see htmlentitydefs.py
                       from the standard Python Lib
    'jis' (a popular version XXX):
                       Japanese character encoding
    'unicode-escape':  See Unicode Constructors for a definition
    'native':          Dump of the Internal Format used by Python
Since this is already very close, maybe we could adopt the naming guidelines from XML:

    In an encoding declaration, the values "UTF-8", "UTF-16",
    "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the
    various encodings and transformations of Unicode/ISO/IEC 10646,
    the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be
    used for the parts of ISO 8859, and the values "ISO-2022-JP",
    "Shift_JIS", and "EUC-JP" should be used for the various encoded
    forms of JIS X-0208-1997.

    XML processors may recognize other encodings; it is recommended
    that character encodings registered (as charsets) with the
    Internet Assigned Numbers Authority [IANA], other than those just
    listed, should be referred to using their registered names.

    Note that these registered names are defined to be
    case-insensitive, so processors wishing to match against them
    should do so in a case-insensitive way.

(ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...).

</F>
On Tue, 16 Nov 1999, Fredrik Lundh wrote:
... since this is already very close, maybe we could adopt the naming guidelines from XML:
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997.
XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names.
Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way.
(ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...).
+1 (as we'd say in Apache-land... :-) -g -- Greg Stein, http://www.lyra.org/
Fredrik Lundh wrote:
I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:
'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'latin-1': Latin-1 codepage
'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Python Lib
'jis' (a popular version XXX): Japanese character encoding
'unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python
since this is already very close, maybe we could adopt the naming guidelines from XML:
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997.
XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names.
Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way.
(ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...).
From the proposal:

"""
General Remarks:

· Unicode encoding names should be lower case on output and case-insensitive on input (they will be converted to lower case by all APIs taking an encoding name as input). Encoding names should follow the name conventions as used by the Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'.
"""

Is there a naming scheme definition for these encoding names? (The quote you gave above doesn't really sound like a definition to me.)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
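The normalization convention quoted from the proposal (lower-case on input, spaces converted to hyphens) is simple enough to state as a one-line rule. A hypothetical helper, not part of the proposal itself, might look like this:

```python
def normalize_encoding(name):
    """Normalize an encoding name per the proposal's convention:
    convert to lower case and turn spaces into hyphens, so that
    'UTF 16' and 'utf-16' name the same encoding.
    """
    return name.strip().lower().replace(" ", "-")
```

Every API taking an encoding name would run it through such a function before any registry lookup, making the lookup itself case- and space-insensitive.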
participants (12)

- Andrew M. Kuchling
- Andy Robinson
- andy@robanal.demon.co.uk
- Barry A. Warsaw
- Christopher Petrilli
- Fredrik Lundh
- Gordon McMillan
- Greg Stein
- Greg Ward
- Guido van Rossum
- M.-A. Lemburg
- Tim Peters