
Some thoughts on the codecs...

1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory.

This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings.

What's a good API for real stream conversion? Just Codec.encodeStream(infile, outfile)? Or is it more useful to feed the codec with data a chunk at a time?

2. Data driven codecs

I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?

First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.

Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.

Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0xFE32". In these cases, a script can build a family of related codecs in an auditable manner.

3. What encodings to distribute?

The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?

Thanks,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.
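A rough sketch of what the range-compressed, data-driven table idea could look like for a single-byte encoding. The table format and every name below are invented purely for illustration; a compiled helper analogous to string.translate() would consume the expanded 256-entry table.

    # Hypothetical compact table: (first byte, last byte, first Unicode ordinal).
    # Later entries override earlier ones, so a derived code page can be
    # expressed as a base table plus a handful of exceptions.
    CP_EXAMPLE_RANGES = [
        (0x00, 0x7F, 0x0000),      # ASCII passes straight through
        (0xA0, 0xFF, 0x00A0),      # upper half maps straight through too
        (0xFE, 0xFE, 0x263A),      # the extra 'smiley' exception, as in Andy's example
    ]

    def build_decoding_table(ranges):
        """Expand the compressed ranges into a 256-entry lookup table."""
        table = [None] * 256               # None marks an unmapped byte
        for low, high, ucs in ranges:
            for byte in range(low, high + 1):
                table[byte] = ucs + (byte - low)
        return table

    def decode(data, table):
        """Decode an 8-bit string into a list of Unicode ordinals."""
        ordinals = []
        for ch in data:
            code = table[ord(ch)]
            if code is None:
                raise ValueError("unmapped byte 0x%02x" % ord(ch))
            ordinals.append(code)
        return ordinals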

Andy Robinson wrote:
The idea was to use Unicode as an intermediate for all encoding conversions. What you envision here are stream recoders. They can easily be implemented as a useful addition to the Codec subclasses, but I don't think that these have to go into the core.
The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table. Static C data has the advantage of being shareable at the OS level. You can of course implement Python based lookup tables, but these should not be too large...
These are all great ideas, but I think they unnecessarily complicate the proposal.
Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds.

I would propose to only add some very basic encodings to the standard distribution, e.g. the ones mentioned under Standard Codecs in the proposal:

    'utf-8':           8-bit variable length encoding
    'utf-16':          16-bit variable length encoding (little/big endian)
    'utf-16-le':       utf-16 but explicitly little endian
    'utf-16-be':       utf-16 but explicitly big endian
    'ascii':           7-bit ASCII codepage
    'latin-1':         Latin-1 codepage
    'html-entities':   Latin-1 + HTML entities; see htmlentitydefs.py from the standard Python Lib
    'jis' (a popular version XXX): Japanese character encoding
    'unicode-escape':  See Unicode Constructors for a definition
    'native':          Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg responds:
What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to handle shift states efficiently. This is not exactly what Andy shows, but it's not what Marc's current spec has either.

I had thought something more like what Java does: an output stream codec's constructor takes a writable file object and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequences. Similar for input stream codecs.

Andy's file translation example could then be written as follows:

    # assuming variables input_file, input_encoding, output_file,
    # output_encoding, and constant BUFFER_SIZE

    f = open(input_file, "rb")
    f1 = unicodec.codecs[input_encoding].stream_reader(f)
    g = open(output_file, "wb")
    g1 = unicodec.codecs[output_encoding].stream_writer(g)

    while 1:
        buffer = f1.read(BUFFER_SIZE)
        if not buffer:
            break
        g1.write(buffer)

    g1.close()
    f1.close()

Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.)
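If the stream interface did become the primary codec API, the in-memory conversions could indeed be layered on top of it roughly like this. This is only a sketch: unicodec and its stream_reader/stream_writer factories are the proposed, not yet existing, API.

    import cStringIO
    # 'unicodec' is the module from the proposal; it does not exist yet.

    def decode_string(data, encoding):
        """Encoded 8-bit string -> Unicode object, via the stream codec."""
        reader = unicodec.codecs[encoding].stream_reader(cStringIO.StringIO(data))
        return reader.read()               # read to EOF in one go

    def encode_string(ustr, encoding):
        """Unicode object -> encoded 8-bit string, via the stream codec."""
        out = cStringIO.StringIO()
        writer = unicodec.codecs[encoding].stream_writer(out)
        writer.write(ustr)
        writer.flush()                     # force out any trailing shift sequence
        return out.getvalue()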
Don't worry about it. 128K is too small to care, I think...
Agreed, let's leave the *implementation* of codecs out of the current efforts. However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvalds's philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.)
(Do you think you'll be able to extort some money from HP for these? :-)
I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation.

And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
|
| u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation? Since encoding names are not always legal Python identifiers (most contain hyphens), I don't understand what you really meant here. Do you mean to say that it has to be a keyword argument? I would disagree; and then I would have expected the notation [,encoding=<default encoding>].

| With the 'unicode-escape' encoding being defined as:
|
| u = u'<unicode-escape encoded Python string>'
|
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
|
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-).

Can you give some examples? Is u'\u0020' different from u'\x20' (a space)? Does '\u0020' (no u prefix) have a meaning?

Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree.

While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named François might write a file containing this statement:

    print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage.

Using the new Unicode strings, François could change his program as follows:

    print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again see garbage. If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis".

What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type

    print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)

On Mon, 15 Nov 1999 16:37:28 -0500, you wrote:
The proposal also says:
It seems to me that if we go for stream_reader, it replaces this bit of the proposal too - no need for unicodec to provide anything. If you want to have a convenience function there to save a line or two, you could have unicodec.open(filename, mode, encoding) which returned a stream_reader. - Andy
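Such a convenience function might be no more than the following, sketched against the same hypothetical stream_reader/stream_writer factories; none of these names are settled.

    import __builtin__                     # so 'open' below can still reach the builtin

    def open(filename, mode='rb', encoding='utf-8'):
        """Open a file and wrap it in the codec registered under 'encoding'."""
        if 'b' not in mode:
            mode = mode + 'b'              # the codec needs the raw bytes
        f = __builtin__.open(filename, mode)
        codec = unicodec.codecs[encoding]  # 'unicodec' is the proposed registry
        if 'r' in mode:
            return codec.stream_reader(f)
        return codec.stream_writer(f)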

[I'll get back on this tomorrow, just some quick notes here...] Guido van Rossum wrote:
The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets. The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work.
    import unicodec
    file = open('mytext.txt','rb')
    ufile = unicodec.stream(file,'utf-16')
    u = ufile.read()
    ...
    ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)... Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here.
You wouldn't want to go via cStringIO for *every* encoding translation. The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects).
Huh ? 128K for every process using Python ? That quickly sums up to lots of megabytes lying around pretty much unused.
Don't know, it depends on what their specs look like. I use mxTextTools for fast HTML file processing. It uses a small Turing machine with some extra magic and is programmable via Python tuples.
Ok.
I meant this as an optional second argument defaulting to whatever we define <default encoding> to mean, e.g. 'utf-8':

    u = unicode("string","utf-8") == unicode("string")

The <encoding name> argument must be a string identifying one of the registered codecs.
The first bullet covers the normal Python string characters and escapes, e.g. \n and \267 (the center dot ;-), while the second explains how \uXXXX is interpreted.
Can you give some examples?
Is u'\u0020' different from u'\x20' (a space)?
No, they both map to the same Unicode ordinal.
Does '\u0020' (no u prefix) have a meaning?
No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding: u = u'\u0020' == unicode(r'\u0020','unicode-escape') Note that writing \uXX is an error, e.g. u"\u12 " will cause cause a syntax error. Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ?
This can be had via unicode():

    u = unicode(r'\a\b\c\u0020','unicode-escape')

If that's too long, define a ur() function which wraps up the above line in a function.
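Such a ur() helper would be tiny -- something like the following, assuming the 'unicode-escape' codec name and the unicode() constructor from the proposal:

    def ur(s):
        """Interpret a raw 8-bit string as a unicode-escape encoded literal."""
        return unicode(s, 'unicode-escape')

    # e.g.  pi = ur(r'\u03C0')   # the character Pi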
I think best is to leave it undefined... as with all files, only the programmer knows what format and encoding it contains, e.g. a Japanese programmer might want to use a shift-JIS editor to enter strings directly in shift-JIS via

    u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but Python will continue to produce the intended string. NLS strings don't belong in program text anyway: i18n usually takes the gettext() approach to handle these issues. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc. So the codec should provide proper stream reading and writing services. Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility. - Andy
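A toy modal decoder illustrates why the state has to live in the codec: the mode flag and a possibly half-consumed double-byte character survive from one chunk to the next, so arbitrary slicing of the byte stream cannot work. The shift bytes and tables here are invented and do not correspond to any real JIS variant.

    SHIFT_IN, SHIFT_OUT = '\x0e', '\x0f'       # made-up mode-switch bytes

    class ToyModalDecoder:
        def __init__(self, single_table, double_table):
            self.single = single_table         # byte -> Unicode ordinal
            self.double = double_table         # (byte, byte) -> Unicode ordinal
            self.double_mode = 0               # current shift state
            self.pending = ''                  # first half of a split double-byte char

        def feed(self, data):
            """Decode a chunk; state carries over to the next call."""
            out = []
            for ch in data:
                if ch == SHIFT_IN:
                    self.double_mode = 1
                elif ch == SHIFT_OUT:
                    self.double_mode = 0
                elif self.double_mode:
                    if self.pending:
                        out.append(self.double[(self.pending, ch)])
                        self.pending = ''
                    else:
                        self.pending = ch      # wait for the second byte
                else:
                    out.append(self.single[ch])
            return out                         # list of Unicode ordinals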

Andy Robinson wrote:
I guess I'll have to rethink the Codec specs. Some leads:

1. introduce a new StreamCodec class which is designed for handling stream encoding and decoding (and supports state)

2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode implementation would then instantiate whenever it needs to apply the conversion; since this is only needed for encodings maintaining state, the registry would only have to do the instantiation for these codecs and could use cached instances for stateless codecs.
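Lead (2) could look roughly like this -- a registry that accepts either a shared instance (stateless codec) or a class (stateful stream codec) and instantiates the latter on every lookup. All names are illustrative only, not part of the proposal.

    import types

    _registry = {}

    def register(name, codec):
        """codec may be an instance (stateless) or a class (keeps state)."""
        _registry[name] = codec

    def lookup(name):
        codec = _registry[name]
        if type(codec) is types.ClassType:     # 1.5-era check for (old-style) classes
            return codec()                     # fresh instance per conversion
        return codec                           # cached, shareable instance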
Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[Guido]
Does '\u0020' (no u prefix) have a meaning?
[MAL]
No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding:
I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals.
Note that writing \uXX is an error, e.g. u"\u12 " will cause a syntax error.
Although I believe your intent <wink> is that, just as today, '\u12' is not an error.
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ?
Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two.
As before, I think that's fine for now, but won't stand forever.

Tim Peters wrote:
Right.
Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an exception.
Strange definition...
If Guido agrees to ur"", I can put that into the proposal too -- it's just that things are starting to get a little crowded for a strawman proposal ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[Guido]
So long as Python opens source files using libc text mode, it can't guarantee more than C does: the presence of any character other than tab, newline, and ASCII 32-126 inclusive renders the file contents undefined. Go beyond that, and you've got the same problem as mailers and browsers, and so also the same solution: open source files in binary mode, and add a pragma specifying the intended charset. As a practical matter, declare that Python source is Latin-1 for now, and declare any *system* that doesn't support that non-conforming <wink>. python-is-the-measure-of-all-things-ly y'rs - tim

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
note that the html/sgml/xml parsers generally support the feed/close protocol. to be able to use these codecs in that context, we need

1) codecs written according to the "data consumer model", instead of the "stream" model:

    class myDecoder:
        def __init__(self, target):
            self.target = target
            self.state = ...
        def feed(self, data):
            ... extract as much data as possible ...
            self.target.feed(extracted data)
        def close(self):
            ... extract what's left ...
            self.target.feed(additional data)
            self.target.close()

or 2) make threads mandatory, just like in Java.

or 3) add light-weight threads (ala stackless python) to the interpreter...

(I vote for alternative 3, but that's another story ;-)

</F>
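For what it's worth, chaining such a consumer-model codec in front of an existing feed()/close() parser would then look something like this. myDecoder is Fredrik's sketch above (pseudocode, not working code), and the file name is a placeholder.

    import xmllib

    parser = xmllib.XMLParser()
    decoder = myDecoder(parser)        # decoder pushes decoded data onward

    src = open('document.xml', 'rb')
    while 1:
        chunk = src.read(8192)
        if not chunk:
            break
        decoder.feed(chunk)            # may hold back an incomplete multi-byte sequence
    decoder.close()                    # flush the remainder, then close the parser
    src.close()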

On Mon, 15 Nov 1999, Guido van Rossum wrote:
This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in. Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ] Cheers, -g -- Greg Stein, http://www.lyra.org/

On 16 November 1999, Greg Stein said:
I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter:

    $ truss perl -e 1 2> perl.log
    $ truss python -c 1 2> python.log

(This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.)

The results are interesting, and useful despite the platform and version disparities. (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.)

Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them.

First, the size of log files (in lines), i.e. number of system calls:

              Solaris   Linux   IRIX[1]
    Perl           88      85        70
    Python        425     316       257

    [1] after chopping off the summary counts from the "par" output --
        ie. these really are the number of system calls, not the number
        of lines in the log files

Next, the number of "open" calls:

              Solaris   Linux   IRIX
    Perl           16      10       9
    Python        107      71      48

(It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.)

And the number of unsuccessful "open" calls:

              Solaris   Linux   IRIX
    Perl            6       1       3
    Python         77      49      32

Number of "mmap" calls:

              Solaris   Linux   IRIX
    Perl           25      25       1
    Python         36      24       1

...nope, guess we can't blame mmap for any Perl/Python startup disparity.

How about "brk":

              Solaris   Linux   IRIX
    Perl            6      11      12
    Python         47      39      25

...ok, looks like Greg's gripe about memory holds some water.

Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces the startup overhead as measured by "number of system calls". Some quick timing experiments show a drastic speedup (in wall-clock time) by adding "-S": about 37% faster under Solaris, 56% faster under Linux, and 35% under IRIX. These figures should be taken with a large grain of salt, as the Linux and IRIX systems were fairly well loaded at the time, and the wall-clock results I measured had huge variance. Still, it gets the point across.

Oh, also for the record, all timings were done like:

    perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }'

because I wanted to guarantee no shell was involved in the Python startup.

Greg
--
Greg Ward - software developer                    gward@cnri.reston.va.us
Corporation for National Research Initiatives    1895 Preston White Drive
voice: +1-703-620-8990                  Reston, Virginia, USA 20191-5434
fax: +1-703-620-0913

Greg Ward writes:
Running 'python -v' explains this:

    amarok akuchlin>python -v
    # /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py
    import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc
    # /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py
    import site # precompiled from /usr/local/lib/python1.5/site.pyc
    # /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py
    import os # precompiled from /usr/local/lib/python1.5/os.pyc
    import posix # builtin
    # /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py
    import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc
    # /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py
    import stat # precompiled from /usr/local/lib/python1.5/stat.pyc
    # /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py
    import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc
    Python 1.5.2 (#80, May 25 1999, 18:06:07)  [GCC 2.8.1] on sunos5
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so

And each import tries several different forms of the module name:

    stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0)          Err#2 ENOENT
    open("/usr/local/lib/python1.5/os.so", O_RDONLY)         Err#2 ENOENT
    open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY)   Err#2 ENOENT
    open("/usr/local/lib/python1.5/os.py", O_RDONLY)         = 4

I don't see how this is fixable, unless we strip down site.py, which drags in os, which drags in os.path and stat and UserDict.

--
A.M. Kuchling            http://starship.python.net/crew/amk/
I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but otherwise I'm just peachy.
    -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks"

"AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes:
    AMK> I don't see how this is fixable, unless we strip down
    AMK> site.py, which drags in os, which drags in os.path and stat
    AMK> and UserDict.

One approach might be to support loading modules out of jar files (or whatever) using Greg's imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S, might make Python start up much faster, at the small cost of some flexibility (which could be regained with a command-line switch or other mechanism to bypass modules.jar). -Barry

A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference. The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users... But if those can be solved, this could greatly reduce the number of stats and unsuccessful opens, without having to resort to jar files. --Guido van Rossum (home page: http://www.python.org/~guido/)
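A toy version of that cache, ignoring the hard parts Guido mentions (invalidation, locking, sharing between users). The cache file location and the function name are made up; only imp.find_module and anydbm are real APIs.

    import anydbm, imp, os

    CACHE_FILE = '/tmp/py-module-cache'        # placeholder; per-user in practice

    def find_module_cached(name, path=None):
        """Return the filename for a module, consulting a dbm cache first."""
        cache = anydbm.open(CACHE_FILE, 'c')
        try:
            if cache.has_key(name) and os.path.exists(cache[name]):
                return cache[name]             # hit: one stat() instead of a path search
            f, pathname, description = imp.find_module(name, path)
            if f is not None:
                f.close()                      # we only wanted the location
            if pathname:
                cache[name] = pathname         # remember it for next time
            return pathname
        finally:
            cache.close()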

On 16 November 1999, Guido van Rossum said:
Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening?
Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm... Greg

Greg Ward [gward@cnri.reston.va.us] wrote:
Wow, that's a huge startup that I've personally never seen. I can't imagine... even loading the Oracle libraries dynamically, which are HUGE (2Mb or so), it's only a couple of seconds.
This doesn't resolve it for a lot of other users of Python, however... and Zope would always benefit, especially when you're running multiple instances on the same machine... they would perhaps share more code. Chris -- | Christopher Petrilli | petrilli@amber.org

Barry A. Warsaw writes:
A couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over. but-since-it's-Windows-it-must-be-tainted-ly y'rs - Gordon

On Tue, 16 Nov 1999, Gordon McMillan wrote:
I've got a patch from Jim Ahlstrom to provide a "standardized" library file. I've got to review and fold that thing in (I'll post here when that is done). As Gordon states: yes, the startup time is considerably improved.

The DBM approach is interesting. That could definitely be used thru an imputils Importer; it would be quite interesting to try that out. (Note that the library-style approach would make updates even harder to deal with, relative to what Sjoerd saw with the DBM approach; I would guess that the "right" approach is to rebuild the library from scratch and atomically replace the thing (but that would bust people with open references...))

Certainly something to look at. Cheers, -g

p.s. I also want to try mmap'ing a library and creating code objects that use PyBufferObjects (rather than PyStringObjects) that refer to portions of the mmap. Presuming the mmap is shared, there "should" be a large reduction in heap usage. The question is that I don't know the proportion of code bytes to other heap usage caused by loading a .pyc.

p.p.s. I also want to try the buffer approach for frozen code. -- Greg Stein, http://www.lyra.org/

[Gordon McMillan]
That's truly remarkable (he says while waiting for the Inbox Repair Tool to finish repairing his 50Mb Outlook mail file ...)!
but-since-it's-Windows-it-must-be-tainted-ly y'rs
Indeed -- if it runs on Windows, it's a worthless piece o' crap <wink>.

Greg Ward wrote:
For kicks I've done a similar test with cgipython, the one-file version of Python 1.5.2 (the numbers follow the same categories as Greg's, in order):

    total system calls:          182
    "open" calls:                 33
    unsuccessful "open" calls:    28   (note that cgipython does search for sitecustomize.py)
    "mmap" calls:                 13
    "brk" calls:                  41 (?)

So at least in theory, using cgipython for the intended purpose should gain some performance. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:
These are all great ideas, but I think they unnecessarily complicate the proposal.
However, to claim that Python is properly internationalized, we will need a large number of multi-byte encodings to be available. It's a large amount of work, it must be provably correct, and someone's going to have to do it. So if anyone with more C expertise than me - not hard :-) - is interested... I'm not suggesting putting my points in the Unicode proposal - in fact, I'm very happy we have a proposal which allows for extension, and lets us work on the encodings separately (and later).
Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core. Ditto with HTML, and maybe even escaped-unicode too. In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully. - Andy

Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture! Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.) --Guido van Rossum (home page: http://www.python.org/~guido/)

Andy Robinson wrote:
So I can drop JIS ? [I won't be able to drop the escaped unicode codec because this is needed for u"" and ur"".] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

since this is already very close, maybe we could adopt the naming guidelines from XML: In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names. Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way. (ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...). </F>

On Tue, 16 Nov 1999, Fredrik Lundh wrote:
+1 (as we'd say in Apache-land... :-) -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
    · Unicode encoding names should be lower case on output and
      case-insensitive on input (they will be converted to lower case by
      all APIs taking an encoding name as input). Encoding names should
      follow the name conventions as used by the Unicode Consortium:
      spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'.

Is there a naming scheme definition for these encoding names? (The quote you gave above doesn't really sound like a definition to me.) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
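A small normalization helper could honour both conventions at once -- the lower-casing and space-to-hyphen rule quoted above plus a table of IANA/XML-style aliases. The alias table below is just an example, not a proposed registry.

    import string

    ALIASES = {
        'iso-8859-1': 'latin-1',
        'us-ascii':   'ascii',
    }

    def normalize_encoding(name):
        name = string.lower(string.strip(name))
        name = string.replace(name, '_', '-')            # 'Shift_JIS' -> 'shift-jis'
        name = string.join(string.split(name), '-')      # 'utf 16'    -> 'utf-16'
        return ALIASES.get(name, name)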

participants (12)
- Andrew M. Kuchling
- Andy Robinson
- andy@robanal.demon.co.uk
- Barry A. Warsaw
- Christopher Petrilli
- Fredrik Lundh
- Gordon McMillan
- Greg Stein
- Greg Ward
- Guido van Rossum
- M.-A. Lemburg
- Tim Peters