From mhammond@skippinet.com.au  Mon Nov  1 01:51:56 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 1 Nov 1999 12:51:56 +1100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
Message-ID: <002301bf240b$ae61fa00$0501a8c0@bobcat>

I have for some time been wondering about the usefulness of this mailing list. It seems to have produced staggeringly few results since inception. This is not a criticism of any individual, but of the process. It is proof in my mind of how effective the benevolent dictator model is, and how ineffective a language run by committee would be. This "committee" never seems to be capable of reaching a consensus on anything. A number of issues don't seem to provoke any responses. As a result, many things seem to die a slow and lingering death. Often there is lots of interesting discussion, but still precious few results.

In the pre-python-dev days, the process seemed easier - we mailed Guido directly, and he either stated "yea" or "nay" - maybe we didn't get the response we hoped for, but at least we got a response. Now, we have the result that even if Guido does enter into a thread, the noise seems to drown out any hope of getting anything done. Guido seems to be faced with the dilemma of asserting his dictatorship in the face of many dissenting opinions from many people he respects, or putting it in the too-hard basket. I fear the latter is the easiest option.

At the end of this mail I list some of the major threads over the last few months, and I can't see a single thread that has resulted in a CVS check-in, and only one that has resulted in agreement. This, to my mind at least, is proof that things are really not working.

I long for the "good old days" - take the replacement of "ni" with built-in functionality, for example. I posit that if this had been discussed on python-dev, it would have caused a huge flood of mail and nothing remotely resembling a consensus. Instead, Guido simply wrote an essay and implemented some code that he personally liked. No debate, no discussion. Still an excellent result. Maybe not a perfect result, but a result nonetheless.

However, Guido's time is becoming increasingly limited. So should we consider moving to a "benevolent lieutenant" model, in conjunction with re-ramping up the SIGs? This would provide two ways to get things done:

* A new SIG. Take relative imports, for example. If we really do need a change in this fairly fundamental area, a SIG would be justified ("import-sig"). The responsibility of the SIG is to form a consensus (and code that reflects it), and report back to Guido (and the main newsgroup) with the result. It worked well for RE, and allowed those of us not particularly interested to keep out of the debate. If the SIG cannot form a consensus, then tough - it dies - and should not be mourned. Presumably Guido would keep a watchful eye over the SIG, providing direction where necessary, but in general stay out of the day-to-day traffic. New SIGs seem to have stopped since this list's creation, and it seems that issues that should be discussed in new SIGs are now discussed here.

* Guido could delegate some of his authority to a single individual responsible for a certain limited area - a benevolent lieutenant. We might have a lieutenant responsible for each of several different areas, and they could exercise their authority only over small, trivial changes.
E.g., the "getopt helper" thread - if a lieutenant were given authority for the "standard library", they could simply make a yea or nay decision and present it to Guido. Presumably Guido trusts this person he delegated to enough that the majority of the lieutenant's recommendations would be accepted. Presumably there would be a small number of lieutenants, and they would then become the new "python-dev" - say up to 5 people. This list would then discuss high-level strategies, with members seeking direction from each other when things get murky.

This select group of people may not (indeed, probably would not) include me, but I would have no problem with that - I would rather see results achieved than have my own ego stroked by being included in a select but ineffective group.

In parting, I repeat this is not a direct criticism, simply an observation of the last few months. I am on this list, so I am definitely as guilty as anyone else - which is "not at all" - i.e., no one is guilty; I simply see it as endemic to a committee with people of diverse backgrounds, skills and opinions.

Any thoughts? Long live the dictator! :-)

Mark.

Recent threads, and my take on the results:

* getopt helper?
  Too much noise regarding semantic changes.
* Alternative Approach to Relative Imports
* Relative package imports
* Path hacking
* Towards a Python based import scheme
  Too much noise - no one could really agree on the semantics. Implementation thrown in the ring, and promptly forgotten.
* Corporate installations
  Very young, but no result at all.
* Embedding Python when using different calling conventions
  Quite young, but no result as yet, and I have no reason to believe there will be.
* Catching "return" and "return expr" at compile time
  Seemed to be blessed - yay! Don't believe I have seen a check-in yet.
* More Python command-line features
  Seemed general agreement, but nothing happened?
* Tackling circular dependencies in 2.0?
  Lots of noise, but no results other than "GC may be there in 2.0".
* Buffer interface in abstract.c
  Determined it could break - no solution proposed. Lots of noise regarding whether it is a good idea at all!
* mmapfile module
  No result.
* Quick-and-dirty weak references
  No result.
* Portable "spawn" module for core?
  No result.
* Fake threads
  Seemed to spawn stackless Python, but in the face of Guido being "at best, lukewarm" about this issue, I would again have to conclude "no result". An authoritative "no" in this area may have saved lots of effort and heartache.
* add Expat to 1.6
  No result.
* I'd like list.pop to accept an optional second argument giving a default value
  No result.
* etc
  No result.

From jack@oratrix.nl  Mon Nov  1 09:56:48 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 10:56:48 +0100
Subject: [Python-Dev] Embedding Python when using different calling conventions.
In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com>, Sat, 30 Oct 1999 10:46:30 +0200, <381AB066.B54A47E0@lemburg.com>
Message-ID: <19991101095648.DC2E535BB1E@snelboot.oratrix.nl>

> OTOH, we could take chance to reorganize these macros from bottom
> up: when I started coding extensions I found them not very useful
> mostly because I didn't have control over them meaning "export
> this symbol" or "import the symbol". Especially the DL_IMPORT
> macro is strange because it seems to handle both import *and*
> export depending on whether Python is compiled or not.

This would be very nice. The DL_IMPORT/DL_EXPORT stuff is really weird unless you're working with it all the time.
We were trying to build a plugin DLL for PythonWin, and first you spend hours finding out that you have to set DL_IMPORT (and how to set it), and then you spend another few hours before you realize that you can't simply copy the DL_IMPORT and DL_EXPORT from, say, timemodule.c, because timemodule.c is going to be in the Python core (and hence can use DL_IMPORT for its init() routine declaration) while your module is going to be a plugin, so it can't. I would opt for a scheme where the define shows where the symbol is expected to live (DL_CORE and DL_THISMODULE would be needed at least, but probably one or two more for .h files).
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From jack@oratrix.nl  Mon Nov  1 10:12:37 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 11:12:37 +0100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: Message by "Mark Hammond" <mhammond@skippinet.com.au>, Mon, 1 Nov 1999 12:51:56 +1100, <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl>

I think I agree with Mark's post, although I do see a little more light (the relative imports discussion resulted in working code, for instance). The benevolent lieutenant idea may work, _if_ the lieutenants can be found. I myself will quickly join Mark in wishing the new python-dev well and abandoning ship (half a :-).

If that doesn't work, maybe we should try at the very least to create a "memory". If you bring up a subject for discussion and you don't have working code, that's fine the first time. But if anyone brings it up a second time they're supposed to have code. That way at least we won't be rehashing old discussions (as happens on the python-list every time, with subjects like GC or optimizations).

And maybe we should limit ourselves in our replies: don't speak up too much in discussions if you're not going to write code. I know that I'm pretty good at answering with my brilliant insights on everything myself :-). It could well be that refining and refining the design (as in the getopt discussion) results in such a mess of opinions that no one has the guts to write the code anymore.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From mal@lemburg.com  Mon Nov  1 11:09:21 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 12:09:21 +0100
Subject: [Python-Dev] dircache.py
References: <1270737688-19939033@hypernet.com>
Message-ID: <381D74E0.1AE3DA6A@lemburg.com>

Gordon McMillan wrote:
>
> Pursuant to my volunteering to implement Guido's plan to
> combine cmp.py, cmpcache.py, dircmp.py and dircache.py
> into filecmp.py, I did some investigating of dircache.py.
>
> I find it completely unreliable. On my NT box, the mtime of the
> directory is updated (on average) 2 secs after a file is added,
> but within 10 tries, there's always one in which it takes more
> than 100 secs (and my test script quits). My Linux box hardly
> ever detects a change within 100 secs.
>
> I've tried a number of ways of testing this ("this" being
> checking for a change in the mtime of the directory), the latest
> of which is below.
Even if dircache can be made to work > reliably and surprise-free on some platforms, I doubt it can be > done cross-platform. So I'd recommend that it just get dropped. > > Comments? Note that you'll have to flush and close the tmp file to actually have it written to the file system. That's why you are not seeing any new mtimes on Linux. Still, I'd suggest declaring it obsolete. Filesystem access is usually cached by the underlying OS anyway, so adding another layer of caching on top of it seems not worthwhile (plus, the OS knows better when and what to cache). Another argument against using stat() time entries for caching purposes is the resolution of 1 second. It makes the dircache.py unreliable per se for fast changing directories. The problem is most probably even worse for NFS and on Samba mounted WinXX filesystems the mtime trick doesn't work at all (stat() returns the creation time for atime, mtime and ctime). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward@cnri.reston.va.us Mon Nov 1 13:28:51 1999 From: gward@cnri.reston.va.us (Greg Ward) Date: Mon, 1 Nov 1999 08:28:51 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100 References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <19991101082851.A16952@cnri.reston.va.us> On 01 November 1999, Mark Hammond said: > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. Perhaps this is an indication of stability rather than stagnation. Of course we can't have *total* stability or Python 1.6 will never appear, but... > * Portable "spawn" module for core? > No result. ...I started this little thread to see if there was any interest, and to find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list of strings as command-line arguments" makes any sense at all on the Mac without actually having to go learn about the Mac. The result: if 'spawn()' is added to the core, it should probably be 'os.spawn()', but it's not really clear if this is necessary or useful to many people; and, no, it doesn't make sense on the Mac. That answered my questions, so I don't really see the thread as a failure. I might still turn the distutils.spawn module into an appendage of the os module, but there doesn't seem to be a compelling reason to do so. Not every thread has to result in working code. In other words, negative results are results too. Greg From skip@mojam.com (Skip Montanaro) Mon Nov 1 16:58:41 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST) Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat> References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <14365.50881.778143.590205@dolphin.mojam.com> Mark> * Catching "return" and "return expr" at compile time Mark> Seemed to be blessed - yay! Dont believe I have seen a check-in Mark> yet. I did post a patch to compile.c here and to the announce list. I think the temporal distance between the furor in the main list and when it appeared "in print" may have been a problem. 
Also, as the author of that code I surmised that compile.c was the wrong place for it. I would have preferred to see it in some Python code somewhere, but there's no obvious place to put it. Finally, there is as yet no convention about how to handle warnings. (Maybe some sort of PyLint needs to be "blessed" and made part of the distribution.) Perhaps python-dev would be good to generate SIGs, sort of like a hurricane spinning off tornadoes. Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From guido@CNRI.Reston.VA.US Mon Nov 1 18:41:32 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 01 Nov 1999 13:41:32 -0500 Subject: [Python-Dev] Misleading syntax error text In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100." <381CCA27.59506CF6@lemburg.com> References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us> > How about chainging the com_assign_trailer function in Python/compile.c > to: Please don't use the python-dev list for issues like this. The place to go is the python-bugs database (http://www.python.org/search/search_bugs.html) or you could just send me a patch (please use a context diff and include the standard disclaimer language). --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon Nov 1 19:06:39 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 20:06:39 +0100 Subject: [Python-Dev] Misleading syntax error text References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us> Message-ID: <381DE4BF.951B03F0@lemburg.com> Guido van Rossum wrote: > > > How about chainging the com_assign_trailer function in Python/compile.c > > to: > > Please don't use the python-dev list for issues like this. The place > to go is the python-bugs database > (http://www.python.org/search/search_bugs.html) or you could just send > me a patch (please use a context diff and include the standard disclaimer > language). This wasn't really a bug report... I was actually looking for some feedback prior to sending a real (context) patch. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jim@interet.com Tue Nov 2 15:43:56 1999 From: jim@interet.com (James C. Ahlstrom) Date: Tue, 02 Nov 1999 10:43:56 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <381F06BC.CC2CBFBD@interet.com> Mark Hammond wrote: > > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. I appreciate the points you made, but I think this list is still a valuable place to air design issues. I don't want to see too many Python core changes anyway. Just my 2.E-2 worth. Jim Ahlstrom From Vladimir.Marangozov@inrialpes.fr Wed Nov 3 22:34:44 1999 From: Vladimir.Marangozov@inrialpes.fr (Vladimir Marangozov) Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT) Subject: [Python-Dev] paper available Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr> I've OCR'd Saltzer's paper. 
It's available temporarily (in MS Word format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip Since there may be legal problems with LNCS, I will disable the link shortly (so those of you who have not received a copy and are interested in reading it, please grab it quickly) If prof. Saltzer agrees (and if he can, legally) put it on his web page, I guess that the paper will show up at http://mit.edu/saltzer/ Jeremy, could you please check this with prof. Saltzer? (This version might need some corrections due to the OCR process, despite that I've made a significant effort to clean it up) -- Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252 From guido@CNRI.Reston.VA.US Thu Nov 4 20:58:53 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 04 Nov 1999 15:58:53 -0500 Subject: [Python-Dev] wish list Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us> I got the wish list below. Anyone care to comment on how close we are on fulfilling some or all of this? --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Thu, 04 Nov 1999 20:26:54 +0700 From: "Claudio Ramón" <rmn70@hotmail.com> To: guido@python.org Hello, I'm a python user (excuse my english, I'm spanish and...). I think it is a very complete language and I use it in solve statistics, phisics, mathematics, chemistry and biology problemns. I'm not an experienced programmer, only a scientific with problems to solve. The motive of this letter is explain to you a needs that I have in the python use and I think in the next versions... * GNU CC for Win32 compatibility (compilation of python interpreter and "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative eviting the cygwin dll user. * Add low level programming capabilities for system access and speed of code fragments eviting the C-C++ or Java code use. Python, I think, must be a complete programming language in the "programming for every body" philosofy. * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI in the standard distribution. For example, Wxpython permit an html browser. It is very importan for document presentations. And Wxwindows and Gtk+ are faster than tk. * Incorporate a database system in the standard library distribution. To be possible with relational and documental capabilites and with import facility of DBASE, Paradox, MSAccess files. * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to be possible with XML how internal file format). And to be possible with Microsoft Word import export facility. For example, AbiWord project can be an alternative but if lacks programming language. If we can make python the programming language for AbiWord project... Thanks. Ramón Molina. rmn70@hotmail.com ______________________________________________________ Get Your Private, Free Email at http://www.hotmail.com ------- End of Forwarded Message From skip@mojam.com (Skip Montanaro) Thu Nov 4 21:06:53 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62829.389307.377095@dolphin.mojam.com> * Incorporate a database system in the standard library distribution. 
To be possible with relational and documental capabilites and with import facility of DBASE, Paradox, MSAccess files. I know Digital Creations has a dbase module knocking around there somewhere. I hacked on it for them a couple years ago. You might see if JimF can scrounge it up and donate it to the cause. Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake@acm.org Thu Nov 4 21:08:26 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us> Guido van Rossum writes: > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? Claudio Ramón <rmn70@hotmail.com> wrote: > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. And GTK+ looks better, too. ;-) None the less, I don't think GTK+ is as solid or mature as Tk. There are still a lot of oddities, and several warnings/errors get messages printed on stderr/stdout (don't know which) rather than raising exceptions. (This is a failing of GTK+, not PyGTK.) There isn't an equivalent of the Tk text widget, which is a real shame. There are people working on something better, but it's not a trivial project and I don't have any idea how its going. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Doesn't sound like part of a core library really, though I could see combining the Win32 extensions with the core package to produce a single installable. That should at least provide access to MSAccess, and possible the others, via ODBC. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I think this would be great to have. But I wouldn't put the editor/browser in the core. I would stick something like the XML-SIG's package in, though, once that's better polished. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From jim@interet.com Fri Nov 5 00:09:40 1999 From: jim@interet.com (James C. Ahlstrom) Date: Thu, 04 Nov 1999 19:09:40 -0500 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <38222044.46CB297E@interet.com> Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I don't know what this means. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. 
Python, I think, must be a > complete programming language in the "programming for every body" philosofy. I don't know what this means in practical terms either. I use the C interface for this. > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. As a Windows user, I don't feel comfortable publishing GUI code based on these tools. Maybe they have progressed and I should look at them again. But I doubt the Python world is going to standardize on a single GUI anyway. Does anyone out there publish Windows Python code with a Windows Python GUI? If so, what GUI toolkit do you use? Jim Ahlstrom From rushing@nightmare.com Fri Nov 5 07:22:22 1999 From: rushing@nightmare.com (Sam Rushing) Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <668469884@toto.iv> Message-ID: <14370.34222.884193.260990@seattle.nightmare.com> James C. Ahlstrom writes: > Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > > on fulfilling some or all of this? > > > * GNU CC for Win32 compatibility (compilation of python interpreter and > > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > > eviting the cygwin dll user. > > I don't know what this means. mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying to be unix. It links against crtdll, so for example it can generate small executables that run on any win32 platform. Also, an alternative to plunking down money ever year to keep up with MSVC++ I used to use mingw32 a lot, and it's even possible to set up egcs to cross-compile to it. At one point using egcs on linux I was able to build a stripped-down python.exe for win32... http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/ -Sam From jim@interet.com Fri Nov 5 14:04:59 1999 From: jim@interet.com (James C. Ahlstrom) Date: Fri, 05 Nov 1999 09:04:59 -0500 Subject: [Python-Dev] wish list References: <14370.34222.884193.260990@seattle.nightmare.com> Message-ID: <3822E40B.99BA7CA0@interet.com> Sam Rushing wrote: > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > to be unix. It links against crtdll, so for example it can generate OK, thanks. But I don't believe this is something that Python should pursue. Binaries are available for Windows and Visual C++ is widely available and has a professional debugger (etc.). Jim Ahlstrom From skip@mojam.com (Skip Montanaro) Fri Nov 5 17:17:58 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST) Subject: [Python-Dev] paper available In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr> References: <199911032234.XAA26442@pukapuka.inrialpes.fr> Message-ID: <14371.4422.96832.498067@dolphin.mojam.com> Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip I downloaded it and took a very quick peek at it, but it's applicability to Python wasn't immediately obvious to me. Did you download it in response to some other thread I missed somewhere? Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... 
From gstein@lyra.org Fri Nov 5 22:19:49 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <3822E40B.99BA7CA0@interet.com> Message-ID: <Pine.LNX.4.10.9911051418330.32496-100000@nebula.lyra.org> On Fri, 5 Nov 1999, James C. Ahlstrom wrote: > Sam Rushing wrote: > > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > > to be unix. It links against crtdll, so for example it can generate > > OK, thanks. But I don't believe this is something that > Python should pursue. Binaries are available for Windows > and Visual C++ is widely available and has a professional > debugger (etc.). If somebody is willing to submit patches, then I don't see a problem with it. There are quite a few people who are unable/unwilling to purchase VC++. People may also need to build their own Python rather than using the prebuilt binaries. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Sun Nov 7 13:24:24 1999 From: gstein@lyra.org (Greg Stein) Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST) Subject: [Python-Dev] updated modules Message-ID: <Pine.LNX.4.10.9911070518020.32496-100000@nebula.lyra.org> Hi all... I've updated some of the modules at http://www.lyra.org/greg/python/. Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and a new imputil.py. The latter will be updated again RSN with some patches from Jim Ahlstrom. Besides some tweaks/fixes/etc, I've also clarified the ownership and licensing of the things. httplib and davlib are (C) Guido, licensed under the Python license (well... anything he chooses :-). qp_xml and imputil are still Public Domain. I also added some comments into the headers to note where they come from (I've had a few people remark that they ran across the module but had no idea who wrote it or where to get updated versions :-), and I inserted a CVS Id to track the versions (yes, I put them into CVS just now). Note: as soon as I figure out the paperwork or whatever, I'll also be skipping the whole "wetsign.txt" thingy and just transfer everything to Guido. He remarked a while ago that he will finally own some code in the Python distribution(!) despite not writing it :-) I might encourage others to consider the same... Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Mon Nov 8 09:33:30 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 08 Nov 1999 10:33:30 +0100 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <382698EA.4DBA5E4B@lemburg.com> Guido van Rossum wrote: > > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I think this would be a good alternative for all those not having MS VC for one reason or another. Since Mingw32 is free this might be an appropriate solution for e.g. schools which don't want to spend lots of money for VC licenses. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. Python, I think, must be a > complete programming language in the "programming for every body" philosofy. Don't know what he meant here... > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. 
GUIs tend to be fast moving targets, better leave them out of the main distribution. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Database interfaces are usually way to complicated and largish for the standard dist. IMHO, they should always be packaged separately. Note that simple interfaces such as a standard CSV file import/export module would be neat extensions to the dist. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I'm getting the feeling that Ramon is looking for a complete visual programming environment here. XML support in the standard dist (faster than xmllib.py) would be nice. Before that we'd need solid builtin Unicode support though... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 53 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Tue Nov 9 13:57:46 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST) Subject: [Python-Dev] Internationalisation Case Study Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com> Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard. I thought the most helpful thing would be to describe the real-world business problems I have been tackling so people can understand what one might want from an encoding toolkit. In this (long) post I have included: 1. who I am and what I want to do 2. useful sources of info 3. a real world i18n project 4. what I'd like to see in an encoding toolkit Grab a coffee - this is a long one. 1. Who I am -------------- Firstly, credentials. I'm a Python programmer by night, and when I can involve it in my work which happens perhaps 20% of the time. More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference tio a dictionary. Since then my Japanese has atrophied badly, but it is good enough for IT purposes. For the last year and a half I have been internationalizing a lot of systems - more on this below. My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation. An encoding library is sorely needed for this. 2. Sources of Knowledge ------------------------------ We should really go for world class advice on this. Some people who could really contribute to this discussion are: - Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe. - Jeffrey Friedl, author of "Mastering Regular Expressions", and a long time Japan resident and expert on things Japanese - Maybe some of the Ruby community? I'll list up books URLs etc. for anyone who needs them on request. 3. 
A Real World Project ---------------------------- 18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo) , who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms. Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential. I've been there for a while now, rotating through many different IT projects. HugeCo runs its non-US business out of the UK. The core deal-processing business runs on IBM AS400s. These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC. Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions. This means 'mirroring' data between the two systems on a regular basis. IBM has always included encoding information on the AS400 and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.) To make things work for Japan, everyone realised that a double-byte representation would be needed. Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it. Here's a potted language lesson. (Apologies to people who really know this field -- I am not going to be fully pedantic or this would take forever). Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence. The first attempt to display something was to make a single -byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page. So you could spell out the sounds of Japanese words using 'half width katakana'. The basic 'character set' is Japan Industrial Standard 0208 ("JIS"). This was defined in 1978, the first official Asian character set to be defined by a government. This can be thought of as a printed chart showing the characters - it does not define their storage on a computer. It defined a logical 94 x 94 grid, and each character has an index in this grid. The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails. Each Japanese character had a double-byte value. It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite. In 1978 Microsoft quickly came up with Shift-JIS, a smarter encoding. This basically said "Look at the next byte. If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half". Extended Unix Code (EUC) does similar tricks. Both have the property that there are no control characters, and ASCII is still ASCII. There are a few other encodings too. Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set). This involves shift-in and shift-out sequences (0x16 and 0x17, cannot remember which way round), so you can mix single and double bytes in a field. And we used AS400s for our core processing. So, back to the problem. 
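As an aside, the byte-classification rule just described is easy to sketch in Python. The function name and the exact ranges below are quoted from memory and are purely illustrative - this is not the conversion library discussed later - but it shows the kind of per-byte decision any Shift-JIS-aware tool has to make:

def classify_sjis(data):
    """Classify the bytes of a Shift-JIS string.
    data: a sequence of integer byte values (e.g. map(ord, s)).
    Returns a list of (kind, byte_values) pairs, where kind is
    'ascii', 'katakana', 'double' or 'bad'."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                  # plain ASCII
            out.append(('ascii', [b]))
            i = i + 1
        elif 0xA1 <= b <= 0xDF:                       # half-width katakana
            out.append(('katakana', [b]))
            i = i + 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # lead byte (0xF0 and up is the user-defined region)
            if i + 1 < len(data) and 0x40 <= data[i+1] <= 0xFC and data[i+1] != 0x7F:
                out.append(('double', [b, data[i+1]]))
                i = i + 2
            else:                                     # truncated or corrupt: half a kanji
                out.append(('bad', [b]))
                i = i + 1
        else:
            out.append(('bad', [b]))
            i = i + 1
    return out

Anything that falls through to 'bad' is exactly the "half a kanji or a couple of random bytes" case that comes up repeatedly below.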
We had a FoxPro system using ShiftJIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with. The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs. The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked. The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project. Not only did we need a conversion filter, we had the problem that the character sets were of different sizes. So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip. (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?) We figured out how to get the AS400 do to the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding. It was straightforward in theory, "fun" in practice. I then wrote a python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them. It could also detect corruption and warn us when it occurred. (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library). It was slow, but it got us our gateway in both directions, and it warned us of bad input. 360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends. The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't. A big issue was that field lengths varied. An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half. This resulted in a string that was illegal DBCS, and errors in the database. To guard against this, you need really picky input validation. You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too. The next stage was to bring in our Sybase databases. Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case sensitive - more (unrelated) fun when you have 2000 tables. Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems. Basically, a UTF8 character is between one and three bytes, there are no nulls or control characters, and the ASCII characters are still the same ASCII characters. UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic. We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese. The company bought a library called Unilib to do conversions, and started rewriting the data mirror software. This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings. 
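The UTF8 "bit twiddling" mentioned above really is entirely algorithmic. As a purely illustrative sketch (not the Unilib code), a code point up to 0xFFFF - which covers everything discussed here - becomes one to three bytes like this:

def utf8_encode_codepoint(cp):
    if cp < 0x80:                    # ASCII stays ASCII, one byte
        return [cp]
    elif cp < 0x800:                 # two bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    elif cp <= 0xFFFF:               # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    else:
        raise ValueError("code points above 0xFFFF need four bytes")

Decoding just reverses the masks, and the mapping is one-to-one, which is what makes UTF8 so safe as a storage and interchange format.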
We wanted to test it, and used the Python routines to put together a regression test. As expected, it was mostly right but had some differences, which we were at least able to document. We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months). We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin, so we were able to do the low-level string conversion in compiled code and the high-level control in Python. A FoxPro batch job wrote out delimited text in Shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, and FTP'ed them to an 'in' directory on the Unix box ready for daily import. At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields).

Another issue was half-width katakana. These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard. They look ugly and are discouraged; but when you are entering a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per-character to one (just hit F7 in Windows) to save space. Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents.

The final technical step was our reports package. This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out PostScript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that. It's a monster to develop in, but when done it really works - million-page jobs, no problem. We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later.

The first stage was that the AS400 would send down fixed-width data files in EBCDIC and DBCS. We ran these through a C++ conversion utility, again using Unilib. We had to filter out and warn about corrupt fields, which the conversion utility would reject. Surviving records then went into the reports program. It then turned out that the reports program only supported some of the Japanese alphabets. Specifically, it had a built-in font-switching system whereby when it encountered ASCII text, it would flip to the most recent single-byte font, and when it found a byte above 127, it would flip to a double-byte font. This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones. This was wrong for Japanese, and made the half-width katakana unprintable. I found out that I could control fonts if I printed one character at a time with a special escape sequence, so I wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted. So a special subroutine is used for every name or address field. This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string.

After all of this, we had a working system and knew quite a bit about encodings. Then the curve ball arrived: User Defined Characters!
It is not true to say that there are exactly 6879 characters in Japanese, any more than one can count the number of languages on the Indian sub-continent or the types of cheese in France. There are historical variations and they evolve. Some people's names got missed out, and others like to write a kanji in an unusual way. Others arrived from China, where they have more complex variants of the same characters. Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them. My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it, but not otherwise (the usual workaround is to spell their name phonetically in katakana). But our marketing people put their foot down.

A key factor is that Microsoft has 'extended the standard' a few times. First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard. This is well understood, and most encoding toolkits know that 'Code Page 932' is Shift-JIS plus a few extra characters. Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters. They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts, so users are getting used to these being available. These are not in a standard. Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region.

The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server. This was also fun, as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine. Effectively, each company has its own custom encoding, a bit bigger than the standard. Clearly, none of these extensions would convert automatically to the other platforms. Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them. We decided that space for 128 User-Defined Characters would be allowed.

We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked! Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back. So the conversion problem needed less code than we thought. This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway). We did, however, need some really impressive validation. When you input a name or address on any of the platforms, the system should say (a) is it valid for my encoding? (b) will it fit in the available field space on the other platforms? (c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.?

Finally, we got back to the display problems. Our chosen range had a particular first byte. We built a miniature font with the characters we needed starting in the lower half of the code page.
I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'. This worked beautifully - not only could we print everything, we were using Type 1 embedded fonts for the user-defined characters, so we could distill it and also capture it for our internal document imaging systems.

So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms. I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web. The first stage was to prove it all worked. With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML. There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft kanji and our user-defined characters won't work on the web. You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative. Certain customers will need to be warned, or asked to suggest a standard version of a character if they want to see their name on the web. I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune.

Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'. The language strings are all kept in UTF8 in XML files, so the same file can hold many languages. This and the database are the real-world reasons why you want to store stuff in UTF8. There are very few tools to let you view UTF8, but luckily there is a free word processor that lets you type Japanese and save it in any encoding, so we can cut and paste between Shift-JIS and UTF8 as needed.

And that's it. No climactic endings and a lot of real-world mess, just like life in IT. But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with. See my other mail for actual suggestions.

- Andy Robinson

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com

From andy@robanal.demon.co.uk  Tue Nov  9 13:58:39 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>

Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to Java's byte array - in other words, an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already.
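Purely as an illustration of the distinction (using the spelling that later versions of Python adopted for these two types, and an arbitrary katakana word):

text = u'\u30cf\u30ed\u30fc'             # three katakana characters: ha, ro, long-vowel mark
raw = text.encode('shift_jis')           # the same text as six bytes of Shift-JIS
assert len(text) == 3 and len(raw) == 6  # per-character versus per-byte
assert raw.decode('shift_jis') == text   # and the round trip survives

That per-character behaviour, and the lossless round trip, are exactly what the rest of this proposal is trying to guarantee.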
To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back. Easy Conversions ---------------------- This is modelled on Java which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered if conversion happens in a global function, a constructor method or whatever. MyUniString = ToUnicode('hello') # assumes ASCII MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS') #specified The converse applies when converting back. The encoding designators should agree with Java. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly: 1. replace offending characters with a question mark 2. try to recover intelligently (possible in some cases) 3. raise an exception A 'Unicode' designator is needed which performs a dummy conversion. File Opening: --------------- It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally endoded addresses into a Unicode string. e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file. Not sure of a good API to do this. We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open. Doing the Conversions ---------------------------- All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb. But there are compact forms which can reduce these to a few kb, and also make the data intelligible. It is obviously desirable to store stuff compactly if we can unpack it fast. Typed Strings ---------------- When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around something ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like... 
>>>ts1 = TypedString('hello', 'cp932ms') # specify encoding, it remembers it >>>ts2 = TypedString('goodbye','cp5035') >>>ts1 + ts2 #or any of a host of other encoding options EncodingError >>>ts3 = TypedString(ts1, 'cp5035') #converts it implicitly going via Unicode >>>ts4 = ts1.cast('ShiftJIS') #the developer knows that in this case the string is compatible. Going Deeper ---------------- The project I describe involved many more issues than just a straight conversion. I envisage an encodings package or module which power users could get at directly. We have be able to answer the questions: 'is string X a valid instance of encoding Y?' 'is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can. 'can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed' ? This is a really useful utility. More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data driven. So we could have an Endoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding. So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and 'cp932ms'. GUI Issues ------------- The new Pythonwin breaks somewhat on Japanese - editor windows are fine but console output is show as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility. Verifiability and printing ----------------------------- We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste. If it could save a page as HTML (just an encoding tag and data between <PRE> tags, then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings. All the encoding stuff is ultimately a bit meaningless without a way to display a character. I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding independent, and come with a whole bunch of mapping tables. So I can ask for the same Japanese font in any of about ten encodings - font name is a combination of face name and encoding. The font itself does the remapping. They make available downloadable font packs for Acrobat 4.0 for most languages now; these are good places to raid for building encoding databases. 
It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output to regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing and thus help demonstrate its own correctness (or make errors stick out like a sore thumb). Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project. But I believe we could define the standard for years to come. Furthermore, it would go a long way to making Python the corporate choice for data cleaning and transformation - territory I think we should own. Regards, Andy Robinson Robinson Analytics Ltd. ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From guido@CNRI.Reston.VA.US Tue Nov 9 16:46:41 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 11:46:41 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST." <19991109135839.25864.rocketmail@web607.mail.yahoo.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us> Andy, Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms. Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.) One specific question: in you discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate? --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin@mems-exchange.org Tue Nov 9 17:21:03 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us> Guido van Rossum writes: >I think his proposal will go a long way towards your toolkit. I hope >to hear soon from anybody who disagrees with Marc-Andre's proposal, >because without opposition this is going to be Python 1.6's offering >for i18n... The proposal seems reasonable to me. >(Together with a new Unicode regex engine by /F.) This is good news! 
Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11" From guido@CNRI.Reston.VA.US Tue Nov 9 17:26:38 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 12:26:38 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST." <14376.22527.323888.677816@amarok.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us> [AMK] > The proposal seems reasonable to me. Thanks. I really hope that this time we can move forward united... > >(Together with a new Unicode regex engine by /F.) > > This is good news! Would it be a from-scratch regex implementation, > or would it be an adaptation of an existing engine? Would it involve > modifications to the existing re module, or a completely new unicodere > module? (If, unlike re.py, it has POSIX longest-match semantics, that > would pretty much settle the question.) It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin@mems-exchange.org Tue Nov 9 17:40:07 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Guido van Rossum writes: >It's from scratch, and I believe it's got Perl style, not POSIX style >semantics -- per Tim Peters' recommendations. Do we need to open the >discussion again? No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet. >It involves a redone re module (supporting Unicode as well as 8-bit), >but its API could be unchanged. /F does the parsing and compilation >in Python, only the matching engine is in C -- not sure how that >impacts performance, but I imagine with aggressive caching it would be >okay. Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? 
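(For what it's worth, the "aggressive caching" needn't be anything fancier than memoising the compile step on the pattern and flags, so the Python-level parser/compiler only runs once per distinct pattern. A rough sketch of the idea - not /F's actual code, and the names are invented:

    # Illustration only: cache compiled patterns so the Python-level
    # compiler is paid for once per (pattern, flags) combination.
    import re

    _cache = {}

    def cached_compile(pattern, flags=0):
        key = (type(pattern), pattern, flags)   # keep str and Unicode patterns apart
        try:
            return _cache[key]
        except KeyError:
            prog = re.compile(pattern, flags)
            if len(_cache) >= 100:              # crude bound so it cannot grow forever
                _cache.clear()
            _cache[key] = prog
            return prog

The module-level helpers such as match() and search() would then go through this instead of recompiling on every call.)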
Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91 From ping@lfw.org Tue Nov 9 18:08:05 1999 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911091004240.7102-100000@localhost> On Tue, 9 Nov 1999, Andrew M. Kuchling wrote: > Guido van Rossum writes: > >It's from scratch, and I believe it's got Perl style, not POSIX style > >semantics -- per Tim Peters' recommendations. Do we need to open the > >discussion again? > > No, no; I'm actually happier with Perl-style, because it's far better > documented and familiar to people. Worse *is* better, after all. I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood. > Doing the compilation in Python is a good idea, and will make it > possible to implement alternative syntaxes. Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Tue Nov 9 18:15:04 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes: AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. 
-Barry From akuchlin@mems-exchange.org Tue Nov 9 19:36:28 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST) Subject: [Python-Dev] I18N Toolkit In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us> Barry A. Warsaw writes: (in relation to support for Unicode regexes) >We need to be careful not to worsen performance for 8bit strings. I >think we're already on the edge of acceptability w.r.t. P*** and >hopefully we can /improve/ performance here. I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_ From mhammond@skippinet.com.au Tue Nov 9 22:27:45 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 10 Nov 1999 09:27:45 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> > I think his proposal will go a long way towards your toolkit. I hope > to hear soon from anybody who disagrees with Marc-Andre's proposal, No disagreement as such, but a small hole: From the proposal: Internal Argument Parsing: -------------------------- ... 's': For Unicode objects: auto convert them to the <default encoding> and return a pointer to the object's <defencbuf> buffer. -- Excellent - if someone passes a Unicode object, it can be auto-converted to a string. This will allow "open()" to accept Unicode strings. However, there doesnt appear to be a reverse. Eg, if my extension module interfaces to a library that uses Unicode natively, how can I get a Unicode object when the user passes a string? If I had to explicitely check for a string, then check for a Unicode on failure it would get messy pretty quickly... Is it not possible to have "U" also do a conversion? Mark. From tim_one@email.msn.com Wed Nov 10 05:57:14 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 00:57:14 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <000001bf2b40$70183840$d82d153f@tim> [Guido, on "a new Unicode regex engine by /F"] > It's from scratch, and I believe it's got Perl style, not POSIX style > semantics -- per Tim Peters' recommendations. Do we need to open the > discussion again? No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. 
I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim From tim_one@email.msn.com Wed Nov 10 06:25:07 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 01:25:07 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > (under pressure from HP who want Python i18n badly and are willing to > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I can't make time for a close review now. Just one thing that hit my eye early: Python should provide a built-in constructor for Unicode strings which is available through __builtins__: u = unicode(<encoded Python string>[,<encoding name>= <default encoding>]) u = u'<utf-8 encoded Python string>' Two points on the Unicode literals (u'abc'): UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence: octet: hexdigit hexdigit unicodecode: octet octet unicode_escape: "\\u" unicodecode Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals. The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open. BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential. international-in-spite-of-himself-ly y'rs - tim From fredrik@pythonware.com Wed Nov 10 08:08:06 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:08:06 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > http://starship.skyport.net/~lemburg/unicode-proposal.txt Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... 
(besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F> From andy@robanal.demon.co.uk Wed Nov 10 08:09:26 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. First, a little nit > u = u'<utf-8 encoded Python string>' I don't like using funny prime characters - why not an explicit function like "utf8()" On to the important stuff:> > unicodec.register(<encname>,<encoder>,<decoder> > [,<stream_encoder>, <stream_decoder>]) > This registers the codecs under the given encoding > name in the module global dictionary > unicodec.codecs. Stream codecs are optional: > the unicodec module will provide appropriate > wrappers around <encoder> and > <decoder> if not given. I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this. (1) there are quite a lot of things you might want to do with an encoding object, and we could extend the interface in future easily. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another. (2) especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object (3) for some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, convert double-byte english to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge. (4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. One bit of code might be able to share the work of several encodings, by setting a few flags. Certainly true of Japanese. (5) encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object. (6) See my comments on a state machine for doing the encodings. If this is done well, we might two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven. (6) Easy to grow - encodings can be prototyped and proven in Python, ported to C if needed or when ready. In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way I'll have a go at an interface for that, and try show how it would have simplified specific problems I have faced. We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. 
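As a rough picture of what 'data-driven' could mean in practice, a single-byte Encoding object might be little more than its table. This is a sketch with invented names, not a proposal for the actual API:

    # Sketch: one shared class, all the per-encoding knowledge in the table.
    class SingleByteEncoding:
        def __init__(self, name, table):
            # table: 256 Unicode code points, or None for unmapped bytes
            self.name = name
            self.to_unicode = table
            self.from_unicode = {cp: i for i, cp in enumerate(table)
                                 if cp is not None}

        def decode(self, data):
            return ''.join(chr(self.to_unicode[b]) for b in data)

        def encode(self, text):
            return bytes(self.from_unicode[ord(ch)] for ch in text)

        def characters(self):
            # 'the set of characters to which it applies'
            return {chr(cp) for cp in self.to_unicode if cp is not None}

        def issubset(self, other):
            return self.characters() <= other.characters()

    # Latin-1 is just the identity table:
    latin1 = SingleByteEncoding('latin-1', list(range(256)))

A real one would raise a proper error for unmapped bytes and characters instead of falling over on the table lookup, and the multi-byte case is where the state machine earns its keep - but the point is that everything encoding-specific lives in data that can be pickled, shipped and compared.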
Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right. Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik@pythonware.com Wed Nov 10 08:14:21 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:14:21 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Tim Peters wrote: > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...) > This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs > directly. So, as discussed earlier, we should follow Java's lead > and also introduce a \u escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. good idea. and by some reason, patches for this is included in the unicode distribution (see the attached str2utf.c). > The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. I vote for 'outlaw'. </F> /* A small code snippet that translates \uxxxx syntax to UTF-8 text. To be cut and pasted into Python/compile.c */ /* Written by Fredrik Lundh, January 1999. */ /* Documentation (for the language reference): \uxxxx -- Unicode character with hexadecimal value xxxx. The character is stored using UTF-8 encoding, which means that this sequence can result in up to three encoded characters. Note that the 'u' must be followed by four hexadecimal digits. If fewer digits are given, the sequence is left in the resulting string exactly as given. If more digits are given, only the first four are translated to Unicode, and the remaining digits are left in the resulting string. 
*/ #define Py_CHARMASK(ch) ch void convert(const char *s, char *p) { while (*s) { if (*s != '\\') { *p++ = *s++; continue; } s++; switch (*s++) { /* -------------------------------------------------------------------- */ /* copy this section to the appropriate place in compile.c... */ case 'u': /* \uxxxx => UTF-8 encoded unicode character */ if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) && isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) { /* fetch hexadecimal character value */ unsigned int n, ch = 0; for (n = 0; n < 4; n++) { int c = Py_CHARMASK(*s); s++; ch = (ch << 4) & ~0xF; if (isdigit(c)) ch += c - '0'; else if (islower(c)) ch += 10 + c - 'a'; else ch += 10 + c - 'A'; } /* store as UTF-8 */ if (ch < 0x80) *p++ = (char) ch; else { if (ch < 0x800) { *p++ = 0xc0 | (ch >> 6); *p++ = 0x80 | (ch & 0x3f); } else { *p++ = 0xe0 | (ch >> 12); *p++ = 0x80 | ((ch >> 6) & 0x3f); *p++ = 0x80 | (ch & 0x3f); } } break; } else goto bogus; /* -------------------------------------------------------------------- */ default: bogus: *p++ = '\\'; *p++ = s[-1]; break; } } *p++ = '\0'; } main() { int i; unsigned char buffer[100]; convert("Link\\u00f6ping", buffer); for (i = 0; buffer[i]; i++) if (buffer[i] < 0x20 || buffer[i] >= 0x80) printf("\\%03o", buffer[i]); else printf("%c", buffer[i]); } From gstein@lyra.org Thu Nov 11 09:18:52 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST) Subject: [Python-Dev] Re: Internal Format In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911110116050.638-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Fredrik Lundh wrote: > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik@pythonware.com Wed Nov 10 08:24:16 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:24:16 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > One specific question: in you discussion of typed strings, I'm not > sure why you couldn't convert everything to Unicode and be done with > it. I have a feeling that the answer is somewhere in your case study > -- maybe you can elaborate? Marc-Andre writes: Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. 
This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object. keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environ- ment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F> From mal@lemburg.com Wed Nov 10 09:04:39 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:04:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> Message-ID: <38293527.3CF5C7B0@lemburg.com> Mark Hammond wrote: > > > I think his proposal will go a long way towards your toolkit. I > hope > > to hear soon from anybody who disagrees with Marc-Andre's proposal, > > No disagreement as such, but a small hole: > > >From the proposal: > > Internal Argument Parsing: > -------------------------- > ... > 's': For Unicode objects: auto convert them to the <default encoding> > and return a pointer to the object's <defencbuf> buffer. > > -- > Excellent - if someone passes a Unicode object, it can be > auto-converted to a string. This will allow "open()" to accept > Unicode strings. Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though. > However, there doesnt appear to be a reverse. Eg, if my extension > module interfaces to a library that uses Unicode natively, how can I > get a Unicode object when the user passes a string? If I had to > explicitely check for a string, then check for a Unicode on failure it > would get messy pretty quickly... Is it not possible to have "U" also > do a conversion? "U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. 
Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 09:49:01 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:49:01 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <38293F8D.F60AE605@lemburg.com> Tim Peters wrote: > > > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > > (under pressure from HP who want Python i18n badly and are willing to > > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt > > I can't make time for a close review now. Just one thing that hit my eye > early: > > Python should provide a built-in constructor for Unicode strings > which is available through __builtins__: > > u = unicode(<encoded Python string>[,<encoding name>= > <default encoding>]) > > u = u'<utf-8 encoded Python string>' > > Two points on the Unicode literals (u'abc'): > > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs directly. So, as > discussed earlier, we should follow Java's lead and also introduce a \u > escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. It would be more conform to use the Unicode ordinal (instead of interpreting the number as UTF8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file or the Unicode book for that matter. > The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW. > BTW, is ord(unicode_char) defined? And as what? And does ord have an > inverse in the Unicode world? Both seem essential. Good points. 
How about uniord(u[:1]) --> Unicode ordinal number (32-bit) unichr(i) --> Unicode object for character i (provided it is 32-bit); ValueError otherwise They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study). I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@pythonware.com Wed Nov 10 10:50:05 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:50:05 +0100 Subject: regexp performance (Re: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote: > (Speeding up PCRE -- that's another question. I'm often tempted to > rewrite pcre_compile to generate an easier-to-analyse parse tree, > instead of its current complicated-but-memory-parsimonious compiler, > but I'm very reluctant to introduce a fork like that.) any special pattern constructs that are in need of per- formance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferrably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F> From gstein@lyra.org Thu Nov 11 10:46:55 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293527.3CF5C7B0@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: >... > Well almost... it depends on the current value of <default encoding>. Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed. Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either... I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly. >... > Another problem is that Unicode types differ between platforms > (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit > wchar_t). Depending on the internal format of Unicode objects > this could mean calling different conversion APIs. Exactly the reason to avoid wchar_t. > BTW, I'm still not too sure about the underlying internal format. 
> The problem here is that Unicode started out as 2-byte fixed length > representation (UCS2) but then shifted towards a 4-byte fixed length > reprensetation known as UCS4. Since having 4 bytes per character > is hard sell to customers, UTF16 was created to stuff the UCS4 > code points (this is how character entities are called in Unicode) > into 2 bytes... with a variable length encoding. History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now? >... > The downside of using UTF16: it is a variable length format, > so iterations over it will be slower than for UCS4. Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable length format, when people seem to be doing fine with UCS-2? Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own. One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people can work with the values entirely in Python, PyUnicodeType-be-damned. Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; that the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice-versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there. > Simply sticking to UCS2 is probably out of the question, > since Unicode 3.0 requires UCS4 and we are targetting > Unicode 3.0. Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik@pythonware.com Wed Nov 10 10:52:28 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:52:28 +0100 Subject: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> (a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote: > I don't think that will be a problem, given that the Unicode engine > would be a separate C implementation. A bit of 'if type(strg) == > UnicodeType' in re.py isn't going to cost very much speed. a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious: ordinary pattern, ordinary string unicode pattern, unicode string but what about these? ordinary pattern, unicode string unicode pattern, ordinary string "coercing" patterns (i.e. 
recompiling, on demand) seem to be a somewhat risky business ;-) </F> From gstein@lyra.org Thu Nov 11 10:50:56 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293F8D.F60AE605@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > inverse in the Unicode world? Both seem essential. > > Good points. > > How about > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > ValueError otherwise Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/ From jcw@equi4.com Wed Nov 10 11:14:17 1999 From: jcw@equi4.com (Jean-Claude Wippler) Date: Wed, 10 Nov 1999 12:14:17 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38295389.397DDE5E@equi4.com> Greg Stein wrote: [MAL:] > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > is doing (as I recall). Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whetever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG, was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude From gstein@lyra.org Thu Nov 11 11:17:56 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295389.397DDE5E@equi4.com> Message-ID: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Jean-Claude Wippler wrote: > Greg Stein wrote: > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? > > I couldn't find a discussion in the last months of the string SIG, was > this decided upon and frozen long ago? Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Wed Nov 10 09:55:42 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Wed, 10 Nov 1999 10:55:42 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <3829411E.FD32F8CC@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > One specific question: in you discussion of typed strings, I'm not > > sure why you couldn't convert everything to Unicode and be done with > > it. I have a feeling that the answer is somewhere in your case study > > -- maybe you can elaborate? > > Marc-Andre writes: > > Unicode objects should have a pointer to a cached (read-only) char > buffer <defencbuf> holding the object's value using the current > <default encoding>. This is needed for performance and internal > parsing (see below) reasons. The buffer is filled when the first > conversion request to the <default encoding> is issued on the object. > > keeping track of an external encoding is better left > for the application programmers -- I'm pretty sure that > different application builders will want to handle this > in radically different ways, depending on their environ- > ment, underlying user interface toolkit, etc. It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways. > besides, this is how Tcl would have done it. Python's > not Tcl, and I think you need *very* good arguments > for moving in that direction. > > </F> > > _______________________________________________ > Python-Dev maillist - Python-Dev@python.org > http://www.python.org/mailman/listinfo/python-dev -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 11:42:00 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 12:42:00 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> Message-ID: <38295A08.D3928401@lemburg.com> Andy Robinson wrote: > > In general, I like this proposal a lot, but I think it > only covers half the story. How we actually build the > encoder/decoder for each encoding is a very big issue. > Thoughts on this below. > > First, a little nit > > u = u'<utf-8 encoded Python string>' > I don't like using funny prime characters - why not an > explicit function like "utf8()" u = unicode('...I am UTF8...','utf-8') will do just that. I've moved to Tim's proposal with the \uXXXX encoding for u'', BTW. > On to the important stuff:> > > unicodec.register(<encname>,<encoder>,<decoder> > > [,<stream_encoder>, <stream_decoder>]) > > > This registers the codecs under the given encoding > > name in the module global dictionary > > unicodec.codecs. Stream codecs are optional: > > the unicodec module will provide appropriate > > wrappers around <encoder> and > > <decoder> if not given. > > I would MUCH prefer a single 'Encoding' class or type > to wrap up these things, rather than up to four > disconnected objects/functions. Essentially it would > be an interface standard and would offer methods to do > the four things above. > > There are several reasons for this. > > ... 
> > In summary, firm up the concept of an Encoding object > and give it room to grow - that's the key to > real-world usefulness. If people feel the same way > I'll have a go at an interface for that, and try show > how it would have simplified specific problems I have > faced. Ok, you have a point there. Here's a proposal (note that this only defines an interface, not a class structure): Codec Interface Definition: --------------------------- The following base class should be defined in the module unicodec. class Codec: def encode(self,u): """ Return the Unicode object u encoded as Python string. """ ... def decode(self,s): """ Return an equivalent Unicode object for the encoded Python string s. """ ... def dump(self,u,stream,slice=None): """ Writes the Unicode object's contents encoded to the stream. stream must be a file-like object open for writing binary data. If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def load(self,stream,length=None): """ Reads an encoded string (up to <length> bytes) from the stream and returns an equivalent Unicode object. stream must be a file-like object open for reading binary data. If length is given, only length bytes are read. Note that this can cause the decoding algorithm to fail due to truncations in the encoding. """ ... the base class should provide a default implementation of this method using self.encode ... Codecs should raise an UnicodeError in case the conversion is not possible. It is not required by the unicodec.register() API to provide a subclass of this base class, only the 4 given methods must be present. This allows writing Codecs as extensions types. XXX Still to be discussed: · support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ ) · support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings. · support for numbers, digits, whitespace, etc. · support (or no support) for private code point areas > We also need to think about where encoding info will > live. You cannot avoid mapping tables, although you > can hide them inside code modules or pickled objects > if you want. Should there be a standard > "..\Python\Enc" directory? Mapping tables should be incorporated into the codec modules preferably as static C data. That way multiple processes can share the same data. > And we're going to need some kind of testing and > certification procedure when adding new encodings. > This stuff has to be right. I will have to rely on your cooperation for the test data. Roundtrip testing is easy to implement, but I will also have to verify the output against prechecked data which is probably only creatable using visual tools to which I don't have access (e.g. a Japanese Windows installation). > Guido asked about TypedString. This can probably be > done on top of the built-in stuff - it is just a > convenience which would clarify intent, reduce lines > of code and prevent people shooting themselves in the > foot when juggling a lot of strings in different > (non-Unicode) encodings. I can do a Python module to > implement that on top of whatever is built. Ok. 
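As a quick sanity check of the interface above, here is what a trivial codec written against it could look like. This is only a sketch: it uses today's byte-string/text split to stand in for 'encoded Python string' vs 'Unicode object', and the registration line at the end is of course still hypothetical:

    # Sketch of a codec implementing the four proposed methods.
    class AsciiCodec:

        def encode(self, u):
            # Return the Unicode object u encoded as a byte string.
            try:
                return u.encode('ascii')
            except UnicodeEncodeError:
                raise UnicodeError('character not encodable as ASCII')

        def decode(self, s):
            # Return an equivalent Unicode object for the encoded string s.
            try:
                return s.decode('ascii')
            except UnicodeDecodeError:
                raise UnicodeError('byte outside the ASCII range')

        def dump(self, u, stream, slice=None):
            if slice is not None:
                u = u[slice]
            stream.write(self.encode(u))

        def load(self, stream, length=None):
            data = stream.read() if length is None else stream.read(length)
            return self.decode(data)

    # registration would then look something like:
    #   unicodec.register('ascii', AsciiCodec().encode, AsciiCodec().decode)

Even a codec this small shows why default dump/load implementations in the base class are worth having.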
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 10:03:36 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 11:03:36 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <382942F8.1921158E@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > http://starship.skyport.net/~lemburg/unicode-proposal.txt > > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... > > (besides, this is exactly how it's already done in > unicode.c and what 'sre' prefers...) Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@pythonware.com Wed Nov 10 12:32:16 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:32:16 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> > What I don't like is using wchar_t if available (and then addressing > it as if it were defined as unsigned integer). IMO, it's better > to define a Python Unicode representation which then gets converted > to whatever wchar_t represents on the target machine. you should read the unicode.h file a bit more carefully: ... /* Unicode declarations. Tweak these to match your platform */ /* set this flag if the platform has "wchar.h", "wctype.h" and the wchar_t type is a 16-bit unsigned type */ #define HAVE_USABLE_WCHAR_H #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) (this uses wchar_t, and also iswspace and friends) ... #else /* Use if you have a standard ANSI compiler, without wchar_t support. If a short is not 16 bits on your platform, you have to fix the typedef below, or the module initialization code will complain. 
*/ (this maps iswspace to isspace, for 8-bit characters). #endif ... the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own uni- code database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F> From fredrik@pythonware.com Wed Nov 10 12:39:56 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:39:56 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Greg Stein <gstein@lyra.org> wrote: > Have you ever noticed how Python modules, packages, tools, etc, never > define an import hook? hey, didn't MAL use one in one of his mx kits? ;-) > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > something else, then do that explicitly. exactly. modes are evil. python is not perl. etc. > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F> From mal@lemburg.com Wed Nov 10 12:44:39 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:44:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> Message-ID: <382968B7.ABFFD4C0@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > > Tim Peters wrote: > > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > > inverse in the Unicode world? Both seem essential. > > > > Good points. > > > > How about > > > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > > ValueError otherwise > > Why new functions? Why not extend the definition of ord() and chr()? > > In terms of backwards compatibility, the only issue could possibly be that > people relied on chr(x) to throw an error when x>=256. They certainly > couldn't pass a Unicode object to ord(), so that function can safely be > extended to accept a Unicode object and return a larger integer. Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 13:08:30 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:08:30 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38296E4E.914C0ED7@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > >... > > Well almost... it depends on the current value of <default encoding>. > > Default encodings are kind of nasty when they can be altered. The same > problem occurred with import hooks. Only one can be present at a time. 
> This implies that modules, packages, subsystems, whatever, cannot set a > default encoding because something else might depend on it having a > different value. In the end, nobody uses the default encoding because it > is unreliable, so you end up with extra implementation/semantics that > aren't used/needed. I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time. > [...] > > > BTW, I'm still not too sure about the underlying internal format. > > The problem here is that Unicode started out as 2-byte fixed length > > representation (UCS2) but then shifted towards a 4-byte fixed length > > reprensetation known as UCS4. Since having 4 bytes per character > > is hard sell to customers, UTF16 was created to stuff the UCS4 > > code points (this is how character entities are called in Unicode) > > into 2 bytes... with a variable length encoding. > > History is basically irrelevant. What is the situation today? What is in > use, and what are people planning for right now? > > >... > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl is > doing (as I recall). > > Why go with a variable length format, when people seem to be doing fine > with UCS-2? The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasable and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ): """ Q: How about using UCS-4 interfaces in my APIs? Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments. Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels. Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc. Outside of indexing, differences between UCS-4 and UTF-16 are not as important. 
For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string. That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """ > Like I said in the other mail note: two large platforms out there are > UCS-2 based. They seem to be doing quite well with that approach. > > If people truly need UCS-4, then they can work with that on their own. One > of the major reasons for putting Unicode into Python is to > increase/simplify its ability to speak to the underlying platform. Hey! > Guess what? That generally means UCS2. All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4). > If we didn't need to speak to the OS with these Unicode values, then > people can work with the values entirely in Python, > PyUnicodeType-be-damned. > > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. > > Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and > vice-versa. But: it only does it from String to String -- you can't use > Unicode objects anywhere in there. See above. > > Simply sticking to UCS2 is probably out of the question, > > since Unicode 3.0 requires UCS4 and we are targetting > > Unicode 3.0. > > Oh? Who says? >From the FAQ: """ Q: What is UTF-16? Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16. """ Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligable, e.g. we could define the internal format to be UTF-16 and raise exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as political compromise ;-). But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 12:36:44 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:36:44 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> Message-ID: <382966DC.F33E340E@lemburg.com> Fredrik Lundh wrote: > > > What I don't like is using wchar_t if available (and then addressing > > it as if it were defined as unsigned integer). 
IMO, it's better > > to define a Python Unicode representation which then gets converted > > to whatever wchar_t represents on the target machine. > > you should read the unicode.h file a bit more carefully: > > ... > > /* Unicode declarations. Tweak these to match your platform */ > > /* set this flag if the platform has "wchar.h", "wctype.h" and the > wchar_t type is a 16-bit unsigned type */ > #define HAVE_USABLE_WCHAR_H > > #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) > > (this uses wchar_t, and also iswspace and friends) > > ... > > #else > > /* Use if you have a standard ANSI compiler, without wchar_t support. > If a short is not 16 bits on your platform, you have to fix the > typedef below, or the module initialization code will complain. */ > > (this maps iswspace to isspace, for 8-bit characters). > > #endif > > ... > > the plan was to use the second solution (using "configure" > to figure out what integer type to use), and its own uni- > code database table for the is/to primitives Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16. > (iirc, the unicode.txt file discussed this, but that one > seems to be missing from the zip archive). It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 10 13:13:10 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:13:10 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <38295389.397DDE5E@equi4.com> Message-ID: <38296F66.5DF9263E@lemburg.com> Jean-Claude Wippler wrote: > > Greg Stein wrote: > [MAL:] > > > The downside of using UTF16: it is a variable length format, > > > so iterations over it will be slower than for UCS4. > > > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin@mems-exchange.org Wed Nov 10 14:56:16 1999 From: akuchlin@mems-exchange.org (Andrew M. 
Kuchling) Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST) Subject: [Python-Dev] Re: regexp performance In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us> [Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes: >any special pattern constructs that are in need of per- >formance improvements? (compared to Perl, that is). In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string contained 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE. >or maybe anyone has an extensive performance test >suite for perlish regular expressions? (preferrably based >on how real people use regular expressions, not only on >things that are known to be slow if not optimized) Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4" From jack@oratrix.nl Wed Nov 10 15:04:58 1999 From: jack@oratrix.nl (Jack Jansen) Date: Wed, 10 Nov 1999 16:04:58 +0100 Subject: [Python-Dev] I18N Toolkit In-Reply-To: Message by "Fredrik Lundh" <fredrik@pythonware.com> , Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl> > a slightly hairer design issue is what combinations > of pattern and string the new 're' will handle. > > the first two are obvious: > > ordinary pattern, ordinary string > unicode pattern, unicode string > > but what about these? > > ordinary pattern, unicode string > unicode pattern, ordinary string I think the logical thing to do would be to "promote" the ordinary pattern or string to unicode, in a similar way to what happens if you combine ints and floats in a single expression. The result may be a bit surprising if your pattern is in ascii and you've never been aware of unicode and are given such a string from somewhere else, but then if you're only aware of integer arithmetic and are suddenly presented with a couple of floats you'll also be pretty surprised at the result. At least it's easily explained. 
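A minimal sketch of the promotion rule described here, assuming the unicode() constructor proposed elsewhere in this thread (the helper name and the default encoding are made up, and this is not the actual 're' machinery):

    # Illustrative only: promote whichever argument is an 8-bit string,
    # the same way an int is promoted to a float in mixed arithmetic.
    def coerce_args(pattern, string, encoding='ascii'):
        if type(pattern) is type(string):
            return pattern, string                  # nothing to do
        if type(pattern) is type(''):
            pattern = unicode(pattern, encoding)    # promote the pattern
        else:
            string = unicode(string, encoding)      # promote the string
        return pattern, string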
-- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From fdrake@acm.org Wed Nov 10 15:22:17 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST) Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us> Fredrik Lundh writes: > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From skip@mojam.com (Skip Montanaro) Wed Nov 10 15:54:30 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <14377.38198.793496.870273@dolphin.mojam.com> Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake@acm.org Wed Nov 10 15:58:30 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295A08.D3928401@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > def encode(self,u): > > """ Return the Unicode object u encoded as Python string. This should accept an optional slice parameter, and use it in the same way as .dump(). > def dump(self,u,stream,slice=None): ... > def load(self,stream,length=None): Why not have something like .wrapFile(f) that returns a file-like object with all the file methods implemented, and doing to "right thing" regarding encoding/decoding? That way, the new file-like object can be used directly with code that works with files and doesn't care whether it uses 8-bit or unicode strings. > Codecs should raise an UnicodeError in case the conversion is > not possible. I think that should be ValueError, or UnicodeError should be a subclass of ValueError. 
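A rough sketch of the codec interface quoted above, with the exception arrangement suggested here. Latin-1 as the sample encoding, the class name and the (start, stop) slice handling are illustrative only, and unicode() is the constructor proposed in this thread rather than an existing builtin:

    class UnicodeError(ValueError):
        pass

    class Latin1Codec:

        def encode(self, u):
            """Return the Unicode object u encoded as a Python string."""
            chars = []
            for ch in u:
                if ord(ch) > 255:
                    raise UnicodeError('character not representable in Latin-1')
                chars.append(chr(ord(ch)))
            return ''.join(chars)

        def dump(self, u, stream, slice=None):
            if slice is not None:
                u = u[slice[0]:slice[1]]            # assumed (start, stop) pair
            stream.write(self.encode(u))

        def load(self, stream, length=None):
            if length is None:
                data = stream.read()
            else:
                data = stream.read(length)
            return unicode(data, 'latin-1')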
(Can the -X interpreter option be removed yet?) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Wed Nov 10 16:41:29 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> <14377.38198.793496.870273@dolphin.mojam.com> Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us> >>>>> "SM" == Skip Montanaro <skip@mojam.com> writes: SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry From mal@lemburg.com Wed Nov 10 13:31:47 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:31:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Message-ID: <382973C3.DCA77051@lemburg.com> Fredrik Lundh wrote: > > Greg Stein <gstein@lyra.org> wrote: > > Have you ever noticed how Python modules, packages, tools, etc, never > > define an import hook? > > hey, didn't MAL use one in one of his mx kits? ;-) Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter. > > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > > something else, then do that explicitly. > > exactly. > > modes are evil. python is not perl. etc. But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals). > > Are we digging a hole for ourselves? Maybe. But there are two other big > > platforms that have the same hole to dig out of *IF* it ever comes to > > that. I posit that it won't be necessary; that the people needing UCS-4 > > can do so entirely in Python. > > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Wed Nov 10 21:36:04 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 11 Nov 1999 08:36:04 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Marc writes: > > modes are evil. python is not perl. etc. > > But a requirement by the customer... they want to be able to > set the locale > on a per thread basis. Not exactly my preference (I think all locale > settings should be passed as parameters, not via globals). Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants. 
For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized. I agree that having a default encoding that can be changed is a bad idea. It may make 3 line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do? Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable. Mark. > > > > Are we digging a hole for ourselves? Maybe. But there are > two other big > > > platforms that have the same hole to dig out of *IF* it > ever comes to > > > that. I posit that it won't be necessary; that the people > needing UCS-4 > > > can do so entirely in Python. > > > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? > > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). > > -- > Marc-Andre Lemburg > ______________________________________________________________________ > Y2000: 51 days left > Business: http://www.lemburg.com/ > Python Pages: http://www.lemburg.com/python/ > > > _______________________________________________ > Python-Dev maillist - Python-Dev@python.org > http://www.python.org/mailman/listinfo/python-dev > From gstein@lyra.org Thu Nov 11 23:14:55 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> On Thu, 11 Nov 1999, Mark Hammond wrote: > Marc writes: > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. Ha! I was getting ready to say exactly the same thing. Are building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-) >... > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. 
Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-) > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. *bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/ From da@ski.org Wed Nov 10 23:21:54 1999 From: da@ski.org (David Ascher) Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> Message-ID: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> On Thu, 11 Nov 1999, Greg Stein wrote: > Ha! I was getting ready to say exactly the same thing. Are building Python > for a particular customer, or are we building it to Do The Right Thing? > > I've been getting increasingly annoyed at "well, HP says this" or "HP > wants that." I'm ecstatic that they are a Consortium member and are > helping to fund the development of Python. However, if that means we are > selling Python's soul to corporate wishes rather than programming and > design ideals... well, it reduces my enthusiasm :-) What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david From gstein@lyra.org Thu Nov 11 23:31:43 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911111531200.18059-100000@nebula.lyra.org> Damn, you're smooth... 
maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote: > On Thu, 11 Nov 1999, Greg Stein wrote: > > > Ha! I was getting ready to say exactly the same thing. Are building Python > > for a particular customer, or are we building it to Do The Right Thing? > > > > I've been getting increasingly annoyed at "well, HP says this" or "HP > > wants that." I'm ecstatic that they are a Consortium member and are > > helping to fund the development of Python. However, if that means we are > > selling Python's soul to corporate wishes rather than programming and > > design ideals... well, it reduces my enthusiasm :-) > > What about just explaining the rationale for the default-less point of > view to whoever is in charge of this at HP and see why they came up with > their rationale in the first place? They might have a good reason, or > they might be willing to change said requirement. > > --david > -- Greg Stein, http://www.lyra.org/ From tim_one@email.msn.com Thu Nov 11 06:25:27 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:25:27 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Message-ID: <000201bf2c0d$8b866160$262d153f@tim> [/F, dripping with code] > ... > Note that the 'u' must be followed by four hexadecimal digits. If > fewer digits are given, the sequence is left in the resulting string > exactly as given. Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards] > \uxxxx -- Unicode character with hexadecimal value xxxx. The > character is stored using UTF-8 encoding, which means that this > sequence can result in up to three encoded characters. The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). From tim_one@email.msn.com Thu Nov 11 06:49:16 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:49:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> Message-ID: <000501bf2c10$df4679e0$262d153f@tim> [ Greg Stein] > ... 
> Things will be a lot faster if we have a fixed-size character. Variable > length formats like UTF-8 are a lot harder to slice, search, etc. The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>. > (IMO) a big reason for this new type is for interaction with the > underlying OS/platform. I don't know of any platforms right now that > really use UTF-8 as their Unicode string representation (meaning we'd > have to convert back/forth from our UTF-8 representation to talk to the > OS). No argument here. From tim_one@email.msn.com Thu Nov 11 06:56:35 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:56:35 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382968B7.ABFFD4C0@lemburg.com> Message-ID: <000601bf2c11$e4b07920$262d153f@tim> [MAL, on Unicode chr() and ord() > ... > Because unichr() will always have to return Unicode objects. You don't > want chr(i) to return Unicode for i>255 and strings for i<256. Indeed I do not! > OTOH, ord() could probably be extended to also work on Unicode objects. I think should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string. From tim_one@email.msn.com Thu Nov 11 07:03:34 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:03:34 -0500 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us> Message-ID: <000701bf2c12$de8bca80$262d153f@tim> [Andrew M. Kuchling] > ... > Friedl's book describes several optimizations which aren't implemented > in PCRE. The problem is that PCRE never builds a parse tree, and > parse trees are easy to analyse recursively. Instead, PCRE's > functions actually look at the compiled byte codes (for example, look > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > functions hard to write, and rearranging the code near-impossible. This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... 
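To make the parse-tree point concrete, here is a toy sketch (nothing to do with the real PCRE or pypcre.c code) of why an analysis such as is_anchored becomes a few lines of recursion once a tree exists, whereas asking the same question of flat byte codes means decoding opcodes by hand:

    class Literal:
        def __init__(self, text): self.text = text
    class Sequence:
        def __init__(self, parts): self.parts = parts
    class Alternation:
        def __init__(self, branches): self.branches = branches
    class Anchor:                       # '^'
        pass

    def is_anchored(node):
        if isinstance(node, Anchor):
            return 1
        if isinstance(node, Sequence):
            return len(node.parts) > 0 and is_anchored(node.parts[0])
        if isinstance(node, Alternation):
            for branch in node.branches:
                if not is_anchored(branch):
                    return 0
            return 1
        return 0                        # literals, etc.

    # '^abc|^def' is anchored as a whole; 'abc|^def' is not.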
the-more-things-differ-the-more-things-don't-ly y'rs - tim From tim_one@email.msn.com Thu Nov 11 07:27:52 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:27:52 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim> [/F] > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? [MAL] > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim From andy@robanal.demon.co.uk Thu Nov 11 07:29:05 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com> > 2. Are there plans for an internationalization > session at IPC8? Perhaps a > few key players could be locked into a room for a > couple days, to emerge > bloodied, but with an implementation in-hand... Excellent idea. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From tim_one@email.msn.com Thu Nov 11 07:29:50 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:29:50 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <000901bf2c16$8a107420$262d153f@tim> [Mark Hammond] > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > ... > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. Well said, Mark! Me too. It's like HP is suffering from Windows envy <wink>. From andy@robanal.demon.co.uk Thu Nov 11 07:30:53 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com> --- "Barry A. 
Warsaw" <bwarsaw@cnri.reston.va.us> wrote: > > I'm starting to think about devday topics. Sounds > like an I18n > session would be very useful. Champions? > I'm willing to explain what the fuss is about to bemused onlookers and give some examples of problems it should be able to solve - plenty of good slides and screen shots. I'll stay well away from the C implementation issues. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From andy@robanal.demon.co.uk Thu Nov 11 07:33:25 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> > > What about just explaining the rationale for the > default-less point of > view to whoever is in charge of this at HP and see > why they came up with > their rationale in the first place? They might have > a good reason, or > they might be willing to change said requirement. > > --david For that matter (I came into this a bit late), is there a statement somewhere of what HP actually want to do? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From andy@robanal.demon.co.uk Thu Nov 11 07:44:50 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> > I say axe it and say "UTF-8" is the fixed, default > encoding. If you want > something else, then do that explicitly. > Let me tell you why you would want to have an encoding which can be set: (1) sday I am on a Japanese Windows box, I have a string called 'address' and I do 'print address'. If I see utf8, I see garbage. If I see Shift-JIS, I see the correct Japanese address. At this point in time, utf8 is an interchange format but 99% of the world's data is in various native encodings. Analogous problems occur on input. (2) I'm using htmlgen, which 'prints' objects to standard output. My web site is supposed to be encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, etc.) Yes, browsers CAN detect and display UTF8 but you just don't find UTF8 sites in the real world - and most users just don't know about the encoding menu, and will get pissed off if they have to reach for it. Ditto for streaming output in some protocol. Java solves this (and we could too by hacking stdout) using Writer classes which are created as wrappers around an output stream and can take an encoding, but you lose the flexibility to 'just print'. I think being able to change encoding would be useful. What I do not want is to auto-detect it from the operating system when Python boots - that would be a portability nightmare. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From fredrik@pythonware.com Thu Nov 11 08:06:04 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 11 Nov 1999 09:06:04 +0100 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance References: <000701bf2c12$de8bca80$262d153f@tim> Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com> Tim Peters <tim_one@email.msn.com> wrote: > > The problem is that PCRE never builds a parse tree, and > > parse trees are easy to analyse recursively. Instead, PCRE's > > functions actually look at the compiled byte codes (for example, look > > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > > functions hard to write, and rearranging the code near-impossible. > > This is wonderfully & ironically Pythonic. That is, the Python compiler > itself goes straight to byte code, and the optimization that's done works at > the latter low level. yeah, but by some reason, people (including GvR) expect a regular expression machinery to be more optimized than the language interpreter ;-) </F> From tim_one@email.msn.com Thu Nov 11 08:01:58 1999 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 03:01:58 -0500 Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> Message-ID: <000c01bf2c1b$0734c060$262d153f@tim> [Andy Robinson] > For that matter (I came into this a bit late), is > there a statement somewhere of what HP actually want > to do? On this list, the best explanation we got was from Guido: they want "internationalization", and "Perl-compatible Unicode regexps". I'm not sure they even know the two aren't identical <0.9 wink>. code-without-requirements-is-like-sex-without-consequences-ly y'rs - tim From guido@CNRI.Reston.VA.US Thu Nov 11 12:03:51 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 11 Nov 1999 07:03:51 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST." <19991111074450.20451.rocketmail@web606.mail.yahoo.com> References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us> > Let me tell you why you would want to have an encoding > which can be set: > > (1) sday I am on a Japanese Windows box, I have a > string called 'address' and I do 'print address'. If > I see utf8, I see garbage. If I see Shift-JIS, I see > the correct Japanese address. At this point in time, > utf8 is an interchange format but 99% of the world's > data is in various native encodings. > > Analogous problems occur on input. > > (2) I'm using htmlgen, which 'prints' objects to > standard output. My web site is supposed to be > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > etc.) Yes, browsers CAN detect and display UTF8 but > you just don't find UTF8 sites in the real world - and > most users just don't know about the encoding menu, > and will get pissed off if they have to reach for it. > > Ditto for streaming output in some protocol. > > Java solves this (and we could too by hacking stdout) > using Writer classes which are created as wrappers > around an output stream and can take an encoding, but > you lose the flexibility to 'just print'. > > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. 
You almost convinced me there, but I think this can still be done without changing the default encoding: simply reopen stdout with a different encoding. This is how Java does it. I/O streams with an encoding specified at open() are a very powerful feature. You can hide this in your $PYTHONSTARTUP. François Pinard might not like it though... BTW, someone asked what HP asked for: I can't reveal what exactly they asked for, basically because they don't seem to agree amongst themselves. The only firm statements I have is that they want i18n and that they want it fast (before the end of the year). The desire from Perl-compatible regexps comes from me, and the only reason is compatibility with re.py. (HP did ask for regexps, but they don't know the difference between POSIX and Perl if it poked them in the eye.) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein@lyra.org Thu Nov 11 12:20:39 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit (fwd) Message-ID: <Pine.LNX.4.10.9911110419400.27203-100000@nebula.lyra.org> Andy originally sent this just to me... I replied in kind, but saw that he sent another copy to python-dev. Sending my reply there... ---------- Forwarded message ---------- Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST) From: Greg Stein <gstein@lyra.org> To: andy@robanal.demon.co.uk Subject: Re: [Python-Dev] Internationalization Toolkit [ note: you sent direct to me; replying in kind in case that was your intent ] On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote: >... > Let me tell you why you would want to have an encoding > which can be set: >...snip: two examples of how "print" fails... Neither of those examples are solid reasons for having a default encoding that can be changed. Both can easily be altered at the Python level by using an encoding function before printing. You're asking for convenience, *not* providing a reason. > Java solves this (and we could too) using Writer > classes which are created as wrappers around an output > stream and can take an encoding, but you lose the > flexibility to just print. Not flexibility: convenience. You can certainly do: print encode(u,'Shift-JIS') > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. Useful, but not a requirement. Keep the interpreter simple, understandable, and predictable. A module that changes the default over to 'utf-8' because it is interacting with a network object is going to screw up your app if you're relying on an encoding of 'shift-jis' to be present. Cheers, -g -- Greg Stein, http://www.lyra.org/ From andy@robanal.demon.co.uk Thu Nov 11 12:49:10 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com> > You almost convinced me there, but I think this can > still be done > without changing the default encoding: simply reopen > stdout with a > different encoding. This is how Java does it. I/O > streams with an > encoding specified at open() are a very powerful > feature. You can > hide this in your $PYTHONSTARTUP. Good point, I'm happy with this. Make sure we specify it in the docs as the right way to do it. In an IDE, we'd have an Options screen somewhere for the output encoding. 
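A minimal sketch of the "reopen stdout with an encoding" idea, e.g. dropped into $PYTHONSTARTUP. The wrapper class is made up for illustration, and u.encode() is the method proposed in this thread, not an existing API:

    import sys

    class EncodedWriter:
        """Wrap a byte stream; encode Unicode objects on the way out."""
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def write(self, obj):
            if hasattr(obj, 'encode'):          # assumed Unicode object
                obj = obj.encode(self.encoding)
            self.stream.write(obj)
        def flush(self):
            self.stream.flush()

    sys.stdout = EncodedWriter(sys.stdout, 'shift-jis')
    # from here on, a plain 'print address' comes out as Shift-JIS bytes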
What the Java code I have seen does is to open a raw file and construct wrappers (InputStreamReader, OutputStreamWriter) around it to do an encoding conversion. This kind of obfuscates what is going on - Python just needs the extra argument. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal@lemburg.com Thu Nov 11 12:42:51 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 13:42:51 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> Message-ID: <382AB9CB.634A9782@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > def encode(self,u): > > > > """ Return the Unicode object u encoded as Python string. > > This should accept an optional slice parameter, and use it in the > same way as .dump(). Ok. > > def dump(self,u,stream,slice=None): > ... > > def load(self,stream,length=None): > > Why not have something like .wrapFile(f) that returns a file-like > object with all the file methods implemented, and doing to "right > thing" regarding encoding/decoding? That way, the new file-like > object can be used directly with code that works with files and > doesn't care whether it uses 8-bit or unicode strings. See File Output of the latest version: File/Stream Output: ------------------- Since file.write(object) and most other stream writers use the 's#' argument parsing marker, the buffer interface implementation determines the encoding to use (see Buffer Interface). For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. > > Codecs should raise an UnicodeError in case the conversion is > > not possible. > > I think that should be ValueError, or UnicodeError should be a > subclass of ValueError. Ok. > (Can the -X interpreter option be removed yet?) Doesn't Python convert class exceptions to strings when -X is used ? I would guess that many scripts already rely on the class based mechanism (much of my stuff does for sure), so by the time 1.6 is out, I think -X should be considered an option to run pre 1.5 code rather than using it for performance reasons. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 13:01:40 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 14:01:40 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <382ABE34.5D27C701@lemburg.com> Mark Hammond wrote: > > Marc writes: > > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. 
Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > > For example, if we went with UTF-8 as the only default encoding, then > HP may be forced to use a helper function to perform the conversion, > rather than the built-in functions. This helper function can use TLS > (in Python) to store the encoding. At least it is localized. > > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? > > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it and if you do you are on your own when something breaks. Think of it as a HP specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Thu Nov 11 15:02:32 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > For explicit handling of Unicode using files, the unicodec module > could provide stream wrappers which provide transparent > encoding/decoding for any open stream (file-like object): Sounds good to me! I guess I just missed, there's been so much going on lately. > XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. Actually, I'd call it unicodec.open(). I asked: > (Can the -X interpreter option be removed yet?) You commented: > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. Gosh, I never thought of it as a performance issue! What I'd like to do is avoid code like this: try: class UnicodeError(ValueError): # well, something would probably go here... 
pass except TypeError: class UnicodeError: # something slightly different for this one... pass Trying to use class exceptions can be really tedious, and often I'd like to pick up the stuff from Exception. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From mal@lemburg.com Thu Nov 11 14:21:50 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:21:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2c0d$8b866160$262d153f@tim> Message-ID: <382AD0FE.B604876A@lemburg.com> Tim Peters wrote: > > [/F, dripping with code] > > ... > > Note that the 'u' must be followed by four hexadecimal digits. If > > fewer digits are given, the sequence is left in the resulting string > > exactly as given. > > Yuck -- don't let probable error pass without comment. "must be" == "must > be"! I second that. > [moving backwards] > > \uxxxx -- Unicode character with hexadecimal value xxxx. The > > character is stored using UTF-8 encoding, which means that this > > sequence can result in up to three encoded characters. > > The code is fine, but I've gotten confused about what the intent is now. > Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 > literals, but now he's got Unicode-escaped literals instead -- and you favor > an internal 2-byte-per-char Unicode storage format. In that combination of > worlds, is there any use in the *language* (as opposed to in a runtime > module) for \uxxxx -> UTF-8 conversion? No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates). > And MAL, if you're listening, I'm not clear on what a Unicode-escaped > literal means. When you had UTF-8 literals, the meaning of something like > > u"a\340\341" > > was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals > were just a way of specifying a byte stream. As a Unicode-escaped string, I > assume the "a" maps to the Unicode "a", but what of the rest? Are the octal > escapes to be taken as two separate Latin-1 characters (in their role as a > Unicode subset), or as an especially clumsy way to specify a single 16-bit > Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x > escapes. Good points. The conversion goes as follows: · for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as Unicode ordinal · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead > One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? > There probably should be; and while Guido will hate this, a ur string should > probably *not* leave \uxxxx escapes untouched. Nasties like this are why > Java defines \uxxxx expansion as occurring in a preprocessing step. Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-) > BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or > isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. 
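A small sketch of the conversion rules as just stated (purely illustrative; the helper name is made up and the real work would happen in the compiler). It also honours Tim's point that anything other than exactly four hex digits after \u should be an error rather than passing silently:

    def literal_ordinals(s):
        """Return the Unicode ordinals produced by the body of a
        Unicode-escaped literal, after normal \XXX/\xXX processing."""
        ordinals = []
        i = 0
        while i < len(s):
            if s[i:i+2] == '\\u':
                digits = s[i+2:i+6]
                if len(digits) != 4:
                    raise ValueError('\\u not followed by 4 hex digits')
                ordinals.append(int(digits, 16))   # \uXXXX -> ordinal 0xXXXX
                i = i + 6
            else:
                ordinals.append(ord(s[i]))         # any other char keeps its ordinal
                i = i + 1
        return ordinals

    # literal_ordinals('a\340\\u20ac')  ->  [97, 224, 8364]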
BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard: u = unicode('...string with UTF-8 encoded characters...','utf-8') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 14:23:45 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:23:45 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000601bf2c11$e4b07920$262d153f@tim> Message-ID: <382AD171.D22A1D6E@lemburg.com> Tim Peters wrote: > > [MAL, on Unicode chr() and ord() > > ... > > Because unichr() will always have to return Unicode objects. You don't > > want chr(i) to return Unicode for i>255 and strings for i<256. > > Indeed I do not! > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > I think should be -- it's a good & natural use of polymorphism; introducing > a new function *here* would be as odd as introducing a unilen() function to > get the length of a Unicode string. Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 14:36:41 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:36:41 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000901bf2c16$8a107420$262d153f@tim> Message-ID: <382AD479.5261B43B@lemburg.com> Tim Peters wrote: > > [Mark Hammond] > > Sure - that is what this customer wants, but we need to be clear about > > the "best thing" for Python generally versus what this particular > > client wants. > > ... > > Having a fixed, default encoding may make life slightly more difficult > > when you want to work primarily in a different encoding, but at least > > your system is predictable and reliable. > > Well said, Mark! Me too. It's like HP is suffering from Windows envy > <wink>. See my other post on the subject... Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings repesented in Unicode. I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write file.write(u.encode('utf-8')) instead of file.write(u) # let's hope this goes out as UTF-8... Using the <default encoding> as site dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 14:26:59 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:26:59 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000801bf2c16$43f9a4c0$262d153f@tim> Message-ID: <382AD233.BE6DE888@lemburg.com> Tim Peters wrote: > > [/F] > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? 
> > [MAL] > > No, but people are already thinking about it and there is > > a defined range in the >16-bit area for private encodings > > (F0000..FFFFD and 100000..10FFFD). > > Over the decades I've developed a rule of thumb that has never wound up > stuck in my ass <wink>: If I engineer code that I expect to be in use for N > years, I make damn sure that every internal limit is at least 10x larger > than the largest I can conceive of a user making reasonable use of at the > end of those N years. The invariable result is that the N years pass, and > fewer than half of the users have bumped into the limit <0.5 wink>. > > At the risk of offending everyone, I'll suggest that, qualitatively > speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just > replaced "256 characters?! We'll *never* run out of those!" with 64K. But > when Asian languages consume them 7K at a pop, 64K isn't even in my 10x > comfort range for some individual languages. In just a few months, Unicode > 3 will already have used up > 56K of the 64K slots. > > As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x > zone, for about a decade. If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M character at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 14:47:49 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:47:49 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us> Message-ID: <382AD715.66DBA125@lemburg.com> Guido van Rossum wrote: > > > Let me tell you why you would want to have an encoding > > which can be set: > > > > (1) sday I am on a Japanese Windows box, I have a > > string called 'address' and I do 'print address'. If > > I see utf8, I see garbage. If I see Shift-JIS, I see > > the correct Japanese address. At this point in time, > > utf8 is an interchange format but 99% of the world's > > data is in various native encodings. > > > > Analogous problems occur on input. > > > > (2) I'm using htmlgen, which 'prints' objects to > > standard output. My web site is supposed to be > > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > > etc.) Yes, browsers CAN detect and display UTF8 but > > you just don't find UTF8 sites in the real world - and > > most users just don't know about the encoding menu, > > and will get pissed off if they have to reach for it. > > > > Ditto for streaming output in some protocol. > > > > Java solves this (and we could too by hacking stdout) > > using Writer classes which are created as wrappers > > around an output stream and can take an encoding, but > > you lose the flexibility to 'just print'. > > > > I think being able to change encoding would be useful. > > What I do not want is to auto-detect it from the > > operating system when Python boots - that would be a > > portability nightmare. > > You almost convinced me there, but I think this can still be done > without changing the default encoding: simply reopen stdout with a > different encoding. This is how Java does it. 
I/O streams with an > encoding specified at open() are a very powerful feature. You can > hide this in your $PYTHONSTARTUP. True and it probably covers all cases where setting the default encoding to something other than UTF-8 makes sense. I guess you've convinced me there ;-) The current proposal has wrappers around stream for this purpose: For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. The above can be done using: import sys,unicodec sys.stdin = unicodec.stream(sys.stdin,'jis') sys.stdout = unicodec.stream(sys.stdout,'jis') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack@oratrix.nl Thu Nov 11 15:58:39 1999 From: jack@oratrix.nl (Jack Jansen) Date: Thu, 11 Nov 1999 16:58:39 +0100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com> Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> > > [MAL, on Unicode chr() and ord() > > > ... > > > Because unichr() will always have to return Unicode objects. You don't > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > OTOH, ord() could probably be extended to also work on Unicode objects. > Fine. So I'll drop the uniord() API and extend ord() instead. Hmm, then wouldn't it be more logical to drop unichr() too, but add an optional parameter to chr() to specify what sort of a string you want? The type-object of a unicode string comes to mind... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Thu Nov 11 16:04:29 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us> >>>>> "M" == M <mal@lemburg.com> writes: M> Doesn't Python convert class exceptions to strings when -X is M> used ? I would guess that many scripts already rely on the M> class based mechanism (much of my stuff does for sure), so by M> the time 1.6 is out, I think -X should be considered an option M> to run pre 1.5 code rather than using it for performance M> reasons. This is a little off-topic so I'll be brief. When using -X Python never even creates the class exceptions, so it isn't really a conversion. It just uses string exceptions and tries to craft tuples for what would be the superclasses in the class-based exception hierarchy. 
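The tuple trick Barry mentions works roughly like this (an illustrative sketch only, not the actual fallback code used with -X):

    # With -X, exceptions are plain strings; a "superclass" is emulated
    # by a tuple grouping the related strings, since except clauses
    # accept tuples of exceptions.
    IndexError = 'IndexError'
    KeyError = 'KeyError'
    LookupError = (IndexError, KeyError)

    try:
        raise KeyError, "no such key"
    except LookupError:          # catches either member of the tuple
        pass
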
Yes, class-based exceptions are a bit of a performance hit when you are catching exceptions in Python (because they need to be instantiated), but they're just so darn *useful*. I wouldn't mind seeing the -X option go away for 1.6. -Barry From andy@robanal.demon.co.uk Thu Nov 11 16:08:15 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> > See my other post on the subject... > > Note that if we make UTF-8 the standard encoding, > nearly all > special Latin-1 characters will produce UTF-8 errors > on input > and unreadable garbage on output. That will probably > be unacceptable > in Europe. To remedy this, one would *always* have > to use > u.encode('latin-1') to get readable output for > Latin-1 strings > repesented in Unicode. You beat me to it - a colleague and I were just discussing this verbally. Specifically we Brits will get annoyed as soon as we read in a text file with pound (sterling) signs. We concluded that the only reasonable default (if you have one at all) is pure ASCII. At least that way I will get a clear and intelligible warning when I load in such a file, and will remember to specify ISO-Latin-1. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal@lemburg.com Thu Nov 11 15:59:21 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 16:59:21 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <382AE7D9.147D58CB@lemburg.com> I wonder how we could add %-formatting to Unicode strings without duplicating the PyString_Format() logic. First, do we need Unicode object %-formatting at all ? Second, here is an emulation using strings and <default encoding> that should give an idea of one could work with the different encodings: s = '%s %i abcäöü' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string via Unicode s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) Note that .encode() defaults to the current setting of <default encoding>. Provided u maps to Latin-1, an alternative would be: u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 17:04:37 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:04:37 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> Message-ID: <382AF725.FC66C9B6@lemburg.com> Jack Jansen wrote: > > > > [MAL, on Unicode chr() and ord() > > > > ... > > > > Because unichr() will always have to return Unicode objects. You don't > > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > > Fine. So I'll drop the uniord() API and extend ord() instead. 
> > Hmm, then wouldn't it be more logical to drop unichr() too, but add an > optional parameter to chr() to specify what sort of a string you want? The > type-object of a unicode string comes to mind... Like: import types uc = chr(12,types.UnicodeType) ... looks overly complicated, IMHO. uc = unichr(12) and u = unicode('abc') look pretty intuitive to me. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 11 17:31:34 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:31:34 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> Message-ID: <382AFD76.A0D3FEC4@lemburg.com> Andy Robinson wrote: > > > See my other post on the subject... > > > > Note that if we make UTF-8 the standard encoding, > > nearly all > > special Latin-1 characters will produce UTF-8 errors > > on input > > and unreadable garbage on output. That will probably > > be unacceptable > > in Europe. To remedy this, one would *always* have > > to use > > u.encode('latin-1') to get readable output for > > Latin-1 strings > > repesented in Unicode. > > You beat me to it - a colleague and I were just > discussing this verbally. Specifically we Brits will > get annoyed as soon as we read in a text file with > pound (sterling) signs. > > We concluded that the only reasonable default (if you > have one at all) is pure ASCII. At least that way I > will get a clear and intelligible warning when I load > in such a file, and will remember to specify > ISO-Latin-1. Well, Guido's post made me rethink the approach... 1. Setting <default encoding> to any non UTF encoding will result in data lossage due to the encoding limits imposed by the other formats -- this is dangerous and will result in errors (some of which may not even be noticed due to the interpreter ignoring them) in case your strings use non encodable characters. 2. You basically only want to set <default encoding> to anything other than UTF-8 for stream input and output. This can be done using the unicodec stream wrapper without too much inconvenience.
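For illustration, a bare-bones pure-Python version of such a wrapper could look like the following (the unicodec module itself does not exist yet, and the real stream() API may well differ; this only uses the unicode() constructor and .encode() method from the proposal):

    class UnicodeStream:
        # Illustrative sketch only: wraps a byte stream and converts
        # to/from Unicode using one fixed encoding.
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def read(self, size=-1):
            # bytes in, Unicode out
            return unicode(self.stream.read(size), self.encoding)
        def write(self, u):
            # Unicode in, bytes out
            self.stream.write(u.encode(self.encoding))
        def close(self):
            self.stream.close()

Rebinding sys.stdout to such a wrapper, e.g. in $PYTHONSTARTUP, gives the "just print" convenience discussed earlier without touching any interpreter-wide default.
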
(We'll have to extend the wrapper a little, though, because it currently only accept Unicode objects for writing and always return Unicode object when reading.) 3. We should leave the issue open until some code is there to be tested... I have a feeling that there will be quite a few strange effects when APIs expecting strings are fed with Unicode objects returning UTF-8. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Fri Nov 12 01:10:09 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 12:10:09 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382ABE34.5D27C701@lemburg.com> Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat> > Mark Hammond wrote: > > Having a fixed, default encoding may make life slightly > more difficult > > when you want to work primarily in a different encoding, > but at least > > your system is predictable and reliable. > > I think the discussion on this is getting a little too hot. Really - I see it as moving to a rational consensus that doesnt support the proposal in this regard. I see no heat in it at all. Im sorry if you saw my post or any of the followups as "emotional", but I certainly not getting passionate about this. I dont see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because a we are trying to get it right doesnt mean we are getting heated. > The point > is simply that the option of changing the per-thread default encoding > is there. You are not required to use it and if you do you are on > your own when something breaks. Hrm - Im having serious trouble following your logic here. If make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I wont necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP: > Think of it as a HP specific feature... perhaps I should wrap the code > in #ifdefs and leave it undocumented. That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely dissapointed if the fact that HP are throwing money for this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this. Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference. Maybe Im missing something. Can you explain why it is good enough everyone else to be required to assume there is no default encoding, but HP get their thread specific global? Are their requirements greater than anyone elses? Is everyone else not as important? What would you, as a consultant, recommend to people who arent HP, but have a similar requirement? It would seem obvious to me that HPs requirement can be met in "pure Python", thereby keeping this out of the core all together... Mark. 
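One possible reading of that "pure Python" route, sketched with an entirely hypothetical helper module (nothing like it is actually being proposed here), is to keep the changeable default in application space rather than in the interpreter:

    # sitedefault.py -- hypothetical application-level helper, not part
    # of any proposal; the name and API are made up for illustration.
    default_encoding = 'utf-8'        # set once by the application or site

    def to_string(u):
        # Unicode -> byte string using the application's chosen encoding
        return u.encode(default_encoding)

    def to_unicode(s):
        # byte string -> Unicode using the application's chosen encoding
        return unicode(s, default_encoding)

Code that wants the convenience imports the helper; the interpreter core never grows a mutable global, which is the distinction Mark is drawing.
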
From gmcm@hypernet.com Fri Nov 12 02:01:23 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Thu, 11 Nov 1999 21:01:23 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> References: <382ABE34.5D27C701@lemburg.com> Message-ID: <1269750417-7621469@hypernet.com> [per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon From tim_one@email.msn.com Fri Nov 12 05:04:44 1999 From: tim_one@email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:04:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim> [MAL] >>> Codecs should raise an UnicodeError in case the conversion is >>> not possible. [Fred L. Drake, Jr.] >> I think that should be ValueError, or UnicodeError should be a >> subclass of ValueError. >> (Can the -X interpreter option be removed yet?) [MAL] > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. -X is a red herring. That is, do what seems best without regard for -X. I already added one subclass exception to the CVS tree (UnboundLocalError as a subclass of NameError), and in doing that had to figure out how to make it do the right thing under -X too. It's a bit clumsy to arrange, but not a problem. From tim_one@email.msn.com Fri Nov 12 05:18:09 1999 From: tim_one@email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:18:09 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382AD0FE.B604876A@lemburg.com> Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> [MAL] > ... > The conversion goes as follows: > · for single characters (and this includes all \XXX sequences > except \uXXXX), take the ordinal and interpret it as Unicode > ordinal for \uXXXX sequences, insert the Unicode character > with ordinal 0xXXXX instead Perfect! [about "raw" Unicode strings] > ... > Not sure whether we really need to make this even more complicated... > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > filenames won't hurt much in the context of those \uXXXX monsters :-) Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventurally become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. 
So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever. > ... > BTW, if you want to type in UTF-8 strings and have them converted > to Unicode, you can use the standard: > > u = unicode('...string with UTF-8 encoded characters...','utf-8') That's what I figured, and thanks for the confirmation. From tim_one@email.msn.com Fri Nov 12 05:42:32 1999 From: tim_one@email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:42:32 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD233.BE6DE888@lemburg.com> Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> [MAL] > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > signal failure of this assertion at Unicode object construction time > via an exception. That way we are within the standard, can use > reasonably fast code for Unicode manipulation and add those extra 1M > character at a later stage. I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better. From tim_one@email.msn.com Fri Nov 12 06:02:01 1999 From: tim_one@email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:02:01 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD479.5261B43B@lemburg.com> Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim> [MAL] > Note that if we make UTF-8 the standard encoding, nearly all > special Latin-1 characters will produce UTF-8 errors on input > and unreadable garbage on output. That will probably be unacceptable > in Europe. To remedy this, one would *always* have to use > u.encode('latin-1') to get readable output for Latin-1 strings > repesented in Unicode. I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. > I'd rather see this happen the other way around: *always* explicitly > state the encoding you want in case you rely on it, e.g. write > > file.write(u.encode('utf-8')) > > instead of > > file.write(u) # let's hope this goes out as UTF-8... By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1 > Using the <default encoding> as site dependent setting is useful > for convenience in those cases where the output format should be > readable rather than parseable. Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. 
The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either or good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core. From tim_one@email.msn.com Fri Nov 12 06:17:08 1999 From: tim_one@email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:17:08 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim> [Mark Hammond] > ... > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. I can resolve this easily, but only with input from Guido. Guido, did HP's check clear yet? If so, we can ignore them <wink>. From andy@robanal.demon.co.uk Fri Nov 12 08:15:19 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> --- Gordon McMillan <gmcm@hypernet.com> wrote: > [per-thread defaults] > > C'mon guys, hasn't anyone ever played consultant > before? The > idea is obviously brain-dead. OTOH, they asked for > it > specifically, meaning they have some assumptions > about how > they think they're going to use it. If you give them > what they > ask for, you'll only have to fix it when they > realize there are > other ways of doing things that don't work with > per-thread > defaults. So, you find out why they think it's a > good thing; you > make it easy for them to code this way (without > actually using > per-thread defaults) and you don't make a fuss about > it. More > than likely, they won't either. > I wrote directly to ask them exactly this last night. Let's forget the per-thread thing until we get an answer. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal@lemburg.com Fri Nov 12 09:27:29 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:27:29 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> Message-ID: <382BDD81.458D3125@lemburg.com> Tim Peters wrote: > > [MAL] > > ... > > The conversion goes as follows: > > · for single characters (and this includes all \XXX sequences > > except \uXXXX), take the ordinal and interpret it as Unicode > > ordinal for \uXXXX sequences, insert the Unicode character > > with ordinal 0xXXXX instead > > Perfect! Thanks :-) > [about "raw" Unicode strings] > > ... > > Not sure whether we really need to make this even more complicated... > > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > > filenames won't hurt much in the context of those \uXXXX monsters :-) > > Alas, this won't stand over the long term. 
Eventually people will write > Python using nothing but Unicode strings -- "regular strings" will > eventurally become a backward compatibility headache <0.7 wink>. IOW, > Unicode regexps and Unicode docstrings and Unicode formatting ops ... > nothing will escape. Nor should it. > > I don't think it all needs to be done at once, though -- existing languages > usually take years to graft in gimmicks to cover all the fine points. So, > happy to let raw Unicode strings pass for now, as a relatively minor point, > but without agreeing it can be ignored forever. Agreed... note that you could also write your own codec for just this reason and then use: u = unicode('....\u1234...\...\...','raw-unicode-escaped') Put that into a function called 'ur' and you have: u = ur('...\u4545...\...\...') which is not that far away from ur'...' w/r to cosmetics. > > ... > > BTW, if you want to type in UTF-8 strings and have them converted > > to Unicode, you can use the standard: > > > > u = unicode('...string with UTF-8 encoded characters...','utf-8') > > That's what I figured, and thanks for the confirmation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 09:00:47 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:00:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> Message-ID: <382BD73E.E6729C79@lemburg.com> Andy Robinson wrote: > > --- Gordon McMillan <gmcm@hypernet.com> wrote: > > [per-thread defaults] > > > > C'mon guys, hasn't anyone ever played consultant > > before? The > > idea is obviously brain-dead. OTOH, they asked for > > it > > specifically, meaning they have some assumptions > > about how > > they think they're going to use it. If you give them > > what they > > ask for, you'll only have to fix it when they > > realize there are > > other ways of doing things that don't work with > > per-thread > > defaults. So, you find out why they think it's a > > good thing; you > > make it easy for them to code this way (without > > actually using > > per-thread defaults) and you don't make a fuss about > > it. More > > than likely, they won't either. > > > > I wrote directly to ask them exactly this last night. > Let's forget the per-thread thing until we get an > answer. That's the way to go, Andy. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 09:44:14 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:44:14 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <382BE16E.D17C80E1@lemburg.com> Mark Hammond wrote: > > > Mark Hammond wrote: > > > Having a fixed, default encoding may make life slightly > > more difficult > > > when you want to work primarily in a different encoding, > > but at least > > > your system is predictable and reliable. > > > > I think the discussion on this is getting a little too hot. > > Really - I see it as moving to a rational consensus that doesnt > support the proposal in this regard. I see no heat in it at all. Im > sorry if you saw my post or any of the followups as "emotional", but I > certainly not getting passionate about this. 
I dont see any of this > as affecting me personally. I believe that I can replace my Unicode > implementation with this either way we go. Just because a we are > trying to get it right doesnt mean we are getting heated. Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one. > > The point > > is simply that the option of changing the per-thread default > encoding > > is there. You are not required to use it and if you do you are on > > your own when something breaks. > > Hrm - Im having serious trouble following your logic here. If make > _any_ assumptions about a default encoding, I am in danger of > breaking. I may not choose to change the default, but as soon as > _anyone_ does, unrelated code may break. > > I agree that I will be "on my own", but I wont necessarily have been > the one that changed it :-( Sure there are some very subtile dangers in setting the default to anything other than the default ;-) For some this risk may be worthwhile taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them. > The only answer I can see is, as you suggest, to ignore the fact that > there is _any_ default. Always specify the encoding. But obviously > this is not good enough for HP: > > > Think of it as a HP specific feature... perhaps I should wrap the > code > > in #ifdefs and leave it undocumented. > > That would work - just ensure that no standard Python has those > #ifdefs turned on :-) I would be sorely dissapointed if the fact that > HP are throwing money for this means they get every whim implemented > in the core language. Imagine the outcry if it were instead MS' > money, and you were attempting to put an MS spin on all this. > > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. > > Maybe Im missing something. Can you explain why it is good enough > everyone else to be required to assume there is no default encoding, > but HP get their thread specific global? Are their requirements > greater than anyone elses? Is everyone else not as important? What > would you, as a consultant, recommend to people who arent HP, but have > a similar requirement? It would seem obvious to me that HPs > requirement can be met in "pure Python", thereby keeping this out of > the core all together... Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 09:04:31 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 10:04:31 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001001bf2cd3$6fa57820$fd2d153f@tim> Message-ID: <382BD81F.B2BC896A@lemburg.com> Tim Peters wrote: > > [MAL] > > Note that if we make UTF-8 the standard encoding, nearly all > > special Latin-1 characters will produce UTF-8 errors on input > > and unreadable garbage on output. That will probably be unacceptable > > in Europe. To remedy this, one would *always* have to use > > u.encode('latin-1') to get readable output for Latin-1 strings > > repesented in Unicode. > > I think it's time for the Europeans to pronounce on what's acceptable in > Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy > with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. Agreed. > > I'd rather see this happen the other way around: *always* explicitly > > state the encoding you want in case you rely on it, e.g. write > > > > file.write(u.encode('utf-8')) > > > > instead of > > > > file.write(u) # let's hope this goes out as UTF-8... > > By the same argument, those pesky Europeans who are relying on Latin-1 > should write > > file.write(u.encode('latin-1')) > > instead of > > file.write(u) # let's hope this goes out as Latin-1 Right. > > Using the <default encoding> as site dependent setting is useful > > for convenience in those cases where the output format should be > > readable rather than parseable. > > Well, "convenience" is always the argument advanced in favor of modes. > Conflicts and nasty intermittent bugs are always the result. The latter > will happen under Guido's idea too, as various careless modules rebind stdin > & stdout to their own ideas of what "the proper" encoding should be. But at > least the blame doesn't fall on the core language then <0.3 wink>. > > Since there doesn't appear to be anything (either or good or bad) you can do > (or avoid) by using Guido's scheme instead of magical core thread state, > there's no *need* for the latter. That is, it can be done with a user-level > API without involving the core. Dito :-) I have nothing against telling people to take care about the problem in user space (meaning: not done by the core interpreter) and I'm pretty sure that HP will agree on this too, provided we give them the proper user space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 09:16:57 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:16:57 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <382BDB09.55583F28@lemburg.com> Tim Peters wrote: > > [MAL] > > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > > signal failure of this assertion at Unicode object construction time > > via an exception. That way we are within the standard, can use > > reasonably fast code for Unicode manipulation and add those extra 1M > > character at a later stage. > > I think this is reasonable. > > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness, that deserves a closer look (it's an ingenious > encoding scheme that works correctly with a surprising number of existing > 8-bit string routines as-is). 
Indexing UTF-8 strings is greatly speeded by > adding a simple finger (i.e., store along with the string an index+offset > pair identifying the most recent position indexed to -- since string > indexing is overwhelmingly sequential, this makes most indexing > constant-time; and UTF-8 can be scanned either forward or backward from a > random internal point because "the first byte" of each encoding is > recognizable as such). Here are some arguments for using the proposed UTF-16 strategy instead: · all characters have the same length; indexing is fast · conversion APIs to platform dependent wchar_t implementation are fast because they either can simply copy the content or extend the 2-bytes to 4 byte · UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages · from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16." Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer. > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein@lyra.org Fri Nov 12 10:20:16 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > <IMO> > Since this is the first time a Python Consortium member is > pushing development, I think we can learn a lot here. For one, > it should be clear that money doesn't buy everything, OTOH, > we cannot put the whole thing at risk just because > of some minor disagreement that cannot be solved between the > parties. The standard solution for the latter should be a > customized Python interpreter. > </IMO> hehe... funny you mention this. Go read the Consortium docs. Last time that I read them, there are no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 10:24:56 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120221010.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. 
Lemburg wrote: > Sure there are some very subtile dangers in setting the default > to anything other than the default ;-) For some this risk may > be worthwhile taking, for others not. In fact, in large projects > I would never take such a risk... I'm sure we can get this > message across to them. It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 10:30:04 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Fri Nov 12 10:30:28 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 11:30:28 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> Message-ID: <382BEC44.A2541C7E@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > <IMO> > > Since this is the first time a Python Consortium member is > > pushing development, I think we can learn a lot here. For one, > > it should be clear that money doesn't buy everything, OTOH, > > we cannot put the whole thing at risk just because > > of some minor disagreement that cannot be solved between the > > parties. The standard solution for the latter should be a > > customized Python interpreter. > > </IMO> > > hehe... funny you mention this. Go read the Consortium docs. Last time > that I read them, there are no "parties" to reach consensus. *Every* > technical decision regarding the Python language falls to the Technical > Director (Guido, of course). I looked. I found nothing that can override > the T.D.'s decisions and no way to force a particular decision. > > Guido is still the Benevolent Dictator :-) Sure, but have you considered the option of a member simply bailing out ? 
HP could always stop funding Unicode integration. That wouldn't help us either... > Cheers, > -g > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > title does have the word Benevolent in it, so things are cool... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein@lyra.org Fri Nov 12 10:39:45 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BEC44.A2541C7E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: >... > Sure, but have you considered the option of a member simply bailing > out ? HP could always stop funding Unicode integration. That wouldn't > help us either... I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen. >... > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > > title does have the word Benevolent in it, so things are cool... Cheers, -g -- Greg Stein, http://www.lyra.org/ From Mike.Da.Silva@uk.fid-intl.com Fri Nov 12 11:00:49 1999 From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:00:49 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Most of the ASCII string functions do indeed work for UTF-8. I have made extensive use of this feature when writing translation logic to harmonize ASCII text (an SQL statement) with substitution parameters that must be converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is a superset of ASCII, this all works fine. Some of the character classification functions etc can be flaky when used with UTF8 characters outside the ASCII range, but simple string operations work fine. As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an internal string representation are: 1. UTF-8 allows all characters to be displayed (in some form or other) on the users machine, with or without native fonts installed. Naturally anything outside the ASCII range will be garbage, but it is an immense debugging aid when working with character encodings to be able to touch and feel something recognizable. Trying to decode a block of raw UTF-16 is a pain. 2. UTF-8 works with most existing string manipulation libraries quite happily. It is also portable (a char is always 8 bits, regardless of platform; wchar_t varies between 16 and 32 bits depending on the underlying operating system (although unsigned short does seems to work across platforms, in my experience). 3. UTF-16 has some advantages in providing fixed width characters and, (ignoring surrogate pairs etc) a modeless encoding space. This is an advantage for fast string operations, especially on CPU's that have efficient operations for handling 16bit data. 4. UTF-16 would directly support a tightly coupled character properties engine, which would enable Unicode compliant case folding and character decomposition to be performed without an intermediate UTF-8 <----> UTF-16 translation step. 5. 
UTF-16 requires string operations that do not make assumptions about nulls - this means re-implementing most of the C runtime functions to work with unsigned shorts. Regards, Mike da Silva -----Original Message----- From: Greg Stein [SMTP:gstein@lyra.org] Sent: 12 November 1999 10:30 To: Tim Peters Cc: python-dev@python.org Subject: RE: [Python-Dev] Internationalization Toolkit On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ _______________________________________________ Python-Dev maillist - Python-Dev@python.org http://www.python.org/mailman/listinfo/python-dev From fredrik@pythonware.com Fri Nov 12 11:23:24 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:24 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> > Besides, the Unicode object will have a buffer containing the > <default encoding> representation of the object, which, if all goes > well, will always hold the UTF-8 value. <rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...) > RE engines etc. can then directly work with this buffer. sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu). > > I expect either would work well. It's at least curious that Perl and Tcl > > both went with UTF-8 -- does anyone think they know *why*? I don't. The > > people here saying UCS-2 is the obviously better choice are all from the > > Microsoft camp <wink>. (hey, I'm not a microsofter. 
but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F> From fredrik@pythonware.com Fri Nov 12 11:23:38 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:38 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com> > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ (and you can replace "unsigned short" with "whatever's suitable on this platform") </F> From fredrik@pythonware.com Fri Nov 12 11:36:03 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:36:03 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com> > Guido is a fair and reasonable Dictator... he wouldn't let that > happen. ...but where is he when we need him? ;-) </F> From Mike.Da.Silva@uk.fid-intl.com Fri Nov 12 11:43:21 1999 From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:43:21 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Fredrik Lundh wrote: > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ <http://www.pythonware.com/madscientist/> (and you can replace "unsigned short" with "whatever's suitable on this platform") Surely using a different type on different platforms means that we throw away the concept of a platform independent Unicode string? I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. Does this mean that to transfer a file between a Windows box and Solaris, an implicit conversion has to be done to go from 16 bits to 32 bits (and vice versa)? What about byte ordering issues? Or do you mean whatever 16 bit data type is available on the platform, with a standard (platform independent) byte ordering maintained? Mike da S From fredrik@pythonware.com Fri Nov 12 12:16:24 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 13:16:24 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Mike wrote: > Surely using a different type on different platforms means that we throw > away the concept of a platform independent Unicode string? > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. so? the interchange format doesn't have to be the same as the internal format, does it? > Does this mean that to transfer a file between a Windows box and Solaris, an > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > versa)? What about byte ordering issues? 
no problem at all: unicode has special byte order marks for this purpose (and utf-8 doesn't care, of course). > Or do you mean whatever 16 bit data type is available on the platform, with > a standard (platform independent) byte ordering maintained? well, my preference is a 16-bit data type in the plat- form's native byte order (exactly how it's done in the unicode module -- for the moment, it can use the platform's wchar_t, but only if it happens to be a 16-bit unsigned type). gives you good performance, compact storage, and cleanest possible code. ... anyway, I think it would help the discussion a little bit if people looked at (and played with) the existing code base. at least that'll change arguments like "but then we have to implement that" to "but then we have to maintain that code" ;-) </F> From andy@robanal.demon.co.uk Fri Nov 12 12:13:03 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112121303.27452.rocketmail@ web605.yahoomail.com> --- "Da Silva, Mike" <Mike.Da.Silva@uk.fid-intl.com> wrote: > As I see it, the relative pros and cons of UTF-8 > versus UTF-16 for use as an > internal string representation are: > [snip] > Regards, > Mike da Silva > Note that by going with UTF16, we get both. We will certainly have a codec for utf8, just as we will for ISO-Latin-1, Shift-JIS or whatever. And a perfectly ordinary Python string is a great place to hold UTF8; you can look at it and use most of the ordinary string algorithms on it. I presume no one is actually advocating dropping ordinary Python strings, or the ability to do rawdata = open('myfile.txt', 'rb').read() without any transformations? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mhammond@skippinet.com.au Fri Nov 12 12:27:19 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 23:27:19 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat> /F writes > anyway, I think it would help the discussion a little bit > if people looked at (and played with) the existing code > base. at least that'll change arguments like "but then > we have to implement that" to "but then we have to > maintain that code" ;-) I second that. It is good enough for me (although my requirements arent stringent) - its been used on CE, so would slot directly into the win32 stuff. It is pretty much the consensus of the string-sig of last year, but as code! The only "problem" with it is the code that hasnt been written yet, specifically: * Encoders as streams, and a concrete proposal for them. * Decent PyArg_ParseTuple support and Py_BuildValue support. * The ord(), chr() stuff, and other stuff around the edges no doubt. Couldnt we start with Fredriks implementation, and see how the rest turns out? Even if we do choose to change the underlying Unicode implementation to use a different native encoding, the interface to the PyUnicode_Type would remain pretty similar. The advantage is that we have something now to start working with for the rest of the support we need. Mark. From mal@lemburg.com Fri Nov 12 12:38:44 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 13:38:44 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.4 Message-ID: <382C0A54.E6E8328D@lemburg.com> I've uploaded a new version of the proposal which incorporates a lot of what has been discussed on the list. Thanks to everybody who helped so far. Note that I have extended the list of references for those who want to join in, but are in need of more background information. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: · support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ ) · support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings. · support for numbers, digits, whitespace, etc. · support (or no support) for private code point areas · should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>: s = '%s %i abcäöü' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) · specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 13:11:26 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 14:11:26 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> Message-ID: <382C11FE.D7D9F916@lemburg.com> Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > <default encoding> representation of the object, which, if all goes > > well, will always hold the UTF-8 value. > > <rant> > > over my dead body, that one... Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 13:01:28 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 14:01:28 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <382C0FA8.ACB6CCD6@lemburg.com> Fredrik Lundh wrote: > > Mike wrote: > > Surely using a different type on different platforms means that we throw > > away the concept of a platform independent Unicode string? > > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. > > so? the interchange format doesn't have to be > the same as the internal format, does it? The interchange format (marshal + pickle) is defined as UTF-8, so there's no problem with endianness or missing bits w/r to shipping Unicode data from one platform to another. > > Does this mean that to transfer a file between a Windows box and Solaris, an > > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > > versa)? What about byte ordering issues? > > no problem at all: unicode has special byte order > marks for this purpose (and utf-8 doesn't care, of > course). Access to this mark will go into sys: sys.bom. > > Or do you mean whatever 16 bit data type is available on the platform, with > > a standard (platform independent) byte ordering maintained? > > well, my preference is a 16-bit data type in the plat- > form's native byte order (exactly how it's done in the > unicode module -- for the moment, it can use the > platform's wchar_t, but only if it happens to be a > 16-bit unsigned type). gives you good performance, > compact storage, and cleanest possible code. The 0.4 proposal fixes this to 16-bit unsigned short using UTF-16 encoding with checks for surrogates. This covers all defined standard Unicode character points, is fast, etc. pp... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 11:15:15 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 12:15:15 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <382BF6C3.D79840EC@lemburg.com> "Da Silva, Mike" wrote: > > Most of the ASCII string functions do indeed work for UTF-8. I have made > extensive use of this feature when writing translation logic to harmonize > ASCII text (an SQL statement) with substitution parameters that must be > converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is > a superset of ASCII, this all works fine. > > Some of the character classification functions etc can be flaky when used > with UTF8 characters outside the ASCII range, but simple string operations > work fine. That's why there's the <defencbuf> buffer which holds the UTF-8 encoded value... > As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an > internal string representation are: > > 1. UTF-8 allows all characters to be displayed (in some form or other) > on the users machine, with or without native fonts installed. Naturally > anything outside the ASCII range will be garbage, but it is an immense > debugging aid when working with character encodings to be able to touch and > feel something recognizable. Trying to decode a block of raw UTF-16 is a > pain. True. > 2. UTF-8 works with most existing string manipulation libraries quite > happily. 
It is also portable (a char is always 8 bits, regardless of > platform; wchar_t varies between 16 and 32 bits depending on the underlying > operating system (although unsigned short does seems to work across > platforms, in my experience). You mean with the compiler applying the needed 16->32 bit extension ? > 3. UTF-16 has some advantages in providing fixed width characters and, > (ignoring surrogate pairs etc) a modeless encoding space. This is an > advantage for fast string operations, especially on CPU's that have > efficient operations for handling 16bit data. Right and this is major argument for using 16 bit encodings without state internally. > 4. UTF-16 would directly support a tightly coupled character properties > engine, which would enable Unicode compliant case folding and character > decomposition to be performed without an intermediate UTF-8 <----> UTF-16 > translation step. Could you elaborate on this one ? It is one of the open issues in the proposal. > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. AFAIK, the RE engines in Python are 8-bit clean... BTW, wouldn't it be possible to take pcre and have it use Py_Unicode instead of char ? [Of course, there would have to be some extensions for character classes etc.] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@pythonware.com Fri Nov 12 13:43:12 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 14:43:12 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... why? I don't understand why "s" and "s#" has to deal with encoding issues at all... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. hmm. maybe that's exactly what I want... </F> From fdrake@acm.org Fri Nov 12 14:34:56 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. 
Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there it's a "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fdrake@acm.org Fri Nov 12 14:57:15 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Access to this mark will go into sys: sys.bom. Can the name in sys be a little more descriptive? sys.byte_order_mark would be reasonable. I think that a support module (possibly unicodec) should provide constants for all four byte order marks as strings (2- & 4-byte, little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, etc. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fredrik@pythonware.com Fri Nov 12 15:00:45 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 16:00:45 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com> Fred L. Drake, Jr. <fdrake@acm.org> wrote: > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings. from unicode import * def getname(): # hidden in some database engine, or so... return unicode("Linköping", "iso-8859-1") ... name = getname() # emulate automatic conversion to utf-8 name = str(name) # print it in uppercase, in the usual way import string print string.upper(name) ## LINKöPING I don't know, but I think that I think that it perhaps should raise an exception instead... </F> From mal@lemburg.com Fri Nov 12 15:17:43 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:17:43 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> Message-ID: <382C2F97.8E7D7A4D@lemburg.com> Fredrik Lundh wrote: > > > > > Besides, the Unicode object will have a buffer containing the > > > > <default encoding> representation of the object, which, if all goes > > > > well, will always hold the UTF-8 value. 
> > > > > > <rant> > > > > > > over my dead body, that one... > > > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > why? I don't understand why "s" and "s#" has > to deal with encoding issues at all... > > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > hmm. maybe that's exactly what I want... If we don't add that support, lot's of existing APIs won't accept Unicode object instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 15:50:33 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:50:33 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> Message-ID: <382C3749.198EEBC6@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Access to this mark will go into sys: sys.bom. > > Can the name in sys be a little more descriptive? > sys.byte_order_mark would be reasonable. The abbreviation BOM is quite common w/r to Unicode. > I think that a support module (possibly unicodec) should provide > constants for all four byte order marks as strings (2- & 4-byte, > little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, > etc. Good idea... sys.bom should return the byte order mark (BOM) for the format used internally. The unicodec module should provide symbols for all possible values of this variable: BOM_BE: '\376\377' (corresponds to Unicode 0x0000FEFF in UTF-16 == ZERO WIDTH NO-BREAK SPACE) BOM_LE: '\377\376' (corresponds to Unicode 0x0000FFFE in UTF-16 == illegal Unicode character) BOM4_BE: '\000\000\377\376' (corresponds to Unicode 0x0000FEFF in UCS-4) BOM4_LE: '\376\377\000\000' (corresponds to Unicode 0x0000FFFE in UCS-4) Note that Unicode sees big endian byte order as being "correct". The swapped order is taken to be an indicator for a "wrong" format, hence the illegal character definition. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 12 15:24:33 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:24:33 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <382C3131.A8965CA5@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. 
It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings. Was this agreed upon, or has it simply > not been argued over in favor of other topics? It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing script Unicode aware. > If this has indeed been agreed upon... at least it can be computed > on demand rather than at initialization! This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s"). > Perhaps there should be two > pointers: one to the UTF-8 buffer and one to a PyObject; if the > PyObject is there it's a "old-style" string that's actually providing > the buffer. This may or may not be a good idea; there's a lot of > memory expense for long Unicode strings converted from UTF-8 that > aren't ever converted back to UTF-8 or accessed using "s" or "s#". > Ok, I've talked myself out of that. ;-) Note that Unicode object are completely different beast ;-) String object are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Fri Nov 12 16:22:24 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The abbreviation BOM is quite common w/r to Unicode. Yes: "w/r to Unicode". In sys, it's out of context and should receive a more descriptive name. I think using BOM in unicodec is good. > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) I'd also add BOM to be the same as sys.byte_order_mark. Perhaps even instead of sys.byte_order_mark (just to localize the areas of code that are affected). > Note that Unicode sees big endian byte order as being "correct". The A lot of us do. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fdrake@acm.org Fri Nov 12 16:28:37 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <382C3131.A8965CA5@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > It's been in the proposal since version 0.1. The idea is to > provide a decent way of making existing script Unicode aware. 
Ok, so I haven't read closely enough. > This is what I intended to implement. The <defencbuf> buffer > will be filled upon the first request to the UTF-8 encoding. > "s" and "s#" are examples of such requests. The buffer will > remain intact until the object is destroyed (since other code > could store the pointer received via e.g. "s"). Right. > Note that Unicode object are completely different beast ;-) > String object are not touched in any way by the proposal. I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider: s = fp.read() u = unicode(s, 'utf-8') u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From jack@oratrix.nl Fri Nov 12 16:33:46 1999 From: jack@oratrix.nl (Jack Jansen) Date: Fri, 12 Nov 1999 17:33:46 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com> Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl> The problem with "s" and "s#" is that they're already semantically overloaded, and will become more so with support for multiple charsets. Some modules use "s#" when they mean "give me a pointer to an area of memory and its length". Writing to binary files is an example of this. Some modules use it to mean "give me a pointer to a string". Writing to a text file is (probably) an example of this. Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This is the case if we're going to actually look at the contents (think of string.upper() and such). I think that the only real solution is to define what "s" means, come up with new getarg-formats for the other two use cases and convert all modules to use the new standard. It'll still cause grief to extension modules that aren't part of the core, but at least the problem will go away after a while. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal@lemburg.com Fri Nov 12 18:36:55 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:36:55 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <382C5E47.21FB4DD@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > It's been in the proposal since version 0.1. The idea is to > > provide a decent way of making existing script Unicode aware. > > Ok, so I haven't read closely enough. > > > This is what I intended to implement. The <defencbuf> buffer > > will be filled upon the first request to the UTF-8 encoding. > > "s" and "s#" are examples of such requests. The buffer will > > remain intact until the object is destroyed (since other code > > could store the pointer received via e.g. "s"). > > Right. 
> > > Note that Unicode object are completely different beast ;-) > > String object are not touched in any way by the proposal. > > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein@lyra.org Fri Nov 12 22:19:15 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121417530.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Mark Hammond wrote: > Couldnt we start with Fredriks implementation, and see how the rest > turns out? Even if we do choose to change the underlying Unicode > implementation to use a different native encoding, the interface to > the PyUnicode_Type would remain pretty similar. The advantage is that > we have something now to start working with for the rest of the > support we need. I agree with "start with" here, and will go one step further (which Mark may have implied) -- *check in* Fredrik's code. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 22:59:03 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Bull! You can easily support "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxilliary buffer. You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 23:05:11 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl> Message-ID: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> This was done last year!! We have "s#" meaning "give me some bytes." We have "t#" meaning "give me some 8-bit characters." 
The Python distribution has been completely updated to use the appropriate format in each call. The was done *specifically* to support the introduction of a Unicode type. The intent was that "s#" returns the *raw* bytes of the Unicode string -- NOT a UTF-8 encoding! As a separate argument, MAL can argue that "t#" should create an internal, associated buffer to hold a UTF-8 encoding and then return that. But the "s#" should return the raw bytes! [ and I'll argue against the response to "t#" anyhow... ] -g On Fri, 12 Nov 1999, Jack Jansen wrote: > The problem with "s" and "s#" is that they're already semantically > overloaded, and will become more so with support for multiple charsets. > > Some modules use "s#" when they mean "give me a pointer to an area of memory > and its length". Writing to binary files is an example of this. > > Some modules use it to mean "give me a pointer to a string". Writing to a text > file is (probably) an example of this. > > Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This > is the case if we're going to actually look at the contents (think of > string.upper() and such). > > I think that the only real solution is to define what "s" means, come up with > new getarg-formats for the other two use cases and convert all modules to use > the new standard. It'll still cause grief to extension modules that aren't > part of the core, but at least the problem will go away after a while. > -- > Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ > Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ > www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm > > > > _______________________________________________ > Python-Dev maillist - Python-Dev@python.org > http://www.python.org/mailman/listinfo/python-dev > -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 23:09:13 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: >... > > why? I don't understand why "s" and "s#" has > > to deal with encoding issues at all... > > > > > unless, of course, you want to give up Unicode object support > > > for all APIs using these parsers. > > > > hmm. maybe that's exactly what I want... > > If we don't add that support, lot's of existing APIs won't > accept Unicode object instead of strings. While it could be > argued that automatic conversion to UTF-8 is not transparent > enough for the user, the other solution of using str(u) > everywhere would probably make writing Unicode-aware code a > rather clumsy task and introduce other pitfalls, since str(obj) > calls PyObject_Str() which also works on integers, floats, > etc. No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." 
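[Editor's note: a hedged sketch of the distinction Greg describes -- "s#" meaning "give me some bytes" and "t#" meaning "give me 8-bit characters" -- as it would look in an extension function. The function names and bodies are invented for illustration; only the format codes come from the thread.]

    #include "Python.h"

    /* "s#": raw bytes, e.g. GIF data headed for a binary file;
       the length is a byte count. */
    static PyObject *
    example_write_bytes(PyObject *self, PyObject *args)
    {
        char *data;
        int len;

        if (!PyArg_ParseTuple(args, "s#", &data, &len))
            return NULL;
        /* ... hand the len raw bytes to fwrite(), a socket, etc. ... */
        Py_INCREF(Py_None);
        return Py_None;
    }

    /* "t#": 8-bit characters -- the caller intends to look at the
       contents as text (case-map it, search it, print it). */
    static PyObject *
    example_upper_text(PyObject *self, PyObject *args)
    {
        char *text;
        int len;

        if (!PyArg_ParseTuple(args, "t#", &text, &len))
            return NULL;
        /* ... treat text[0..len-1] as characters, not opaque bytes ... */
        Py_INCREF(Py_None);
        return Py_None;
    }
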
-g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 12 23:26:08 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. True. > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. I agree and believe that we can avoid putting it into sys altogether. > > BOM_BE: '\376\377' > > (corresponds to Unicode 0x0000FEFF in UTF-16 > > == ZERO WIDTH NO-BREAK SPACE) Are you sure about that interpretation? I thought the BOM characters (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. > I'd also add BOM to be the same as sys.byte_order_mark. Perhaps > even instead of sys.byte_order_mark (just to localize the areas of > code that are affected). ### unicodec.py ### import struct BOM = struct.pack('h', 0x0000FEFF) BOM_BE = '\376\377' ... If somebody needs the BOM, then they should go to unicodec.py (or some other module). I do not believe we need to put that stuff into the sys module. It is just too easy to create the value in Python. Cheers, -g p.s. to be pedantic, the pack() format could be '@h' -- Greg Stein, http://www.lyra.org/ From mhammond@skippinet.com.au Fri Nov 12 23:41:16 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 10:41:16 +1100 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat> [Greg writes] > As a separate argument, MAL can argue that "t#" should create > an internal, > associated buffer to hold a UTF-8 encoding and then return > that. But the > "s#" should return the raw bytes! > [ and I'll argue against the response to "t#" anyhow... ] Hmm. Climbing over these dead bodies could get a bit smelly :-) Im inclined to agree that holding 2 internal buffers for the unicode object is not ideal. However, I _am_ concerned with getting decent PyArg_ParseTuple and Py_BuildValue support, and if the cost is an extra buffer I will survive. So lets look for solutions that dont require it, rather than holding it up as evil when no other solution is obvious. My requirements appear to me to be very simple (for an anglophile): Lets say I have a platform Unicode value - eg, I got a Unicode value from some external library (say COM :-) Lets assume for now that the Unicode string is fully representable as ASCII - say a file or directory name that COM gave me. I simply want to be able to pass this Unicode object to "open()", and have it work. This assumes that open() will not become "native unicode", simply as the underlying C support is not unicode aware - it needs to be converted to a "char *" (ie, will use the "t#" format) The second side of the equation is when I expose a Python function that talks Unicode - eg, I need to _pass_ a platform Unicode value to an external library. The Python programmer should be able to pass a Unicode object (no problem), or a PyString object. In code terms: Prob1: name = SomeComObject.GetFileName() # A Unicode object f = open(name) Prob2: SomeComObject.SetFileName("foo.txt") IMO it is important that we have a good strategy for dealing with this for extensions. 
MAL addresses one direction, but not the other. Maybe if we toss around general solutions for this the implementation will fall out. MALs idea of the additional buffer starts to address this, but isnt the whole story. Any ideas on this? From gstein@lyra.org Sat Nov 13 00:49:34 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST) Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Mark Hammond wrote: >... > Im inclined to agree that holding 2 internal buffers for the unicode > object is not ideal. However, I _am_ concerned with getting decent > PyArg_ParseTuple and Py_BuildValue support, and if the cost is an > extra buffer I will survive. So lets look for solutions that dont > require it, rather than holding it up as evil when no other solution > is obvious. I believe Py_BuildValue is pretty straight-forward. Simply state that it is allowed to perform conversions and place the resulting object into the resulting tuple. (with appropriate refcounting) In other words: tuple = Py_BuildValue("U", stringOb); The stringOb will be converted to a Unicode object. The new Unicode object will go into the tuple (with the tuple holding the only reference!). The stringOb will NOT acquire any additional references. [ "U" format may be wrong; it is here for example purposes ] Okay... now the PyArg_ParseTuple() is the *real* kicker. >... > Prob1: > name = SomeComObject.GetFileName() # A Unicode object > f = open(name) > Prob2: > SomeComObject.SetFileName("foo.txt") Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a string-like object which can be passed to the OS as an 8-bit string. In Prob2, you want a string-like object which can be passed to the OS as a Unicode string. I see three options for PyArg_ParseTuple: 1) allow it to return NEW objects which must be DECREF'd. [ current policy only loans out references ] This option could be difficult in the presence of errors during the parse. For example, the current idiom is: if (!PyArg_ParseTuple(args, "...")) return NULL; If an object was produced, but then a later argument cause a failure, then who is responsible for freeing the object? 2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new objects when an error occurred. This basically answers the last question in option (1) -- ParseTuple is responsible. 3) Return loaned-out-references to objects which have been tested for convertability. Helper functions perform the conversion and the caller will then free the reference. [ this is the model used in PyWin32 ] Code in PyWin32 typically looks like: if (!PyArg_ParseTuple(args, "O", &ob)) return NULL; if ((unicodeOb = GiveMeUnicode(ob)) == NULL) return NULL; ... Py_DECREF(unicodeOb); [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ] In a "real" situation, the ParseTuple format would be "U" and the object would be type-tested for PyStringType or PyUnicodeType. Note that GiveMeUnicode() would also do a type-test, but it can't produce a *specific* error like ParseTuple (e.g. "string/unicode object expected" vs "parameter 3 must be a string/unicode object") Are there more options? Anybody? All three of these avoid the secondary buffer. The last is cleanest w.r.t. to keeping the existing "loaned references" behavior, but can get a bit wordy when you need to convert a bunch of string arguments. 
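[Editor's note: for concreteness, a hedged expansion of Greg's option (3) idiom above. GiveMeUnicode is his placeholder name, not a real API; the point is only that ParseTuple keeps loaning references while a helper hands back a new reference that the caller must release.]

    #include "Python.h"

    /* Sketch of option (3): ParseTuple loans out the raw object, a
       helper performs the string/Unicode conversion, and the caller
       owns -- and eventually releases -- the converted object. */
    static PyObject *
    example_set_file_name(PyObject *self, PyObject *args)
    {
        PyObject *ob, *uni;

        if (!PyArg_ParseTuple(args, "O", &ob))   /* loaned reference */
            return NULL;
        if ((uni = GiveMeUnicode(ob)) == NULL)   /* new reference; the */
            return NULL;                         /* helper set the error */

        /* ... pass the Unicode data on to the OS call ... */

        Py_DECREF(uni);                          /* caller cleans up */
        Py_INCREF(Py_None);
        return Py_None;
    }
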
Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it would need to keep a "free list" in case an error occurred. Option (1) adds DECREF logic to callers to ensure they clean up. The add'l logic isn't much more than the other two options (the only change is adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..." condition). Note that the caller would probably need to initialize each object to NULL before calling ParseTuple. Personally, I prefer (3) as it makes it very clear that a new object has been created and must be DECREF'd at some point. Also note that GiveMeUnicode() could also accept a second argument for the type of decoding to do (or NULL meaning "UTF-8"). Oh: note there are equivalents of all options for going from unicode-to-string; the above is all about string-to-unicode. However, the tricky part of unicode-to-string is determining whether backwards compatibility will be a requirement. i.e. does existing code that uses the "t" format suddenly achieve the capability to accept a Unicode object? This obviously causes problems in all three options: since a new reference must be created to handle the situation, then who DECREF's it? The old code certainly doesn't. [ <IMO> I'm with Fredrik in saying "no, old code *doesn't* suddenly get the ability to accept a Unicode object." The Python code must use str() to do the encoding manually (until the old code is upgraded to one of the above three options). </IMO> ] I think that's it for me. In the several years I've been thinking on this problem, I haven't come up with anything but the above three. There may be a whole new paradigm for argument parsing, but I haven't tried to think on that one (and just fit in around ParseTuple). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Fri Nov 12 18:49:52 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:49:52 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <382C6150.53BDC803@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. > > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. Guido proposed to add it to sys. I originally had it defined in unicodec. Perhaps a sys.endian would be more appropriate for sys with values 'little' and 'big' or '<' and '>' to be conform to the struct module. unicodec could then define unicodec.bom depending on the setting in sys. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Nov 13 09:37:35 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 10:37:35 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> Message-ID: <382D315F.A7ADEC42@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > Fredrik Lundh wrote: > >... > > > why? I don't understand why "s" and "s#" has > > > to deal with encoding issues at all... 
> > > > > > > unless, of course, you want to give up Unicode object support > > > > for all APIs using these parsers. > > > > > > hmm. maybe that's exactly what I want... > > > > If we don't add that support, lot's of existing APIs won't > > accept Unicode object instead of strings. While it could be > > argued that automatic conversion to UTF-8 is not transparent > > enough for the user, the other solution of using str(u) > > everywhere would probably make writing Unicode-aware code a > > rather clumsy task and introduce other pitfalls, since str(obj) > > calls PyObject_Str() which also works on integers, floats, > > etc. > > No no no... > > "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are > supposed to return the raw bytes. [I've waited quite some time for you to chime in on this one ;-)] Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer: First, we have a general design question here: should old code become Unicode compatible or not. As I recall the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come with a resonable approach for integrating classical string based APIs with the new type. Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose. However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using a 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text. So if we are to do the integration, we'll have to choose argument parser markers that allow for multi byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is argueable. Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would loose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes. Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is. Defining "s#" to return UTF-8 data does not only make "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16). If you are still with, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). The other options of adding new magic parser markers or switching to more generic one all have one downside: you need to change existing code which is in conflict with the idea we started out with. So, again, the question is: do we want this magical intergration or not ? 
Note that this is a design question, not one of memory consumption... -- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though: PyArg_ParseTuple(args,"s#",&utf8,&len); If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach: PyArg_ParseTuple(args,"O",&obj); unicode = PyUnicode_FromObject(obj); ... Py_DECREF(unicode); Here PyUnicode_FromObject() will return a new reference if obj is an Unicode object or create a new Unicode object by interpreting str(obj) as UTF-8 encoded string. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@CNRI.Reston.VA.US Sat Nov 13 12:12:41 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 07:12:41 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST." <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > to deal with :-) I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein@lyra.org Sat Nov 13 12:23:54 1999 From: gstein@lyra.org (Greg Stein) Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911130423400.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Guido van Rossum wrote: > > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > > to deal with :-) > > I haven't made up my mind yet (due to a very successful > Python-promoting visit to SD'99 east, I'm about 100 msgs behind in > this thread alone) but let me warn you that I can deal with the > carnage, if necessary. :-) Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/ From mhammond@skippinet.com.au Sat Nov 13 12:52:18 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 23:52:18 +1100 Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat> [Lamenting about PyArg_ParseTuple and managing memory buffers for String/Unicode conversions.] So what is really wrong with Marc's proposal about the extra pointer on the Unicode object? And to double the carnage, who not add the equivilent native Unicode buffer to the PyString object? These would only ever be filled when requested by the conversion routines. They have no other effect than their memory is managed by the object itself; simply a convenience to avoid having extension modules manage the conversion buffers. 
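[Editor's note: a hedged sketch of the lazily filled conversion buffer Mark is describing here; the layout and names are invented, and the overheads he lists next apply to exactly this extra field.]

    #include "Python.h"

    /* Assumed helper, not shown: encodes 16-bit data to a malloc'd
       UTF-8 buffer. */
    static char *example_encode_utf8(const unsigned short *s, int len);

    /* Invented layout for illustration only: the object carries an
       extra pointer that stays NULL until someone asks for the
       8-bit form. */
    typedef struct {
        PyObject_HEAD
        unsigned short *str;   /* native 16-bit Unicode data */
        int length;            /* length in 16-bit units */
        char *utf8buf;         /* lazily filled conversion buffer */
    } ExampleUnicodeObject;

    static const char *
    example_as_utf8(ExampleUnicodeObject *self)
    {
        if (self->utf8buf == NULL)
            self->utf8buf = example_encode_utf8(self->str, self->length);
        /* The buffer is owned by the object and freed in its dealloc,
           so extension modules never manage the conversion memory
           themselves. */
        return self->utf8buf;
    }
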
The only overheads appear to be: * The conversion buffers may be slightly (or much :-) longer-lived - ie, they are not freed until the object itself is freed. * String object slightly bigger, and slightly slower to destroy. It appears to solve the problems, and the cost doesnt seem too high... Mark. From guido@CNRI.Reston.VA.US Sat Nov 13 13:06:26 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 08:06:26 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100." <382D315F.A7ADEC42@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us> I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL] > Let me summarize a bit on the general ideas behind "s", "s#" > and the extra buffer: I think you left out t#. > First, we have a general design question here: should old code > become Unicode compatible or not. As I recall the original idea > about Unicode integration was to follow Perl's idea to have > scripts become Unicode aware by simply adding a 'use utf8;'. I've never heard of this idea before -- or am I taking it too literal? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide. > If this is still the case, then we'll have to come with a > resonable approach for integrating classical string based > APIs with the new type. > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > the Latin-1 folks) which has some very nice features (see > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > this encoding seems best fit for the purpose. Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week, hopefully when I see the details, it'll be clear that don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.) > However, one should not forget that UTF-8 is in fact a > variable length encoding of Unicode characters, that is up to > 3 bytes form a *single* character. This is obviously not compatible > with definitions that explicitly state data to be using a > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > work like it does in Latin-1 text. Sure, but where in current Python are there such requirements? > So if we are to do the integration, we'll have to choose > argument parser markers that allow for multi byte characters. > "t#" does not fall into this category, "s#" certainly does, > "s" is argueable. I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). 
Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#. Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes. Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation. Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best): - utf-8 - latin-1 - ascii - shift-jis - lower byte of unicode ordinal - some user- or os-specified multibyte encoding As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8. If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.) > Also note that we have to watch out for embedded NULL bytes. > UTF-16 has NULL bytes for every character from the Latin-1 > domain. If "s" were to give back a pointer to the internal > buffer which is encoded in UTF-16, you would loose data. > UTF-8 doesn't have this problem, since only NULL bytes > map to (single) NULL bytes. This is a red herring given my explanation above. > Now Greg would chime in with the buffer interface and > argue that it should make the underlying internal > format accessible. This is a bad idea, IMHO, since you > shouldn't really have to know what the internal data format > is. This is for C code. Quite likely it *does* know what the internal data format is! > Defining "s#" to return UTF-8 data does not only > make "s" and "s#" return the same data format (which should > always be the case, IMO), That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely. > but also hides the internal > format from the user and gives him a reliable cross-platform > data representation of Unicode data (note that UTF-8 doesn't > have the byte order problems of UTF-16). > > If you are still with, let's look at what "s" and "s#" (and t#, which is more relevant here) > do: they return pointers into data areas which have to > be kept alive until the corresponding object dies. > > The only way to support this feature is by allocating > a buffer for just this purpose (on the fly and only if > needed to prevent excessive memory load). 
The other > options of adding new magic parser markers or switching > to more generic one all have one downside: you need to > change existing code which is in conflict with the idea > we started out with. Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding. > So, again, the question is: do we want this magical > intergration or not ? Note that this is a design question, > not one of memory consumption... Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( ) > -- > > Ok, the above covered Unicode -> String conversion. Mark > mentioned that he wanted the other way around to also > work in the same fashion, ie. automatic String -> Unicode > conversion. > > This could also be done in the same way by > interpreting the string as UTF-8 encoded Unicode... but we > have the same problem: where to put the data without > generating new intermediate objects. Since only newly > written code will use this feature there is a way to do > this though: > > PyArg_ParseTuple(args,"s#",&utf8,&len); No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue. > If your C API understands UTF-8 there's nothing more to do, > if not, take Greg's option 3 approach: > > PyArg_ParseTuple(args,"O",&obj); > unicode = PyUnicode_FromObject(obj); > ... > Py_DECREF(unicode); > > Here PyUnicode_FromObject() will return a new > reference if obj is an Unicode object or create a new > Unicode object by interpreting str(obj) as UTF-8 encoded string. This might work. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Sat Nov 13 13:06:35 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 14:06:35 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.5 References: <382C0A54.E6E8328D@lemburg.com> Message-ID: <382D625B.DC14DBDE@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates proposals for line breaks, case mapping, character properties and private code points support. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: · should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>: s = '%s %i abcäöü' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) · specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. 
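As a reference point, the "option 3" coercion pattern mentioned earlier in this thread could be spelled out roughly as follows (a sketch only: the function name is made up, and PyUnicode_FromObject() is assumed to behave as described in the proposal):

static PyObject *
example_takes_text(PyObject *self, PyObject *args)
{
    PyObject *obj, *unicode;

    if (!PyArg_ParseTuple(args, "O", &obj))
        return NULL;

    /* New reference: either the Unicode object itself, or a Unicode
       object created by decoding str(obj) with the default encoding. */
    unicode = PyUnicode_FromObject(obj);
    if (unicode == NULL)
        return NULL;

    /* ... work with the data through the PyUnicode_* API ... */

    Py_DECREF(unicode);
    Py_INCREF(Py_None);
    return Py_None;
}

If obj is already a Unicode object this is just a new reference; otherwise the string data is decoded via the default encoding, so an API written this way accepts both string and Unicode arguments without any new parser markers.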
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack@oratrix.nl Sat Nov 13 16:40:34 1999 From: jack@oratrix.nl (Jack Jansen) Date: Sat, 13 Nov 1999 17:40:34 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by Greg Stein <gstein@lyra.org> , Fri, 12 Nov 1999 15:05:11 -0800 (PST) , <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Recently, Greg Stein <gstein@lyra.org> said: > This was done last year!! We have "s#" meaning "give me some bytes." We > have "t#" meaning "give me some 8-bit characters." The Python distribution > has been completely updated to use the appropriate format in each call. Oops... I remember the discussion but I wasn't aware that somone had actually _implemented_ this:-). Part of my misunderstanding was also caused by the fact that I inspected what I thought would be the prime candidate for t#: file.write() to a non-binary file, and it doesn't use the new format. I also noted a few inconsistencies at first glance, by the way: most modules seem to use s# for things like filenames and other data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an exception and it uses t# for uuencoded strings... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From guido@CNRI.Reston.VA.US Sat Nov 13 19:20:51 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 14:20:51 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100." <19991113164039.9B697EA11A@oratrix.oratrix.nl> References: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us> > I remember the discussion but I wasn't aware that somone had actually > _implemented_ this:-). Part of my misunderstanding was also caused by > the fact that I inspected what I thought would be the prime candidate > for t#: file.write() to a non-binary file, and it doesn't use the new > format. I guess that's because file.write() doesn't distinguish between text and binary files. Maybe it should: the current implementation together with my proposed semantics for Unicode strings would mean that printing a unicode string (to stdout) would dump the internal encoding to the file. I guess it should do so only when the file is opened in binary mode; for files opened in text mode it should use an encoding (opening a file can specify an encoding; can we change the encoding of an existing file?). > I also noted a few inconsistencies at first glance, by the way: most > modules seem to use s# for things like filenames and other > data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an > exception and it uses t# for uuencoded strings... Actually, binascii seems to do it right: s# for binary data, t# for text (uuencoded, hqx, base64). That is, the b2a variants use s# while the a2b variants use t#. The only thing I'm not sure about in that module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I don't understand where these stand in the complexity of binhex en/decoding. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Sun Nov 14 22:11:54 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sun, 14 Nov 1999 23:11:54 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> Message-ID: <382F33AA.C3EE825A@lemburg.com> Guido van Rossum wrote: > > I think I have a reasonable grasp of the issues here, even though I > still haven't read about 100 msgs in this thread. Note that t# and > the charbuffer addition to the buffer API were added by Greg Stein > with my support; I'll attempt to reconstruct our thinking at the > time... > > [MAL] > > Let me summarize a bit on the general ideas behind "s", "s#" > > and the extra buffer: > > I think you left out t#. On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined. > > First, we have a general design question here: should old code > > become Unicode compatible or not. As I recall the original idea > > about Unicode integration was to follow Perl's idea to have > > scripts become Unicode aware by simply adding a 'use utf8;'. > > I've never heard of this idea before -- or am I taking it too literal? > It smells of a mode to me :-) I'd rather live in a world where > Unicode just works as long as you use u'...' literals or whatever > convention we decide. > > > If this is still the case, then we'll have to come with a > > resonable approach for integrating classical string based > > APIs with the new type. > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > the Latin-1 folks) which has some very nice features (see > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > this encoding seems best fit for the purpose. > > Yes, especially if we fix the default encoding as UTF-8. (I'm > expecting feedback from HP on this next week, hopefully when I see the > details, it'll be clear that don't need a per-thread default encoding > to solve their problems; that's quite a likely outcome. If not, we > have a real-world argument for allowing a variable default encoding, > without carnage.) Fair enough :-) > > However, one should not forget that UTF-8 is in fact a > > variable length encoding of Unicode characters, that is up to > > 3 bytes form a *single* character. This is obviously not compatible > > with definitions that explicitly state data to be using a > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > work like it does in Latin-1 text. > > Sure, but where in current Python are there such requirements? It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at... > > So if we are to do the integration, we'll have to choose > > argument parser markers that allow for multi byte characters. > > "t#" does not fall into this category, "s#" certainly does, > > "s" is argueable. > > I disagree. I grepped through the source for s# and t#. Here's a bit > of background. 
Before t# was introduced, s# was being used for two > distinct purposes: (1) to get an 8-bit text string plus its length, in > situations where the length was needed; (2) to get binary data (e.g. > GIF data read from a file in "rb" mode). Greg pointed out that if we > ever introduced some form of Unicode support, these two had to be > disambiguated. We found that the majority of uses was for (2)! > Therefore we decided to change the definition of s# to mean only (2), > and introduced t# to mean (1). Also, we introduced getcharbuffer > corresponding to t#, while getreadbuffer was meant for s#. I know its too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementations point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something... > Note that the definition of the 's' format was left alone -- as > before, it means you need an 8-bit text string not containing null > bytes. This definition should then be changed to "text string without null bytes" dropping the 8-bit reference. > Our expectation was that a Unicode string passed to an s# situation > would give a pointer to the internal format plus a byte count (not a > character count!) while t# would get a pointer to some kind of 8-bit > translation/encoding plus a byte count, with the explicit requirement > that the 8-bit translation would have the same lifetime as the > original unicode object. We decided to leave it up to the next > generation (i.e., Marc-Andre :-) to decide what kind of translation to > use and what to do when there is no reasonable translation. Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal). > Any of the following choices is acceptable (from the point of view of > not breaking the intended t# semantics; we can now start deciding > which we like best): I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons. > - utf-8 > - latin-1 > - ascii > - shift-jis > - lower byte of unicode ordinal > - some user- or os-specified multibyte encoding > > As far as t# is concerned, for encodings that don't encode all of > Unicode, untranslatable characters could be dealt with in any number > of ways (raise an exception, ignore, replace with '?', make best > effort, etc.). The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.) > Given the current context, it should probably be the same as the > default encoding -- i.e., utf-8. If we end up making the default > user-settable, we'll have to decide what to do with untranslatable > characters -- but that will probably be decided by the user too (it > would be a property of a specific translation specification). 
> > In any case, I feel that t# could receive a multi-byte encoding, > s# should receive raw binary data, and they should correspond to > getcharbuffer and getreadbuffer, respectively. Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below). > > Now Greg would chime in with the buffer interface and > > argue that it should make the underlying internal > > format accessible. This is a bad idea, IMHO, since you > > shouldn't really have to know what the internal data format > > is. > > This is for C code. Quite likely it *does* know what the internal > data format is! C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length. > > Defining "s#" to return UTF-8 data does not only > > make "s" and "s#" return the same data format (which should > > always be the case, IMO), > > That was before t# was introduced. No more, alas. If you replace s# > with t#, I agree with you completely. Done :-) > > but also hides the internal > > format from the user and gives him a reliable cross-platform > > data representation of Unicode data (note that UTF-8 doesn't > > have the byte order problems of UTF-16). > > > > If you are still with, let's look at what "s" and "s#" > > (and t#, which is more relevant here) > > > do: they return pointers into data areas which have to > > be kept alive until the corresponding object dies. > > > > The only way to support this feature is by allocating > > a buffer for just this purpose (on the fly and only if > > needed to prevent excessive memory load). The other > > options of adding new magic parser markers or switching > > to more generic one all have one downside: you need to > > change existing code which is in conflict with the idea > > we started out with. > > Agreed. I think this was our thinking when Greg & I introduced t#. > My own preference would be to allocate a whole string object, not > just a buffer; this could then also be used for the .encode() method > using the default encoding. Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request. > > So, again, the question is: do we want this magical > > intergration or not ? Note that this is a design question, > > not one of memory consumption... > > Yes, I want it. > > Note that this doesn't guarantee that all old extensions will work > flawlessly when passed Unicode objects; but I think that it covers > most cases where you could have a reasonable expectation that it > works. > > (Hm, unfortunately many reasonable expectations seem to involve > the current user's preferred encoding. :-( ) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin@mems-exchange.org Mon Nov 15 01:49:08 1999 From: akuchlin@mems-exchange.org (A.M. 
Kuchling) Date: Sun, 14 Nov 1999 20:49:08 -0500 Subject: [Python-Dev] PyErr_Format security note Message-ID: <199911150149.UAA00408@mira.erols.com> I noticed this in PyErr_Format(exception, format, va_alist): char buffer[500]; /* Caller is responsible for limiting the format */ ... vsprintf(buffer, format, vargs); Making the caller responsible for this is error-prone. The danger, of course, is a buffer overflow caused by generating an error string that's larger than the buffer, possibly letting people execute arbitrary code. We could add a test to the configure script for vsnprintf() and use it when possible, but that only fixes the problem on platforms which have it. Can we find an implementation of vsnprintf() someplace? -- A.M. Kuchling http://starship.python.net/crew/amk/ One form to rule them all, one form to find them, one form to bring them all and in the darkness rewrite the hell out of them. -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3 From gstein@lyra.org Mon Nov 15 02:11:39 1999 From: gstein@lyra.org (Greg Stein) Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911150149.UAA00408@mira.erols.com> Message-ID: <Pine.LNX.4.10.9911141807390.2535-100000@nebula.lyra.org> On Sun, 14 Nov 1999, A.M. Kuchling wrote: > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Apache has a safe implementation (they have reviewed the heck out of it for obvious reasons :-). In the Apache source distribution, it is located in src/ap/ap_snprintf.c. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Mon Nov 15 08:09:07 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 09:09:07 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <382FBFA3.B28B8E1E@lemburg.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? 
In sysmodule.c, this check is done which should be safe enough since no "return" is issued (Py_FatalError() does an abort()): if (vsprintf(buffer, format, va) >= sizeof(buffer)) Py_FatalError("PySys_WriteStdout/err: buffer overrun"); -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein@lyra.org Mon Nov 15 09:28:06 1999 From: gstein@lyra.org (Greg Stein) Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: >... > In sysmodule.c, this check is done which should be safe enough > since no "return" is issued (Py_FatalError() does an abort()): > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); I believe the return from vsprintf() itself would be the problem. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Mon Nov 15 09:49:26 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 10:49:26 +0100 Subject: [Python-Dev] PyErr_Format security note References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> Message-ID: <382FD726.6ACB912F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > >... > > In sysmodule.c, this check is done which should be safe enough > > since no "return" is issued (Py_FatalError() does an abort()): > > > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); > > I believe the return from vsprintf() itself would be the problem. Ouch, yes, you are right... but who could exploit this security hole ? Since PyErr_Format() is only reachable for C code, only bad programming style in extensions could make it exploitable via user input. Wouldn't it be possible to assign thread globals for these functions to use ? These would live on the heap instead of on the stack and eliminate the buffer overrun possibilities (I guess -- I don't have any experience with these...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin@mems-exchange.org Mon Nov 15 15:17:58 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FD726.6ACB912F@lemburg.com> References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> <382FD726.6ACB912F@lemburg.com> Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us> M.-A. Lemburg writes: >Ouch, yes, you are right... but who could exploit this security >hole ? Since PyErr_Format() is only reachable for C code, only >bad programming style in extensions could make it exploitable >via user input. 99% of security holes arise out of carelessness, and besides, this buffer size doesn't seem to be documented in either api.tex or ext.tex. I'll look into borrowing Apache's implementation and modifying it into a varargs form. -- A.M. Kuchling http://starship.python.net/crew/amk/ I can also withstand considerably more G-force than most people, even though I do say so myself. 
-- The Doctor, in "The Ambassadors of Death" From guido@CNRI.Reston.VA.US Mon Nov 15 15:23:57 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:23:57 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST." <199911150149.UAA00408@mira.erols.com> References: <199911150149.UAA00408@mira.erols.com> Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us> > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. Agreed. The limit of 500 chars, while technically undocumented, is part of the specs for PyErr_Format (which is currently wholly undocumented). The current callers all have explicit precautions, but of course I agree that this is a potential danger. > The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Assuming that Linux and Solaris have vsnprintf(), can't we just use the configure script to detect it, and issue a warning blaming the platform for those platforms that don't have it? That seems much simpler (from a maintenance perspective) than carrying our own implementation around (even if we can borrow the Apache version). --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Mon Nov 15 15:24:27 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C6150.53BDC803@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> <382C6150.53BDC803@lemburg.com> Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Guido proposed to add it to sys. I originally had it defined in > unicodec. Well, he clearly didn't ask me! ;-) > Perhaps a sys.endian would be more appropriate for sys > with values 'little' and 'big' or '<' and '>' to be conform > to the struct module. > > unicodec could then define unicodec.bom depending on the setting > in sys. This seems more reasonable, though I'd go with BOM instead of bom. But that's a style issue, so not so important. If your write bom, I'll write bom. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From andy@robanal.demon.co.uk Mon Nov 15 15:30:45 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Some thoughts on the codecs... 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. 
This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution? Thanks, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From akuchlin@mems-exchange.org Mon Nov 15 15:36:47 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us> Guido van Rossum writes: >Assuming that Linux and Solaris have vsnprintf(), can't we just use >the configure script to detect it, and issue a warning blaming the >platform for those platforms that don't have it? 
That seems much But people using an already-installed Python binary won't see any such configure-time warning, and won't find out about the potential problem. Plus, how do people fix the problem on platforms that don't have vsnprintf() -- switch to Solaris or Linux? Not much of a solution. (vsnprintf() isn't ANSI C, though it's a common extension, so platforms that lack it aren't really deficient.) Hmm... could we maybe use Python's existing (string % vars) machinery? <think think> No, that seems to be hard, because it would want PyObjects, and we can't know what Python types to convert the varargs to, unless we parse the format string (at which point we may as well get a vsnprintf() implementation. -- A.M. Kuchling http://starship.python.net/crew/amk/ A successful tool is one that was used to do something undreamed of by its author. -- S.C. Johnson From guido@CNRI.Reston.VA.US Mon Nov 15 15:50:24 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:50:24 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100." <382F33AA.C3EE825A@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us> > On purpose -- according to my thinking. I see "t#" as an interface > to bf_getcharbuf which I understand as 8-bit character buffer... > UTF-8 is a multi byte encoding. It still is character data, but > not necessarily 8 bits in length (up to 24 bits are used). > > Anyway, I'm not really interested in having an argument about > this. If you say, "t#" fits the purpose, then that's fine with > me. Still, we should clearly define that "t#" returns > text data and "s#" binary data. Encoding, bit length, etc. should > explicitly remain left undefined. Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view. > > > First, we have a general design question here: should old code > > > become Unicode compatible or not. As I recall the original idea > > > about Unicode integration was to follow Perl's idea to have > > > scripts become Unicode aware by simply adding a 'use utf8;'. > > > > I've never heard of this idea before -- or am I taking it too literal? > > It smells of a mode to me :-) I'd rather live in a world where > > Unicode just works as long as you use u'...' literals or whatever > > convention we decide. > > > > > If this is still the case, then we'll have to come with a > > > resonable approach for integrating classical string based > > > APIs with the new type. > > > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > > the Latin-1 folks) which has some very nice features (see > > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > > this encoding seems best fit for the purpose. > > > > Yes, especially if we fix the default encoding as UTF-8. (I'm > > expecting feedback from HP on this next week, hopefully when I see the > > details, it'll be clear that don't need a per-thread default encoding > > to solve their problems; that's quite a likely outcome. If not, we > > have a real-world argument for allowing a variable default encoding, > > without carnage.) 
> > Fair enough :-) > > > > However, one should not forget that UTF-8 is in fact a > > > variable length encoding of Unicode characters, that is up to > > > 3 bytes form a *single* character. This is obviously not compatible > > > with definitions that explicitly state data to be using a > > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > > work like it does in Latin-1 text. > > > > Sure, but where in current Python are there such requirements? > > It was my understanding that "t#" refers to single byte character > data. That's where the above arguments were aiming at... t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing. > > > So if we are to do the integration, we'll have to choose > > > argument parser markers that allow for multi byte characters. > > > "t#" does not fall into this category, "s#" certainly does, > > > "s" is argueable. > > > > I disagree. I grepped through the source for s# and t#. Here's a bit > > of background. Before t# was introduced, s# was being used for two > > distinct purposes: (1) to get an 8-bit text string plus its length, in > > situations where the length was needed; (2) to get binary data (e.g. > > GIF data read from a file in "rb" mode). Greg pointed out that if we > > ever introduced some form of Unicode support, these two had to be > > disambiguated. We found that the majority of uses was for (2)! > > Therefore we decided to change the definition of s# to mean only (2), > > and introduced t# to mean (1). Also, we introduced getcharbuffer > > corresponding to t#, while getreadbuffer was meant for s#. > > I know its too late now, but I can't really follow the arguments > here: in what ways are (1) and (2) different from the implementations > point of view ? If "t#" is to return UTF-8 then <length of the > buffer> will not equal <text length>, so both parser markers return > essentially the same information. The only difference would be > on the semantic side: (1) means: give me text data, while (2) does > not specify the data type. > > Perhaps I'm missing something... The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.) > > Note that the definition of the 's' format was left alone -- as > > before, it means you need an 8-bit text string not containing null > > bytes. > > This definition should then be changed to "text string without > null bytes" dropping the 8-bit reference. Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? 
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly? > > Our expectation was that a Unicode string passed to an s# situation > > would give a pointer to the internal format plus a byte count (not a > > character count!) while t# would get a pointer to some kind of 8-bit > > translation/encoding plus a byte count, with the explicit requirement > > that the 8-bit translation would have the same lifetime as the > > original unicode object. We decided to leave it up to the next > > generation (i.e., Marc-Andre :-) to decide what kind of translation to > > use and what to do when there is no reasonable translation. > > Hmm, I would strongly object to making "s#" return the internal > format. file.write() would then default to writing UTF-16 data > instead of UTF-8 data. This could result in strange errors > due to the UTF-16 format being endian dependent. But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode. > It would also break the symmetry between file.write(u) and > unicode(file.read()), since the default encoding is not used as > internal format for other reasons (see proposal). If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes and check for a BOM and then decide to choose bewteen 'utf-16-be' and 'utf-16-le'.) > > Any of the following choices is acceptable (from the point of view of > > not breaking the intended t# semantics; we can now start deciding > > which we like best): > > I think we have already agreed on using UTF-8 for the default > encoding. It has quite a few advantages. See > > http://czyborra.com/utf/ > > for a good overview of the pros and cons. Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means). > > - utf-8 > > - latin-1 > > - ascii > > - shift-jis > > - lower byte of unicode ordinal > > - some user- or os-specified multibyte encoding > > > > As far as t# is concerned, for encodings that don't encode all of > > Unicode, untranslatable characters could be dealt with in any number > > of ways (raise an exception, ignore, replace with '?', make best > > effort, etc.). > > The usual Python way would be: raise an exception. This is what > the proposal defines for Codecs in case an encoding/decoding > mapping is not possible, BTW. (UTF-8 will always succeed on > output.) Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings. > > Given the current context, it should probably be the same as the > > default encoding -- i.e., utf-8. 
If we end up making the default > > user-settable, we'll have to decide what to do with untranslatable > > characters -- but that will probably be decided by the user too (it > > would be a property of a specific translation specification). > > > > In any case, I feel that t# could receive a multi-byte encoding, > > s# should receive raw binary data, and they should correspond to > > getcharbuffer and getreadbuffer, respectively. > > Why would you want to have "s#" return the raw binary data for > Unicode objects ? Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits. > Note that it is not mentioned anywhere that > "s#" and "t#" do have to necessarily return different things > (binary being a superset of text). I'd opt for "s#" and "t#" both > returning UTF-8 data. This can be implemented by delegating the > buffer slots to the <defencstr> object (see below). This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this. > > > Now Greg would chime in with the buffer interface and > > > argue that it should make the underlying internal > > > format accessible. This is a bad idea, IMHO, since you > > > shouldn't really have to know what the internal data format > > > is. > > > > This is for C code. Quite likely it *does* know what the internal > > data format is! > > C code can use the PyUnicode_* APIs to access the data. I > don't think that argument parsing is powerful enough to > provide the C code with enough information about the data > contents, e.g. it can only state the encoding length, not the > string length. Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway). > > > Defining "s#" to return UTF-8 data does not only > > > make "s" and "s#" return the same data format (which should > > > always be the case, IMO), > > > > That was before t# was introduced. No more, alas. If you replace s# > > with t#, I agree with you completely. > > Done :-) > > > > but also hides the internal > > > format from the user and gives him a reliable cross-platform > > > data representation of Unicode data (note that UTF-8 doesn't > > > have the byte order problems of UTF-16). > > > > > > If you are still with, let's look at what "s" and "s#" > > > > (and t#, which is more relevant here) > > > > > do: they return pointers into data areas which have to > > > be kept alive until the corresponding object dies. > > > > > > The only way to support this feature is by allocating > > > a buffer for just this purpose (on the fly and only if > > > needed to prevent excessive memory load). The other > > > options of adding new magic parser markers or switching > > > to more generic one all have one downside: you need to > > > change existing code which is in conflict with the idea > > > we started out with. > > > > Agreed. I think this was our thinking when Greg & I introduced t#. > > My own preference would be to allocate a whole string object, not > > just a buffer; this could then also be used for the .encode() method > > using the default encoding. > > Good point. 
I'll change <defencbuf> to <defencstr>, a Python > string object created on request. > > > > So, again, the question is: do we want this magical > > > intergration or not ? Note that this is a design question, > > > not one of memory consumption... > > > > Yes, I want it. > > > > Note that this doesn't guarantee that all old extensions will work > > flawlessly when passed Unicode objects; but I think that it covers > > most cases where you could have a reasonable expectation that it > > works. > > > > (Hm, unfortunately many reasonable expectations seem to involve > > the current user's preferred encoding. :-( ) > > -- > Marc-Andre Lemburg --Guido van Rossum (home page: http://www.python.org/~guido/) From Mike.Da.Silva@uk.fid-intl.com Mon Nov 15 16:01:59 1999 From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike) Date: Mon, 15 Nov 1999 16:01:59 -0000 Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Andy Robinson wrote: 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? A user defined chunking factor (suitably defaulted) would be useful for processing large files. 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. The problem here is that we need to decide whether we are Unicode-centric, or whether Unicode is just another supported encoding. 
If we are Unicode-centric, then all code-page translations will require static mapping tables between the appropriate Unicode character and the relevant code points in the other encoding. This would involve (worst case) 64k static tables for each supported encoding. Unfortunately this also precludes the use of algorithmic conversions and or sparse conversion tables because most of these transformations are relative to a source and target non-Unicode encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see CDRA), then we can mix and match approaches, and treat Unicode strings as just Unicode, and normal strings as being any arbitrary MBCS encoding. To guarantee the utmost interoperability and Unicode 3.0 (and beyond) compliance, we should probably assume that all core encodings are relative to Unicode as the pivot encoding. This should hopefully avoid any gotcha's with roundtrips between any two arbitrary native encodings. The downside is this will probably be slower than an optimised algorithmic transformation. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org <http://www.python.org> ? Should there be an optional package outside the main distribution? Ship with Unicode encodings in the core, the rest should be an add on package. If we are truly Unicode-centric, this gives us the most value in terms of accessing a Unicode character properties database, which will provide language neutral case folding, Hankaku <----> Zenkaku folding (Japan specific), and composition / normalisation between composed characters and their component nonspacing characters. Regards, Mike da Silva From andy@robanal.demon.co.uk Mon Nov 15 16:18:13 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST) Subject: [Python-Dev] just say no... Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com> --- Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > Did you read Andy Robinson's case study? He > suggested that for certain encodings there may be > other things you can do that are more > user-friendly than raising an exception, depending > on the application. I am proposing to leave this a > detail of each specific translation. > There may even be translations that do the same thing > except they have a different behavior for > untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version > that replaces bad characters with '?'. I think this > is one of the powers of having an extensible set of > encodings. This would be a desirable option in almost every case. Default is an exception (I want to know my data is not clean), but an option to specify an error character. It is usually a question mark but Mike tells me that some encodings specify the error character to use. Example - I query a Sybase Unicode database containing European accents or Japanese. By default it will give me question marks. If I issue the command 'set char_convert utf8', then I see the lot (as garbage, but never mind). If it always errored whenever a query result contained unexpected data, it would be almost impossible to maintain the database. 
If I wrote my own codec class for a family of encodings, I'd give it an even wider variety of error-logging options - maybe a mode where it told me where in the file the dodgy characters were. We've already taken the key step by allowing codecs to be separate objects registered at run-time, implemented in either C or Python. This means that once again Python will have the most flexible solution around. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From jim@digicool.com Mon Nov 15 16:29:13 1999 From: jim@digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:29:13 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <383034D9.6E1E74D4@digicool.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? I would prefer to see a different interface altogether: PyObject *PyErr_StringFormat(errtype, format, buildformat, ...) So, you could generate an error like this: return PyErr_StringFormat(ErrorObject, "You had too many, %d, foos. The last one was %s", "iO", n, someObject) I implemented this in cPickle. See cPickle_ErrFormat. (Note that it always returns NULL.) Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Mon Nov 15 16:54:10 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us> >>>>> "Guido" == Guido van Rossum <guido@cnri.reston.va.us> writes: Guido> Assuming that Linux and Solaris have vsnprintf(), can't we Guido> just use the configure script to detect it, and issue a Guido> warning blaming the platform for those platforms that don't Guido> have it? That seems much simpler (from a maintenance Guido> perspective) than carrying our own implementation around Guido> (even if we can borrow the Apache version). Mailman uses vsnprintf in it's C wrapper. There's a simple configure test... # Checks for library functions. AC_CHECK_FUNCS(vsnprintf) ...and for systems that don't have a vsnprintf, I modified a version from GNU screen. 
It may not have gone through the scrutiny of Apache's implementation, but for Mailman it was more important that it be GPL'd (not a Python requirement). -Barry From jim@digicool.com Mon Nov 15 16:56:38 1999 From: jim@digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:56:38 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us> Message-ID: <38303B46.F6AEEDF1@digicool.com> "Andrew M. Kuchling" wrote: > > Guido van Rossum writes: > >Assuming that Linux and Solaris have vsnprintf(), can't we just use > >the configure script to detect it, and issue a warning blaming the > >platform for those platforms that don't have it? That seems much > > But people using an already-installed Python binary won't see any such > configure-time warning, and won't find out about the potential > problem. Plus, how do people fix the problem on platforms that don't > have vsnprintf() -- switch to Solaris or Linux? Not much of a > solution. (vsnprintf() isn't ANSI C, though it's a common extension, > so platforms that lack it aren't really deficient.) > > Hmm... could we maybe use Python's existing (string % vars) machinery? > <think think> No, that seems to be hard, because it would want > PyObjects, and we can't know what Python types to convert the varargs > to, unless we parse the format string (at which point we may as well > get a vsnprintf() implementation. It's easy. You use two format strings. One a Python string format, and the other a Py_BuildValue format. See my other note. Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From tismer@appliedbiometrics.com Mon Nov 15 17:02:20 1999 From: tismer@appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 18:02:20 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <38303C9C.42C5C830@appliedbiometrics.com> Guido van Rossum wrote: > > > I noticed this in PyErr_Format(exception, format, va_alist): > > > > char buffer[500]; /* Caller is responsible for limiting the format */ > > ... > > vsprintf(buffer, format, vargs); > > > > Making the caller responsible for this is error-prone. > > Agreed. The limit of 500 chars, while technically undocumented, is > part of the specs for PyErr_Format (which is currently wholly > undocumented). The current callers all have explicit precautions, but > of course I agree that this is a potential danger. All but one (checked them all): In ceval.c, function call_builtin, there is a possible security hole. If an extension module happens to create a very long type name (maybe just via a bug), we will crash. } PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", func->ob_type->tp_name); return NULL; } ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! 
Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From guido@CNRI.Reston.VA.US Mon Nov 15 19:32:00 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 14:32:00 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100." <38303C9C.42C5C830@appliedbiometrics.com> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us> > All but one (checked them all): Thanks for checking. > In ceval.c, function call_builtin, there is a possible security hole. > If an extension module happens to create a very long type name > (maybe just via a bug), we will crash. > > } > PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", > func->ob_type->tp_name); > return NULL; > } I would think that an extension module with a name of nearly 500 characters would draw a lot of attention as being ridiculous. If there was a bug through which you could make tp_name point to such a long string, you could probably exploit that bug without having to use this particular PyErr_Format() statement. However, I agree it's better to be safe than sorry, so I've checked in a fix making it %.400s. --Guido van Rossum (home page: http://www.python.org/~guido/) From tismer@appliedbiometrics.com Mon Nov 15 19:41:14 1999 From: tismer@appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 20:41:14 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us> Message-ID: <383061DA.CA5CB373@appliedbiometrics.com> Guido van Rossum wrote: > > > All but one (checked them all): [ceval.c without limits] > I would think that an extension module with a name of nearly 500 > characters would draw a lot of attention as being ridiculous. If > there was a bug through which you could make tp_name point to such a > long string, you could probably exploit that bug without having to use > this particular PyErr_Format() statement. Of course this case is very unlikely. My primary intent was to create such a mess without an extension, and ExtensionClass seemed to be a candidate since it synthetizes a type name at runtime (!). This would have been dangerous since EC is in the heart of Zope. But, I could not get at this special case since EC always stands the class/instance checks and so this case can never happen :( The above lousy result was just to say *something* after no success. > However, I agree it's better to be safe than sorry, so I've checked in > a fix making it %.400s. cheap, consistent, fine - thanks - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From mal@lemburg.com Mon Nov 15 19:04:59 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Mon, 15 Nov 1999 20:04:59 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us> Message-ID: <3830595B.348E8CC7@lemburg.com> Guido van Rossum wrote: > > [Misunderstanding in the reasoning behind "t#" and "s#"] > > Thanks for not picking an argument. Multibyte encodings typically > have ASCII as a subset (in such a way that an ASCII string is > represented as itself in bytes). This is the characteristic that's > needed in my view. > > > It was my understanding that "t#" refers to single byte character > > data. That's where the above arguments were aiming at... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > designed to be passed cleanly through processing steps that handle > single-byte character data, as long as they are 8-bit clean and don't > do too much processing. Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did. > > Perhaps I'm missing something... > > The idea is that (1)/s# disallows any translation of the data, while > (2)/t# requires translation of the data to an ASCII superset (possibly > multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data > contains text and that if the text consists of only ASCII characters > they are represented as themselves. (1)/s# makes no such assumption. > > In terms of implementation, Unicode objects should translate > themselves to the default encoding for t# (if possible), but they > should make the native representation available for s#. > > For example, take an encryption engine. While it is defined in terms > of byte streams, there's no requirement that the bytes represent > characters -- they could be the bytes of a GIF file, an MP3 file, or a > gzipped tar file. If we pass Unicode to an encryption engine, we want > Unicode to come out at the other end, not UTF-8. (If we had wanted to > encrypt UTF-8, we should have fed it UTF-8.) > > > > Note that the definition of the 's' format was left alone -- as > > > before, it means you need an 8-bit text string not containing null > > > bytes. > > > > This definition should then be changed to "text string without > > null bytes" dropping the 8-bit reference. > > Aha, I think there's a confusion about what "8-bit" means. For me, a > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? > (As far as I know, C uses char* to represent multibyte characters.) > Maybe we should disambiguate it more explicitly? There should be some definition for the two markers and the ideas behind them in the API guide, I guess. > > Hmm, I would strongly object to making "s#" return the internal > > format. file.write() would then default to writing UTF-16 data > > instead of UTF-8 data. This could result in strange errors > > due to the UTF-16 format being endian dependent. > > But this was the whole design. file.write() needs to be changed to > use s# when the file is open in binary mode and t# when the file is > open in text mode. Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). 
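(A quick sketch of that FAQ potential, assuming the proposed unicode() constructor, UTF-8 as the default encoding and a BOM-prefixed UTF-16 as the internal format; none of this is settled API:

    u = unicode('abc')     # proposed constructor, default encoding assumed to be UTF-8
    u.encode('utf-8')      # -> 'abc', 3 bytes; ASCII passes through unchanged
    u.encode('utf-16')     # -> 8 bytes: a 2-byte BOM plus three 16-bit units

The same three characters handed to a binary-mode file via "s#" would thus look nothing like what a text-mode file receives via "t#".)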
I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if being passed Unicode objects: this probably won't get us the "magical" Unicode integration we invision, since "t#" usage is not very wide spread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods... > > It would also break the symmetry between file.write(u) and > > unicode(file.read()), since the default encoding is not used as > > internal format for other reasons (see proposal). > > If the file is encoded using UTF-16 or UCS-2, you should open it in > binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the > app should read the first 2 bytes and check for a BOM and then decide > to choose bewteen 'utf-16-be' and 'utf-16-le'.) Right, that's the idea (there is a note on this in the Standard Codec section of the proposal). > > > Any of the following choices is acceptable (from the point of view of > > > not breaking the intended t# semantics; we can now start deciding > > > which we like best): > > > > I think we have already agreed on using UTF-8 for the default > > encoding. It has quite a few advantages. See > > > > http://czyborra.com/utf/ > > > > for a good overview of the pros and cons. > > Of course. I was just presenting the list as an argument that if > we changed our mind about the default encoding, t# should follow the > default encoding (and not pick an encoding by other means). Ok. > > > - utf-8 > > > - latin-1 > > > - ascii > > > - shift-jis > > > - lower byte of unicode ordinal > > > - some user- or os-specified multibyte encoding > > > > > > As far as t# is concerned, for encodings that don't encode all of > > > Unicode, untranslatable characters could be dealt with in any number > > > of ways (raise an exception, ignore, replace with '?', make best > > > effort, etc.). > > > > The usual Python way would be: raise an exception. This is what > > the proposal defines for Codecs in case an encoding/decoding > > mapping is not possible, BTW. (UTF-8 will always succeed on > > output.) > > Did you read Andy Robinson's case study? He suggested that for > certain encodings there may be other things you can do that are more > user-friendly than raising an exception, depending on the application. > I am proposing to leave this a detail of each specific translation. > There may even be translations that do the same thing except they have > a different behavior for untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version that replaces bad > characters with '?'. I think this is one of the powers of having an > extensible set of encodings. Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal. > > > Given the current context, it should probably be the same as the > > > default encoding -- i.e., utf-8. If we end up making the default > > > user-settable, we'll have to decide what to do with untranslatable > > > characters -- but that will probably be decided by the user too (it > > > would be a property of a specific translation specification). > > > > > > In any case, I feel that t# could receive a multi-byte encoding, > > > s# should receive raw binary data, and they should correspond to > > > getcharbuffer and getreadbuffer, respectively. > > > > Why would you want to have "s#" return the raw binary data for > > Unicode objects ? 
> > Because file.write() for a binary file, and other similar things > (e.g. the encryption engine example I mentioned above) must have > *some* way to get at the raw bits. What for ? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation because both encodings are in fact "internal" w/r to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed. > > Note that it is not mentioned anywhere that > > "s#" and "t#" do have to necessarily return different things > > (binary being a superset of text). I'd opt for "s#" and "t#" both > > returning UTF-8 data. This can be implemented by delegating the > > buffer slots to the <defencstr> object (see below). > > This would defeat the whole purpose of introducing t#. We might as > well drop t# then altogether if we adopt this. Well... yes ;-) > > > > Now Greg would chime in with the buffer interface and > > > > argue that it should make the underlying internal > > > > format accessible. This is a bad idea, IMHO, since you > > > > shouldn't really have to know what the internal data format > > > > is. > > > > > > This is for C code. Quite likely it *does* know what the internal > > > data format is! > > > > C code can use the PyUnicode_* APIs to access the data. I > > don't think that argument parsing is powerful enough to > > provide the C code with enough information about the data > > contents, e.g. it can only state the encoding length, not the > > string length. > > Typically, all the C code does is pass multibyte encoded strings on to > other library routines that know what to do to them, or simply give > them back unchanged at a later time. It is essential to know the > number of bytes, for memory allocation purposes. The number of > characters is totally immaterial (and multibyte-handling code knows > how to calculate the number of characters anyway). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Nov 15 19:20:55 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:20:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Message-ID: <38305D17.60EC94D0@lemburg.com> Andy Robinson wrote: > > Some thoughts on the codecs... > > 1. Stream interface > At the moment a codec has dump and load methods which > read a (slice of a) stream into a string in memory and > vice versa. As the proposal notes, this could lead to > errors if you take a slice out of a stream. This is > not just due to character truncation; some Asian > encodings are modal and have shift-in and shift-out > sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit > pointless to me as the source (or target) is still a > Unicode string in memory. > > This is a real problem - a filter to convert big files > between two encodings should be possible without > knowledge of the particular encoding, as should one on > the input/output of some server. We can still give a > default implementation for single-byte encodings. > > What's a good API for real stream conversion? just > Codec.encodeStream(infile, outfile) ? 
or is it more > useful to feed the codec with data a chunk at a time? The idea was to use Unicode as intermediate for all encoding conversions. What you invision here are stream recoders. The can easily be implemented as an useful addition to the Codec subclasses, but I don't think that these have to go into the core. > 2. Data driven codecs > I really like codecs being objects, and believe we > could build support for a lot more encodings, a lot > sooner than is otherwise possible, by making them data > driven rather making each one compiled C code with > static mapping tables. What do people think about the > approach below? > > First of all, the ISO8859-1 series are straight > mappings to Unicode code points. So one Python script > could parse these files and build the mapping table, > and a very small data file could hold these encodings. > A compiled helper function analogous to > string.translate() could deal with most of them. The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table. Static C data has the advantage of being shareable at the OS level. You can of course implement Python based lookup tables, but these should be too large... > Secondly, the double-byte ones involve a mixture of > algorithms and data. The worst cases I know are modal > encodings which need a single-byte lookup table, a > double-byte lookup table, and have some very simple > rules about escape sequences in between them. A > simple state machine could still handle these (and the > single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally > data-driven set of rules. > > Third, we can massively compress the mapping tables > using a notation which just lists contiguous ranges; > and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but > with an extra 'smiley' at 0XFE32". In these cases, a > script can build a family of related codecs in an > auditable manner. These are all great ideas, but I think they unnecessarily complicate the proposal. > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or > 'everything that Unicode 3.0 has a mapping for'. The > latter is going to add some weight to the > distribution. What are people's feelings? Do we ship > any at all apart from the Unicode ones? Should new > encodings be downloadable from www.python.org? Should > there be an optional package outside the main > distribution? Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds. I would propose to only add some very basic encodings to the standard distribution, e.g. 
the ones mentioned under Standard Codecs in the proposal: 'utf-8': 8-bit variable length encoding 'utf-16': 16-bit variable length encoding (litte/big endian) 'utf-16-le': utf-16 but explicitly little endian 'utf-16-be': utf-16 but explicitly big endian 'ascii': 7-bit ASCII codepage 'latin-1': Latin-1 codepage 'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Pythin Lib 'jis' (a popular version XXX): Japanese character encoding 'unicode-escape': See Unicode Constructors for a definition 'native': Dump of the Internal Format used by Python Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Nov 15 19:26:16 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:26:16 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <38305E58.28B20E24@lemburg.com> "Da Silva, Mike" wrote: > > Andy Robinson wrote: > -- > 1. Stream interface > At the moment a codec has dump and load methods which read a (slice of a) > stream into a string in memory and vice versa. As the proposal notes, this > could lead to errors if you take a slice out of a stream. This is not just > due to character truncation; some Asian encodings are modal and have > shift-in and shift-out sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit pointless to me as the > source (or target) is still a Unicode string in memory. > This is a real problem - a filter to convert big files between two encodings > should be possible without knowledge of the particular encoding, as should > one on the input/output of some server. We can still give a default > implementation for single-byte encodings. > What's a good API for real stream conversion? just > Codec.encodeStream(infile, outfile) ? or is it more useful to feed the > codec with data a chunk at a time? > -- > A user defined chunking factor (suitably defaulted) would be useful for > processing large files. > -- > 2. Data driven codecs > I really like codecs being objects, and believe we could build support for a > lot more encodings, a lot sooner than is otherwise possible, by making them > data driven rather making each one compiled C code with static mapping > tables. What do people think about the approach below? > First of all, the ISO8859-1 series are straight mappings to Unicode code > points. So one Python script could parse these files and build the mapping > table, and a very small data file could hold these encodings. A compiled > helper function analogous to string.translate() could deal with most of > them. > Secondly, the double-byte ones involve a mixture of algorithms and data. > The worst cases I know are modal encodings which need a single-byte lookup > table, a double-byte lookup table, and have some very simple rules about > escape sequences in between them. A simple state machine could still handle > these (and the single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally data-driven set of rules. 
> Third, we can massively compress the mapping tables using a notation which > just lists contiguous ranges; and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but with an extra > 'smiley' at 0XFE32". In these cases, a script can build a family of related > codecs in an auditable manner. > -- > The problem here is that we need to decide whether we are Unicode-centric, > or whether Unicode is just another supported encoding. If we are > Unicode-centric, then all code-page translations will require static mapping > tables between the appropriate Unicode character and the relevant code > points in the other encoding. This would involve (worst case) 64k static > tables for each supported encoding. Unfortunately this also precludes the > use of algorithmic conversions and or sparse conversion tables because most > of these transformations are relative to a source and target non-Unicode > encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see > CDRA), then we can mix and match approaches, and treat Unicode strings as > just Unicode, and normal strings as being any arbitrary MBCS encoding. > > To guarantee the utmost interoperability and Unicode 3.0 (and beyond) > compliance, we should probably assume that all core encodings are relative > to Unicode as the pivot encoding. This should hopefully avoid any gotcha's > with roundtrips between any two arbitrary native encodings. The downside is > this will probably be slower than an optimised algorithmic transformation. Optimizations should go into separate packages for direct EncodingA -> EncodingB conversions. I don't think we need them in the core. > -- > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or 'everything that > Unicode 3.0 has a mapping for'. The latter is going to add some weight to > the distribution. What are people's feelings? Do we ship any at all apart > from the Unicode ones? Should new encodings be downloadable from > www.python.org <http://www.python.org> ? Should there be an optional > package outside the main distribution? > -- > Ship with Unicode encodings in the core, the rest should be an add on > package. > > If we are truly Unicode-centric, this gives us the most value in terms of > accessing a Unicode character properties database, which will provide > language neutral case folding, Hankaku <----> Zenkaku folding (Japan > specific), and composition / normalisation between composed characters and > their component nonspacing characters. >From the proposal: """ Unicode Character Properties: ----------------------------- A separate module "unicodedata" should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file. Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc. Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt which takes up around 200kB. For this reason, the data should be stored in static C data. This enables compilation as shared module which the underlying OS can shared between processes (unlike normal Python code modules). XXX Define the interface... """ Special CJK packages can then access this data for the purposes you mentioned above. 
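As a concrete illustration of the still-open "XXX Define the interface..." item quoted above, here is a minimal sketch of what a lookup interface over UnicodeData.txt could look like. The class and method names are purely illustrative (nothing like this has been decided), and per the proposal the table itself would really live in static, shareable C data rather than a Python dictionary:

    class UnicodeDatabase:
        # records: maps a code point to a (name, category, numeric value)
        # tuple parsed from UnicodeData.txt
        def __init__(self, records):
            self.records = records
        def name(self, ordinal):
            return self.records[ordinal][0]
        def category(self, ordinal):
            # two-letter general category, e.g. 'Lu', 'Nd', 'Zs'
            return self.records[ordinal][1]
        def numeric(self, ordinal, default=None):
            value = self.records[ordinal][2]
            if value is None:
                return default
            return value
        def isdigit(self, ordinal):
            return self.records[ordinal][1] == 'Nd'

    # usage, given a hypothetical loader for UnicodeData.txt:
    # db = UnicodeDatabase(load_unicode_data('UnicodeData.txt'))
    # db.category(0x0041)  -> 'Lu'
    # db.numeric(0x00BD)   -> 0.5   (VULGAR FRACTION ONE HALF)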
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@CNRI.Reston.VA.US Mon Nov 15 21:37:28 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:37:28 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100." <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us> > Andy Robinson wrote: > > > > Some thoughts on the codecs... > > > > 1. Stream interface > > At the moment a codec has dump and load methods which > > read a (slice of a) stream into a string in memory and > > vice versa. As the proposal notes, this could lead to > > errors if you take a slice out of a stream. This is > > not just due to character truncation; some Asian > > encodings are modal and have shift-in and shift-out > > sequences as they move from Western single-byte > > characters to double-byte ones. It also seems a bit > > pointless to me as the source (or target) is still a > > Unicode string in memory. > > > > This is a real problem - a filter to convert big files > > between two encodings should be possible without > > knowledge of the particular encoding, as should one on > > the input/output of some server. We can still give a > > default implementation for single-byte encodings. > > > > What's a good API for real stream conversion? just > > Codec.encodeStream(infile, outfile) ? or is it more > > useful to feed the codec with data a chunk at a time? M.-A. Lemburg responds: > The idea was to use Unicode as intermediate for all > encoding conversions. > > What you invision here are stream recoders. The can > easily be implemented as an useful addition to the Codec > subclasses, but I don't think that these have to go > into the core. What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to efficient handle shift states. This is not exactly what Andy shows, but it's not what Marc's current spec has either. I had thought something more like what Java does: an output stream codec's constructor takes a writable file object and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequeuces. Similar for input stream codecs. Andy's file translation example could then be written as follows: # assuming variables input_file, input_encoding, output_file, # output_encoding, and constant BUFFER_SIZE f = open(input_file, "rb") f1 = unicodec.codecs[input_encoding].stream_reader(f) g = open(output_file, "wb") g1 = unicodec.codecs[output_encoding].stream_writer(f) while 1: buffer = f1.read(BUFFER_SIZE) if not buffer: break f2.write(buffer) f2.close() f1.close() Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.) > > 2. 
Data driven codecs > > I really like codecs being objects, and believe we > > could build support for a lot more encodings, a lot > > sooner than is otherwise possible, by making them data > > driven rather making each one compiled C code with > > static mapping tables. What do people think about the > > approach below? > > > > First of all, the ISO8859-1 series are straight > > mappings to Unicode code points. So one Python script > > could parse these files and build the mapping table, > > and a very small data file could hold these encodings. > > A compiled helper function analogous to > > string.translate() could deal with most of them. > > The problem with these large tables is that currently > Python modules are not shared among processes since > every process builds its own table. > > Static C data has the advantage of being shareable at > the OS level. Don't worry about it. 128K is too small to care, I think... > You can of course implement Python based lookup tables, > but these should be too large... > > > Secondly, the double-byte ones involve a mixture of > > algorithms and data. The worst cases I know are modal > > encodings which need a single-byte lookup table, a > > double-byte lookup table, and have some very simple > > rules about escape sequences in between them. A > > simple state machine could still handle these (and the > > single-byte mappings above become extra-simple special > > cases); I could imagine feeding it a totally > > data-driven set of rules. > > > > Third, we can massively compress the mapping tables > > using a notation which just lists contiguous ranges; > > and very often there are relationships between > > encodings. For example, "cpXYZ is just like cpXYY but > > with an extra 'smiley' at 0XFE32". In these cases, a > > script can build a family of related codecs in an > > auditable manner. > > These are all great ideas, but I think they unnecessarily > complicate the proposal. Agreed, let's leave the *implementation* of codecs out of the current efforts. However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvald's philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.) > > 3. What encodings to distribute? > > The only clean answers to this are 'almost none', or > > 'everything that Unicode 3.0 has a mapping for'. The > > latter is going to add some weight to the > > distribution. What are people's feelings? Do we ship > > any at all apart from the Unicode ones? Should new > > encodings be downloadable from www.python.org? Should > > there be an optional package outside the main > > distribution? > > Since Codecs can be registered at runtime, there is quite > some potential there for extension writers coding their > own fast codecs. E.g. one could use mxTextTools as codec > engine working at C speeds. (Do you think you'll be able to extort some money from HP for these? :-) > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > > Perhaps not even 'html-entities' (even though it would make > a cool replacement for cgi.escape()) and maybe we should > also place the JIS encoding into a separate Unicode package. I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?) For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation. And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here: | Python should provide a built-in constructor for Unicode strings which | is available through __builtins__: | | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ What do you mean by this notation? Since encoding names are not always legal Python identifiers (most contain hyphens), I don't understand what you really meant here. Do you mean to say that it has to be a keyword argument? I would disagree; and then I would have expected the notation [,encoding=<default encoding>]. | With the 'unicode-escape' encoding being defined as: | | u = u'<unicode-escape encoded Python string>' | | · for single characters (and this includes all \XXX sequences except \uXXXX), | take the ordinal and interpret it as Unicode ordinal; | | · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX | instead, e.g. \u03C0 to represent the character Pi. I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-). Can you give some examples? Is u'\u0020' different from u'\x20' (a space)? Does '\u0020' (no u prefix) have a meaning? Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree. While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named François might write a file containing this statement: print "Written by François." # (There's a cedilla in there!) (He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.) If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage. Using the new Unicode strings, François could change his program as follows: print unicode("Written by François.", "latin-1") Assuming that François sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). But when the Japanese user views François' source file, he will again see garbage. 
If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis". What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type print u"Written by Fran\u00E7ois." but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either). --Guido van Rossum (home page: http://www.python.org/~guido/) From andy@robanal.demon.co.uk Mon Nov 15 21:41:21 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:41:21 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <38307984.12653394@post.demon.co.uk> On Mon, 15 Nov 1999 20:20:55 +0100, you wrote: >These are all great ideas, but I think they unnecessarily >complicate the proposal. However, to claim that Python is properly internationalized, we will need a large number of multi-byte encodings to be available. It's a large amount of work, it must be provably correct, and someone's going to have to do it. So if anyone with more C expertise than me - not hard :-) - is interested I'm not suggesting putting my points in the Unicode proposal - in fact, I'm very happy we have a proposal which allows for extension, and lets us work on the encodings separately (and later). >Since Codecs can be registered at runtime, there is quite >some potential there for extension writers coding their >own fast codecs. E.g. one could use mxTextTools as codec >engine working at C speeds. Exactly my thoughts , although I was thinking of a more slimmed down and specialized one. The right tool might be usable for things like compression algorithms too. Separate project to the Unicode stuff, but if anyone is interested, talk to me. >I would propose to only add some very basic encodings to >the standard distribution, e.g. the ones mentioned under >Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core. Ditto with HTML, and maybe even escaped-unicode too. In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully. - Andy From guido@CNRI.Reston.VA.US Mon Nov 15 21:49:26 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:49:26 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT." 
<38307984.12653394@post.demon.co.uk> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us> > In summary, the current discussion is clearly doing the right things, > but is only covering a small percentage of what needs to be done to > internationalize Python fully. Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture! Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.) --Guido van Rossum (home page: http://www.python.org/~guido/) From andy@robanal.demon.co.uk Mon Nov 15 21:58:34 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:58:34 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us> Message-ID: <3831806d.14422147@post.demon.co.uk> On Mon, 15 Nov 1999 16:49:26 -0500, you wrote: >> In summary, the current discussion is clearly doing the right things, >> but is only covering a small percentage of what needs to be done to >> internationalize Python fully. > >Agreed. So let's focus on defining interfaces that are correct and >convenient so others who want to add codecs won't have to fight our >architecture! > >Is the current architecture good enough so that the Japanese codecs >will fit in it? (I'm particularly worried about the stream codecs, >see my previous message.) > No, I don't think it is good enough. We need a stream codec, and as you said the string and file interfaces can be built out of that. You guys will know better than me what the best patterns for that are... - Andy From andy@robanal.demon.co.uk Mon Nov 15 22:30:53 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 22:30:53 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <383086da.16067684@post.demon.co.uk> On Mon, 15 Nov 1999 16:37:28 -0500, you wrote: ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(f) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > >f2.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) Perfect. I'd keep the string ones - easy to implement but a big convenience. 
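For what it's worth, the string-level convenience really is only a few lines over the stream interface, using cStringIO exactly as Guido suggests. A rough sketch; the stream_writer/stream_reader methods and the shape of the codec object are assumptions, not settled API:

    import cStringIO

    def encode_to_string(u, codec):
        # push a Unicode object through the codec's stream writer and
        # collect the encoded bytes from an in-memory "file"
        buffer = cStringIO.StringIO()
        writer = codec.stream_writer(buffer)
        writer.write(u)
        writer.flush()        # make sure any pending shift state is emitted
        return buffer.getvalue()

    def decode_from_string(s, codec):
        # the reverse direction: wrap the byte string in a file-like object
        # and let the codec's stream reader decode the lot
        reader = codec.stream_reader(cStringIO.StringIO(s))
        return reader.read()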
The proposal also says: >For explicit handling of Unicode using files, the unicodec module >could provide stream wrappers which provide transparent >encoding/decoding for any open stream (file-like object): > > import unicodec > file = open('mytext.txt','rb') > ufile = unicodec.stream(file,'utf-16') > u = ufile.read() > ... > ufile.close() It seems to me that if we go for stream_reader, it replaces this bit of the proposal too - no need for unicodec to provide anything. If you want to have a convenience function there to save a line or two, you could have unicodec.open(filename, mode, encoding) which returned a stream_reader. - Andy From mal@lemburg.com Mon Nov 15 22:54:38 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 23:54:38 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <38308F2E.44B9C6BF@lemburg.com> [I'll get back on this tomorrow, just some quick notes here...] Guido van Rossum wrote: > > > Andy Robinson wrote: > > > > > > Some thoughts on the codecs... > > > > > > 1. Stream interface > > > At the moment a codec has dump and load methods which > > > read a (slice of a) stream into a string in memory and > > > vice versa. As the proposal notes, this could lead to > > > errors if you take a slice out of a stream. This is > > > not just due to character truncation; some Asian > > > encodings are modal and have shift-in and shift-out > > > sequences as they move from Western single-byte > > > characters to double-byte ones. It also seems a bit > > > pointless to me as the source (or target) is still a > > > Unicode string in memory. > > > > > > This is a real problem - a filter to convert big files > > > between two encodings should be possible without > > > knowledge of the particular encoding, as should one on > > > the input/output of some server. We can still give a > > > default implementation for single-byte encodings. > > > > > > What's a good API for real stream conversion? just > > > Codec.encodeStream(infile, outfile) ? or is it more > > > useful to feed the codec with data a chunk at a time? > > M.-A. Lemburg responds: > > > The idea was to use Unicode as intermediate for all > > encoding conversions. > > > > What you invision here are stream recoders. The can > > easily be implemented as an useful addition to the Codec > > subclasses, but I don't think that these have to go > > into the core. > > What I wanted was a codec API that acts somewhat like a buffered file; > the buffer makes it possible to efficient handle shift states. This > is not exactly what Andy shows, but it's not what Marc's current spec > has either. > > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequeuces. Similar for input stream codecs. The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets. The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work. 
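A minimal sketch of the kind of run-time registry implied here; the function names and the case-insensitive lookup are illustrative assumptions, not part of the proposal:

    import string

    _codec_registry = {}

    def register(name, codec):
        # codec objects may be implemented in C or Python and are
        # registered at run time under a lower-cased encoding name
        _codec_registry[string.lower(name)] = codec

    def lookup(name):
        try:
            return _codec_registry[string.lower(name)]
        except KeyError:
            raise LookupError, "unknown encoding: %s" % name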
>From the proposal: """ For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. XXX Specify the wrapper(s)... Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. """ > Andy's file translation example could then be written as follows: > > # assuming variables input_file, input_encoding, output_file, > # output_encoding, and constant BUFFER_SIZE > > f = open(input_file, "rb") > f1 = unicodec.codecs[input_encoding].stream_reader(f) > g = open(output_file, "wb") > g1 = unicodec.codecs[output_encoding].stream_writer(f) > > while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > > f2.close() > f1.close() > Note that we could possibly make these the only API that a codec needs > to provide; the string object <--> unicode object conversions can be > done using this and the cStringIO module. (On the other hand it seems > a common case that would be quite useful.) You wouldn't want to go via cStringIO for *every* encoding translation. The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects). > > > 2. Data driven codecs > > > I really like codecs being objects, and believe we > > > could build support for a lot more encodings, a lot > > > sooner than is otherwise possible, by making them data > > > driven rather making each one compiled C code with > > > static mapping tables. What do people think about the > > > approach below? > > > > > > First of all, the ISO8859-1 series are straight > > > mappings to Unicode code points. So one Python script > > > could parse these files and build the mapping table, > > > and a very small data file could hold these encodings. > > > A compiled helper function analogous to > > > string.translate() could deal with most of them. > > > > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... Huh ? 128K for every process using Python ? That quickly sums up to lots of megabytes lying around pretty much unused. > > You can of course implement Python based lookup tables, > > but these should be too large... > > > > > Secondly, the double-byte ones involve a mixture of > > > algorithms and data. The worst cases I know are modal > > > encodings which need a single-byte lookup table, a > > > double-byte lookup table, and have some very simple > > > rules about escape sequences in between them. A > > > simple state machine could still handle these (and the > > > single-byte mappings above become extra-simple special > > > cases); I could imagine feeding it a totally > > > data-driven set of rules. 
> > > > > > Third, we can massively compress the mapping tables > > > using a notation which just lists contiguous ranges; > > > and very often there are relationships between > > > encodings. For example, "cpXYZ is just like cpXYY but > > > with an extra 'smiley' at 0XFE32". In these cases, a > > > script can build a family of related codecs in an > > > auditable manner. > > > > These are all great ideas, but I think they unnecessarily > > complicate the proposal. > > Agreed, let's leave the *implementation* of codecs out of the current > efforts. > > However I want to make sure that the *interface* to codecs is defined > right, because changing it will be expensive. (This is Linus > Torvald's philosophy on drivers -- he doesn't care about bugs in > drivers, as they will get fixed; however he greatly cares about > defining the driver APIs correctly.) > > > > 3. What encodings to distribute? > > > The only clean answers to this are 'almost none', or > > > 'everything that Unicode 3.0 has a mapping for'. The > > > latter is going to add some weight to the > > > distribution. What are people's feelings? Do we ship > > > any at all apart from the Unicode ones? Should new > > > encodings be downloadable from www.python.org? Should > > > there be an optional package outside the main > > > distribution? > > > > Since Codecs can be registered at runtime, there is quite > > some potential there for extension writers coding their > > own fast codecs. E.g. one could use mxTextTools as codec > > engine working at C speeds. > > (Do you think you'll be able to extort some money from HP for these? :-) Don't know, it depends on what their specs look like. I use mxTextTools for fast HTML file processing. It uses a small Turing machine with some extra magic and is progammable via Python tuples. > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > > > Perhaps not even 'html-entities' (even though it would make > > a cool replacement for cgi.escape()) and maybe we should > > also place the JIS encoding into a separate Unicode package. > > I'd drop html-entities, it seems too cutesie. (And who uses these > anyway, outside browsers?) Ok. > For JIS (shift-JIS?) I hope that Andy can help us with some pointers > and validation. > > And unicode-escape: now that you mention it, this is a section of > the proposal that I don't understand. I quote it here: > > | Python should provide a built-in constructor for Unicode strings which > | is available through __builtins__: > | > | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ I meant this as optional second argument defaulting to whatever we define <default encoding> to mean, e.g. 'utf-8'. u = unicode("string","utf-8") == unicode("string") The <encoding name> argument must be a string identifying one of the registered codecs. 
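In other words, the constructor amounts to little more than a dispatch through the codec registry; a sketch in which default_encoding, lookup and the decode method are all made-up names:

    default_encoding = 'utf-8'     # whatever <default encoding> ends up being

    def unicode_constructor(data, encoding=None):
        if encoding is None:
            encoding = default_encoding
        codec = lookup(encoding)   # must name a registered codec
        return codec.decode(data)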
> | With the 'unicode-escape' encoding being defined as: > | > | u = u'<unicode-escape encoded Python string>' > | > | · for single characters (and this includes all \XXX sequences except \uXXXX), > | take the ordinal and interpret it as Unicode ordinal; > | > | · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX > | instead, e.g. \u03C0 to represent the character Pi. > > I've looked at this several times and I don't see the difference > between the two bullets. (Ironically, you are using a non-ASCII > character here that doesn't always display, depending on where I look > at your mail :-). The first bullet covers the normal Python string characters and escapes, e.g. \n and \267 (the center dot ;-), while the second explains how \uXXXX is interpreted. > Can you give some examples? > > Is u'\u0020' different from u'\x20' (a space)? No, they both map to the same Unicode ordinal. > Does '\u0020' (no u prefix) have a meaning? No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding: u = u'\u0020' == unicode(r'\u0020','unicode-escape') Note that writing \uXX is an error, e.g. u"\u12 " will cause cause a syntax error. Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ? > Also, I remember reading Tim Peters who suggested that a "raw unicode" > notation (ur"...") might be necessary, to encode regular expressions. > I tend to agree. This can be had via unicode(): u = unicode(r'\a\b\c\u0020','unicode-escaped') If that's too long, define a ur() function which wraps up the above line in a function. > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. For > example, a programmer named François might write a file containing > this statement: > > print "Written by François." # (There's a cedilla in there!) > > (He assumes his source character encoding is Latin-1, and he doesn't > want to have to type \347 when he can type a cedilla on his keyboard.) > > If his source file (or .pyc file!) is executed by a Japanese user, > this will probably print some garbage. > > Using the new Unicode strings, François could change his program as > follows: > > print unicode("Written by François.", "latin-1") > > Assuming that François sets his sys.stdout to use Latin-1, while the > Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). > > But when the Japanese user views François' source file, he will again > see garbage. If he uses a generic tool to translate latin-1 files to > shift-JIS (assuming shift-JIS has a cedilla character) the program > will no longer work correctly -- the string "latin-1" has to be > changed to "shift-jis". > > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; François will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing François, he probably won't like this solution very much > (since he didn't like the \347 version either). I think best is to leave it undefined... as with all files, only the programmer knows what format and encoding it contains, e.g. 
a Japanese programmer might want to use a shift-JIS editor to enter strings directly in shift-JIS via u = unicode("...shift-JIS encoded text...","shift-jis") Of course, this is not readable using an ASCII editor, but Python will continue to produce the intended string. NLS strings don't belong into program text anyway: i10n usually takes the gettext() approach to handle these issues. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Tue Nov 16 00:09:28 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Tue, 16 Nov 1999 00:09:28 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> Message-ID: <3839a078.22625844@post.demon.co.uk> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: >[I'll get back on this tomorrow, just some quick notes here...] >The Codecs provide implementations for encoding and decoding, >they are not intended as complete wrappers for e.g. files or >sockets. > >The unicodec module will define a generic stream wrapper >(which is yet to be defined) for dealing with files, sockets, >etc. It will use the codec registry to do the actual codec >work. > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. > >The Codec interface defines two pairs of methods >on purpose: one which works internally (ie. directly between >strings and Unicode objects), and one which works externally >(directly between a stream and Unicode objects). That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc. So the codec should provide proper stream reading and writing services. Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility. - Andy From tim_one@email.msn.com Tue Nov 16 05:38:32 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:32 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <382AE7D9.147D58CB@lemburg.com> Message-ID: <000001bf2ff4$d36e2540$042d153f@tim> [MAL] > I wonder how we could add %-formatting to Unicode strings without > duplicating the PyString_Format() logic. > > First, do we need Unicode object %-formatting at all ? Sure -- in the end, all the world speaks Unicode natively and encodings become historical baggage. Granted I won't live that long, but I may last long enough to see encodings become almost purely an I/O hassle, with all computation done in Unicode. > Second, here is an emulation using strings and <default encoding> > that should give an idea of one could work with the different > encodings: > > s = '%s %i abcäöü' # a Latin-1 encoded string > t = (u,3) What's u? A Unicode object? Another Latin-1 string? A default-encoded string? How does the following know the difference? 
> # Convert Latin-1 s to a <default encoding> string via Unicode > s1 = unicode(s,'latin-1').encode() > > # The '%s' will now add u in <default encoding> > s2 = s1 % t > > # Finally, convert the <default encoding> encoded string to Unicode > u1 = unicode(s2) I don't expect this actually works: for example, change %s to %4s. Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to know that some (or all) characters in u consume multiple bytes, so can't extract "the right" number of bytes from u. I think % formating has to know the truth of what you're doing. > Note that .encode() defaults to the current setting of > <default encoding>. > > Provided u maps to Latin-1, an alternative would be: > > u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1') More interesting is fmt % tuple where everything is Unicode; people can muck with Latin-1 directly today using regular strings, so the example above mostly shows artificial convolution. From tim_one@email.msn.com Tue Nov 16 05:38:40 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:40 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382BDD81.458D3125@lemburg.com> Message-ID: <000101bf2ff4$d636bb20$042d153f@tim> [MAL, on raw Unicode strings] > ... > Agreed... note that you could also write your own codec for just this > reason and then use: > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > Put that into a function called 'ur' and you have: > > u = ur('...\u4545...\...\...') > > which is not that far away from ur'...' w/r to cosmetics. Well, not quite. In general you need to pass raw strings: u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') ^ u = ur(r'...\u4545...\...\...') ^ else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion). From tim_one@email.msn.com Tue Nov 16 05:44:42 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:44:42 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim> [Tim, wonders why Perl and Tcl went w/ UTF-8 internally] [Greg Stein] > Probably for the exact reason that you stated in your messages: many > 8-bit (7-bit?) functions continue to work quite well when given a > UTF-8-encoded string. i.e. they didn't have to rewrite the entire > Perl/TCL interpreter to deal with a new string type. > > I'd guess it is a helluva lot easier for us to add a Python Type than > for Perl or TCL to whack around with new string types (since they use > strings so heavily). Sounds convincing to me! Bumped into an old thread on c.l.p.m. that suggested Perl was also worried about UCS-2's 64K code point limit. But I'm already on record as predicting we'll regret any decision <wink>. From tim_one@email.msn.com Tue Nov 16 05:52:12 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:52:12 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <000501bf2ff6$ba943a80$042d153f@tim> [Da Silva, Mike] > ... > 5. 
UTF-16 requires string operations that do not make assumptions > about nulls - this means re-implementing most of the C runtime > functions to work with unsigned shorts. Python strings are already null-friendly, so Python has already recoded everything it needs to get away from the no-null assumption; stropmodule.c is < 1,500 lines of code, and MAL can turn it into C++ template functions in his sleep <wink -- but stuff "like this" really is easier in C++>. From tim_one@email.msn.com Tue Nov 16 05:56:18 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:56:18 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com> Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim> [Andy Robinson] > ... > I presume no one is actually advocating dropping > ordinary Python strings, or the ability to do > rawdata = open('myfile.txt', 'rb').read() > without any transformations? If anyone has advocated either, they've successfully hidden it from me. Anyone? From tim_one@email.msn.com Tue Nov 16 06:09:04 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:09:04 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BF6C3.D79840EC@lemburg.com> Message-ID: <000701bf2ff9$15cecda0$042d153f@tim> [MAL] > BTW, wouldn't it be possible to take pcre and have it > use Py_Unicode instead of char ? [Of course, there would have to > be some extensions for character classes etc.] No, alas. The assumption that characters are 8 bits is ubiquitous, in both obvious and subtle ways. if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; From tim_one@email.msn.com Tue Nov 16 06:19:16 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:19:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> Message-ID: <000801bf2ffa$82273400$042d153f@tim> [MAL] > sys.bom should return the byte order mark (BOM) for the format used > internally. The unicodec module should provide symbols for all > possible values of this variable: > > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) > > BOM_LE: '\377\376' > (corresponds to Unicode 0x0000FFFE in UTF-16 > == illegal Unicode character) > > BOM4_BE: '\000\000\377\376' > (corresponds to Unicode 0x0000FEFF in UCS-4) Should be BOM4_BE: '\000\000\376\377' > BOM4_LE: '\376\377\000\000' > (corresponds to Unicode 0x0000FFFE in UCS-4) Should be BOM4_LE: '\377\376\000\000' From tim_one@email.msn.com Tue Nov 16 06:31:39 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:31:39 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim> [Fred L. Drake, Jr.] > ... > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Yet another use for a weak reference <0.5 wink>. 
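[Editorial note: putting Tim's corrected BOM values together, here is a minimal sketch of how a reader could sniff the byte order from a leading BOM. The constant values follow MAL's draft plus Tim's corrections; guess_byte_order is an invented helper name used only for illustration, not part of the proposal.]

BOM_BE  = '\376\377'            # U+FEFF, UTF-16 big endian
BOM_LE  = '\377\376'            # U+FEFF, UTF-16 little endian
BOM4_BE = '\000\000\376\377'    # U+FEFF, UCS-4 big endian (Tim's correction)
BOM4_LE = '\377\376\000\000'    # U+FEFF, UCS-4 little endian (Tim's correction)

def guess_byte_order(data):
    # Check the 4-byte forms first: the little endian UCS-4 BOM starts
    # with the same two bytes as the little endian UTF-16 BOM.
    if data[:4] == BOM4_BE:
        return 'ucs-4', 'big'
    if data[:4] == BOM4_LE:
        return 'ucs-4', 'little'
    if data[:2] == BOM_BE:
        return 'utf-16', 'big'
    if data[:2] == BOM_LE:
        return 'utf-16', 'little'
    return None     # no BOM present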
From tim_one@email.msn.com Tue Nov 16 06:41:44 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:41:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim> [MAL] > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) [Greg Stein] > Are you sure about that interpretation? I thought the BOM characters > (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. I can't speak to MAL's degree of certainty <wink>, but he's right about this stuff. There is only one BOM character, U+FEFF, which is the zero-width no-break space. The byte-swapped form is not only reserved, it's guaranteed never to be assigned to a character. From tim_one@email.msn.com Tue Nov 16 07:47:06 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:06 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <000d01bf3006$c7823700$042d153f@tim> [Guido] > ... > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. > ... > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; François will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing François, he probably won't like this solution very much > (since he didn't like the \347 version either). So long as Python opens source files using libc text mode, it can't guarantee more than C does: the presence of any character other than tab, newline, and ASCII 32-126 inclusive renders the file contents undefined. Go beyond that, and you've got the same problem as mailers and browsers, and so also the same solution: open source files in binary mode, and add a pragma specifying the intended charset. As a practical matter, declare that Python source is Latin-1 for now, and declare any *system* that doesn't support that non-conforming <wink>. python-is-the-measure-of-all-things-ly y'rs - tim From tim_one@email.msn.com Tue Nov 16 07:47:08 1999 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:08 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim> [Guido] >> Does '\u0020' (no u prefix) have a meaning? [MAL] > No, \uXXXX is only defined for u"" strings or strings that are > used to build Unicode objects with this encoding: I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals. > Note that writing \uXX is an error, e.g. u"\u12 " will cause > cause a syntax error. Although I believe your intent <wink> is that, just as today, '\u12' is not an error. > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > but instead '\x10' -- is this intended ? Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two. 
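[Editorial note: a tiny sketch of the \uXXXX rules Tim and MAL are converging on here, shown with the semantics that later shipped in Python 2.x; the block is illustrative and not taken from the proposal text.]

assert len('\u0020') == 6        # plain string: backslash, 'u', '0', '0', '2', '0'
assert len(u'\u0020') == 1       # Unicode literal: a single space character
assert u'\u0020' == u'\x20' == u' '
assert len('\u12') == 4          # plain string: still no \u escape processing
# u"\u12" is rejected at compile time (truncated \uXXXX escape).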
> This [raw Unicode strings] can be had via unicode(): > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > If that's too long, define a ur() function which wraps up the > above line in a function. As before, I think that's fine for now, but won't stand forever. From fredrik@pythonware.com Tue Nov 16 08:39:20 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:39:20 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequeuces. Similar for input stream codecs. note that the html/sgml/xml parsers generally support the feed/close protocol. to be able to use these codecs in that context, we need 1) codes written according to the "data consumer model", instead of the "stream" model. class myDecoder: def __init__(self, target): self.target = target self.state = ... def feed(self, data): ... extract as much data as possible ... self.target.feed(extracted data) def close(self): ... extract what's left ... self.target.feed(additional data) self.target.close() or 2) make threads mandatory, just like in Java. or 3) add light-weight threads (ala stackless python) to the interpreter... (I vote for alternative 3, but that's another story ;-) </F> From fredrik@pythonware.com Tue Nov 16 08:58:50 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:58:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com> Tim Peters <tim_one@email.msn.com> wrote: > (\b is supposed to be a word boundary assertion). in some places, that is. </F> Main Entry: reg·u·lar Pronunciation: 're-gy&-l&r, 're-g(&-)l&r 1 : belonging to a religious order 2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ... 3 a : ORDERLY, METHODICAL <regular habits> ... 4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ... From jack@oratrix.nl Tue Nov 16 11:05:55 1999 From: jack@oratrix.nl (Jack Jansen) Date: Tue, 16 Nov 1999 12:05:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com> Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets (their equivalents of latin-1) too, as documents in these encoding are pretty ubiquitous. But maybe these should only be added on the respective platforms. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal@lemburg.com Tue Nov 16 08:35:28 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 09:35:28 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <000e01bf3006$c8c11fa0$042d153f@tim> Message-ID: <38311750.22D17EC1@lemburg.com> Tim Peters wrote: > > [Guido] > >> Does '\u0020' (no u prefix) have a meaning? > > [MAL] > > No, \uXXXX is only defined for u"" strings or strings that are > > used to build Unicode objects with this encoding: > > I believe your intent is that '\u0020' be exactly those 6 characters, just > as today. That is, it does have a meaning, but its meaning differs between > Unicode string literals and regular string literals. Right. > > Note that writing \uXX is an error, e.g. u"\u12 " will cause > > cause a syntax error. > > Although I believe your intent <wink> is that, just as today, '\u12' is not > an error. Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an exception. > > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > > but instead '\x10' -- is this intended ? > > Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee > for not defining \x in a platform-independent way. Note that a Python \x > escape consumes *all* following hex characters, no matter how many -- and > ignores all but the last two. Strange definition... > > This [raw Unicode strings] can be had via unicode(): > > > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > > > If that's too long, define a ur() function which wraps up the > > above line in a function. > > As before, I think that's fine for now, but won't stand forever. If Guido agrees to ur"", I can put that into the proposal too -- it's just that things are starting to get a little crowded for a strawman proposal ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 10:50:31 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:50:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <383136F7.AB73A90@lemburg.com> Andy Robinson wrote: > > Leave JISXXX and the CJK stuff out. 
If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there > are lots of options about how to do it. The other ones are > algorithmic and can be small and fast and fit into the core. > > Ditto with HTML, and maybe even escaped-unicode too. So I can drop JIS ? [I won't be able to drop the escaped unicode codec because this is needed for u"" and ur"".] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 10:42:19 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:42:19 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <3831350B.8F69CB6D@lemburg.com> Tim Peters wrote: > > [MAL, on raw Unicode strings] > > ... > > Agreed... note that you could also write your own codec for just this > > reason and then use: > > > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > > > Put that into a function called 'ur' and you have: > > > > u = ur('...\u4545...\...\...') > > > > which is not that far away from ur'...' w/r to cosmetics. > > Well, not quite. In general you need to pass raw strings: > > u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') > ^ > u = ur(r'...\u4545...\...\...') > ^ > > else Python will replace all the other backslash sequences. This is a > crucial distinction at times; e.g., else \b in a Unicode regexp will expand > into a backspace character before the regexp processor ever sees it (\b is > supposed to be a word boundary assertion). Right. Here is a sample implementation of what I had in mind: """ Demo for 'unicode-escape' encoding. """ import struct,string,re pack_format = '>H' def convert_string(s): l = map(None,s) for i in range(len(l)): l[i] = struct.pack(pack_format,ord(l[i])) return l u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') def unicode_unescape(s): l = [] start = 0 while start < len(s): m = u_escape.search(s,start) if not m: l[len(l):] = convert_string(s[start:]) break m_start,m_end = m.span() if m_start > start: l[len(l):] = convert_string(s[start:m_start]) hexcode = m.group(1) #print hexcode,start,m_start if len(hexcode) != 4: raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode ordinal = string.atoi(hexcode,16) l.append(struct.pack(pack_format,ordinal)) start = m_end #print l return string.join(l,'') def hexstr(s,sep=''): return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 10:40:42 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:40:42 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? References: <000001bf2ff4$d36e2540$042d153f@tim> Message-ID: <383134AA.4B49D178@lemburg.com> Tim Peters wrote: > > [MAL] > > I wonder how we could add %-formatting to Unicode strings without > > duplicating the PyString_Format() logic. > > > > First, do we need Unicode object %-formatting at all ? > > Sure -- in the end, all the world speaks Unicode natively and encodings > become historical baggage. Granted I won't live that long, but I may last > long enough to see encodings become almost purely an I/O hassle, with all > computation done in Unicode. 
> > > Second, here is an emulation using strings and <default encoding> > > that should give an idea of one could work with the different > > encodings: > > > > s = '%s %i abcäöü' # a Latin-1 encoded string > > t = (u,3) > > What's u? A Unicode object? Another Latin-1 string? A default-encoded > string? How does the following know the difference? u refers to a Unicode object in the proposal. Sorry, forgot to mention that. > > # Convert Latin-1 s to a <default encoding> string via Unicode > > s1 = unicode(s,'latin-1').encode() > > > > # The '%s' will now add u in <default encoding> > > s2 = s1 % t > > > > # Finally, convert the <default encoding> encoded string to Unicode > > u1 = unicode(s2) > > I don't expect this actually works: for example, change %s to %4s. > Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to > know that some (or all) characters in u consume multiple bytes, so can't > extract "the right" number of bytes from u. I think % formating has to know > the truth of what you're doing. Hmm, guess you're right... format parameters should indeed refer to characters rather than number of encoding bytes. This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects. > > Note that .encode() defaults to the current setting of > > <default encoding>. > > > > Provided u maps to Latin-1, an alternative would be: > > > > u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1') > > More interesting is fmt % tuple where everything is Unicode; people can muck > with Latin-1 directly today using regular strings, so the example above > mostly shows artificial convolution. ... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ? E.g. what would you get in these cases: u = u"%s %s" % (u"abc", "abc") Perhaps we need a new marker for "insert Unicode object here". -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 10:48:13 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:48:13 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk> Message-ID: <3831366D.8A09E194@lemburg.com> Andy Robinson wrote: > > On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: > > >[I'll get back on this tomorrow, just some quick notes here...] > >The Codecs provide implementations for encoding and decoding, > >they are not intended as complete wrappers for e.g. files or > >sockets. > > > >The unicodec module will define a generic stream wrapper > >(which is yet to be defined) for dealing with files, sockets, > >etc. It will use the codec registry to do the actual codec > >work. > > > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > > also assures that <mode> contains the 'b' character when needed. > > > >The Codec interface defines two pairs of methods > >on purpose: one which works internally (ie. directly between > >strings and Unicode objects), and one which works externally > >(directly between a stream and Unicode objects). > > That's the problem Guido and I are worried about. 
Your present API is > not enough to build stream encoders. The 'slurp it into a unicode > string in one go' approach fails for big files or for network > connections. And you just cannot build a generic stream reader/writer > by slicing it into strings. The solution must be specific to the > codec - only it knows how much to buffer, when to flip states etc. > > So the codec should provide proper stream reading and writing > services. I guess I'll have to rethink the Codec specs. Some leads: 1. introduce a new StreamCodec class which is designed for handling stream encoding and decoding (and supports state) 2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode imlementation would then instantiate whenever it needs to apply the conversion; since this is only needed for encodings maintaining state, the registery would only have to do the instantiation for these codecs and could use cached instances for stateless codecs. > Unicodec can then wrap those up in labour-saving ways - I'm not fussy > which but I like the one-line file-open utility. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@pythonware.com Tue Nov 16 11:38:31 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 12:38:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> > I would propose to only add some very basic encodings to > the standard distribution, e.g. the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python since this is already very close, maybe we could adopt the naming guidelines from XML: In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names. Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way. (ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...). </F> From gstein@lyra.org Tue Nov 16 11:45:48 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911160344500.2535-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fredrik Lundh wrote: >... > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). +1 (as we'd say in Apache-land... :-) -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Tue Nov 16 12:04:47 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <3830595B.348E8CC7@lemburg.com> Message-ID: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > Guido van Rossum wrote: >... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > designed to be passed cleanly through processing steps that handle > > single-byte character data, as long as they are 8-bit clean and don't > > do too much processing. > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > "8-bit clean" as you obviously did. Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8-bits long. i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters. >... > > For example, take an encryption engine. While it is defined in terms > > of byte streams, there's no requirement that the bytes represent > > characters -- they could be the bytes of a GIF file, an MP3 file, or a > > gzipped tar file. If we pass Unicode to an encryption engine, we want > > Unicode to come out at the other end, not UTF-8. (If we had wanted to > > encrypt UTF-8, we should have fed it UTF-8.) Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, following by the raw data. Done. It's even portable. >... > > Aha, I think there's a confusion about what "8-bit" means. For me, a > > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format). > > (As far as I know, C uses char* to represent multibyte characters.) > > Maybe we should disambiguate it more explicitly? We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters. > There should be some definition for the two markers and the > ideas behind them in the API guide, I guess. Certainly. [ man, I'm bad... 
I've got doc updates there and for the buffer stuff :-( ] > > > Hmm, I would strongly object to making "s#" return the internal > > > format. file.write() would then default to writing UTF-16 data > > > instead of UTF-8 data. This could result in strange errors > > > due to the UTF-16 format being endian dependent. > > > > But this was the whole design. file.write() needs to be changed to > > use s# when the file is open in binary mode and t# when the file is > > open in text mode. Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ] >... > I still don't feel very comfortable about the fact that all > existing APIs using "s#" will suddenly receive UTF-16 data if > being passed Unicode objects: this probably won't get us the > "magical" Unicode integration we invision, since "t#" usage is not > very wide spread and character handling code will probably not > work well with UTF-16 encoded strings. I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation. >... > > Because file.write() for a binary file, and other similar things > > (e.g. the encryption engine example I mentioned above) must have > > *some* way to get at the raw bits. > > What for ? How about: "because I'm the application developer, and I say that I want the raw bytes in the file." > Any lossless encoding should do the trick... UTF-8 > is just as good as UTF-16 for binary files; plus it's more compact > for ASCII data. I don't really see a need to get explicitly > at the internal data representation because both encodings are > in fact "internal" w/r to Unicode objects. > > The only argument I can come up with is that using UTF-16 for > binary files could (possibly) eliminate the UTF-8 conversion step > which is otherwise always needed. The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is: open('file', 'w').write(u) If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is: open('file', 'w').write(encode(u, 'utf-8')) This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-) Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic". Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Tue Nov 16 12:09:17 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, Guido van Rossum wrote: >... 
> > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in. Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From andy@robanal.demon.co.uk Tue Nov 16 12:18:19 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> --- "M.-A. Lemburg" <mal@lemburg.com> wrote: > So I can drop JIS ? [I won't be able to drop the > escaped unicode > codec because this is needed for u"" and ur"".] Drop Japanese from the core language. JIS0208 is a big character set with three popular encodings (Shift-JIS, EUC-JP and JIS), and a host of slight variations; it has 6879 characters, and there are a range of options a user might need to set for it to be useful. So let's assume for now this a separate package. There's a good chance I'll do it but it is not a small job. If you start statically linking in tables of 7000 characters for one Asian language, you'll have to do the lot. As for the single-byte Latin ones, a prototype Python module could be whipped up in a couple of evenings, and a tiny C function which does single-byte to double-byte mappings and vice versa could make it fast. We can have an extensible, data driven solution in no time without having to build it into the core. The way I see it, to claim that python has i18n, a serious effort is needed to ensure every major encoding in the world is available to Python users. But that's separate to the core languages. Your spec should only cover what is going to be hard-coded into Python. I'd like to see one paragraph in your spec stating that our architecture seperates the encodings themselves from the core language changes, and that getting them sorted is a logically separate (but important) project. Ideally, we could put together a separate proposal for the encoding library itself and run it by some world class experts in that field, but after yours is done. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From guido@CNRI.Reston.VA.US Tue Nov 16 13:28:42 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:28:42 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100." <383134AA.4B49D178@lemburg.com> References: <000001bf2ff4$d36e2540$042d153f@tim> <383134AA.4B49D178@lemburg.com> Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us> > ... 
hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? > > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") From the user's perspective, it should clearly return u"abc abc". > Perhaps we need a new marker for "insert Unicode object here". No, please! BTW, we also need to look at the proposal from JPython's perspective (where all strings are Unicode; I don't know if they are UTF-16 or UCS-2). It should be possible to add a small number of dummy things to JPython so that a CPython program using unicode can be run unchanged there. A minimal set seems to be: - u"..." is treated the same as "..."; and ur"..." (if accepted) is r"..." - unichr(c) is the same as chr(c) - unicode(s[,encoding]) is added - s.encode([encoding]) is added Anything I forgot? The default encoding may be tricky; it makes most sense to let the default encoding be "native" so that unicode(s) and s.encode() can return s unchanged. This can occasionally cause programs to fail that work in CPython, e.g. a program that opens a file in binary mode, reads a string from it, and converts it to unicode using the default encoding. But such programs are on thin ice already (it's always better to be explicit about encodings). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Tue Nov 16 13:45:17 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:45:17 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST." <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us> > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite some places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length because they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters. > > What for ? > > How about: "because I'm the application developer, and I say that I want > the raw bytes in the file." Here I'm with you, man! > Greg Stein, http://www.lyra.org/ --Guido van Rossum (home page: http://www.python.org/~guido/) From gward@cnri.reston.va.us Tue Nov 16 14:10:33 1999 From: gward@cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 09:10:33 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org>; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> Message-ID: <19991116091032.A4063@cnri.reston.va.us> On 16 November 1999, Greg Stein said: > This is the reason Python starts up so slow and has a large memory > footprint. There hasn't been any concern for moving stuff into shared data > pages. As a result, a process must map in a bunch of vmem pages, for no > other reason than to allocate Python structures in that memory and copy > constants in. > > Go start Perl 100 times, then do the same with Python. Python is > significantly slower. I've actually written a web app in PHP because > another one that I did in Python had slow response time. > [ yah: the Real Man Answer is to write a real/good mod_python. ] I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter: $ truss perl -e 1 2> perl.log $ truss python -c 1 2> python.log (This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.) The results are interesting, and useful despite the platform and version disparities. (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.) Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them. First, the size of log files (in lines), i.e. number of system calls: Solaris Linux IRIX[1] Perl 88 85 70 Python 425 316 257 [1] after chopping off the summary counts from the "par" output -- ie. these really are the number of system calls, not the number of lines in the log files Next, the number of "open" calls: Solaris Linux IRIX Perl 16 10 9 Python 107 71 48 (It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.) And the number of unsuccessful "open" calls: Solaris Linux IRIX Perl 6 1 3 Python 77 49 32 Number of "mmap" calls: Solaris Linux IRIX Perl 25 25 1 Python 36 24 1 ...nope, guess we can't blame mmap for any Perl/Python startup disparity. How about "brk": Solaris Linux IRIX Perl 6 11 12 Python 47 39 25 ...ok, looks like Greg's gripe about memory holds some water. Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces the startup overhead as measured by "number of system calls". Some quick timing experiments show a drastic speedup (in wall-clock time) by adding "-S": about 37% faster under Solaris, 56% faster under Linux, and 35% under IRIX. These figures should be taken with a large grain of salt, as the Linux and IRIX systems were fairly well loaded at the time, and the wall-clock results I measured had huge variance. Still, it gets the point across. Oh, also for the record, all timings were done like: perl -e 'for $i (1 .. 
100) { system "python", "-S", "-c", "1"; }' because I wanted to guarantee no shell was involved in the Python startup. Greg -- Greg Ward - software developer gward@cnri.reston.va.us Corporation for National Research Initiatives 1895 Preston White Drive voice: +1-703-620-8990 Reston, Virginia, USA 20191-5434 fax: +1-703-620-0913 From mal@lemburg.com Tue Nov 16 11:33:07 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 12:33:07 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <383140F3.EDDB307A@lemburg.com> Jack Jansen wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these encoding are pretty > ubiquitous. But maybe these should only be added on the respective platforms. Good idea. What code pages would that be ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 14:13:25 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:13:25 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.6 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> Message-ID: <38316685.7977448D@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates many things we have discussed lately, e.g. the buffer interface, "s#" vs. "t#", etc. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: · Unicode objects support for %-formatting · specifying StreamCodecs -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 12:54:51 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 13:54:51 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <3831541B.B242FFA9@lemburg.com> Fredrik Lundh wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. 
the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). >From the proposal: """ General Remarks: ---------------- · Unicode encoding names should be lower case on output and case-insensitive on input (they will be converted to lower case by all APIs taking an encoding name as input). Encoding names should follow the name conventions as used by the Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'. """ Is there a naming scheme definition for these encoding names? (The quote you gave above doesn't really sound like a definition to me.) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 13:15:19 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:15:19 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> Message-ID: <383158E7.BC574A1F@lemburg.com> Andy Robinson wrote: > > --- "M.-A. Lemburg" <mal@lemburg.com> wrote: > > So I can drop JIS ? [I won't be able to drop the > > escaped unicode > > codec because this is needed for u"" and ur"".] > > Drop Japanese from the core language. Done ... that one was easy ;-) > JIS0208 is a big character set with three popular > encodings (Shift-JIS, EUC-JP and JIS), and a host of > slight variations; it has 6879 characters, and there > are a range of options a user might need to set for it > to be useful. So let's assume for now this a separate > package. There's a good chance I'll do it but it is > not a small job. If you start statically linking in > tables of 7000 characters for one Asian language, > you'll have to do the lot. 
> > As for the single-byte Latin ones, a prototype Python > module could be whipped up in a couple of evenings, > and a tiny C function which does single-byte to > double-byte mappings and vice versa could make it > fast. We can have an extensible, data driven solution > in no time without having to build it into the core. Perhaps these helper function could be intergrated into the core to avoid compilation when adding a new codec. > The way I see it, to claim that python has i18n, a > serious effort is needed to ensure every major > encoding in the world is available to Python users. > But that's separate to the core languages. Your spec > should only cover what is going to be hard-coded into > Python. Right. > I'd like to see one paragraph in your spec stating > that our architecture seperates the encodings > themselves from the core language changes, and that > getting them sorted is a logically separate (but > important) project. Ideally, we could put together a > separate proposal for the encoding library itself and > run it by some world class experts in that field, but > after yours is done. I've added: All other encoding such as the CJK ones to support Asian scripts should be implemented in seperate packages which do not get included in the core Python distribution and are not a part of this proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 13:06:39 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:06:39 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <383156DF.2209053F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > > Guido van Rossum wrote: > >... > > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > > designed to be passed cleanly through processing steps that handle > > > single-byte character data, as long as they are 8-bit clean and don't > > > do too much processing. > > > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin@mems-exchange.org Tue Nov 16 14:35:39 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us> Greg Ward writes: >Next, the number of "open" calls: > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 Running 'python -v' explains this: amarok akuchlin>python -v # /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc # /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py import site # precompiled from /usr/local/lib/python1.5/site.pyc # /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py import os # precompiled from /usr/local/lib/python1.5/os.pyc import posix # builtin # /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc # /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py import stat # precompiled from /usr/local/lib/python1.5/stat.pyc # /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc Python 1.5.2 (#80, May 25 1999, 18:06:07) [GCC 2.8.1] on sunos5 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so And each import tries several different forms of the module name: stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4 I don't see how this is fixable, unless we strip down site.py, which drags in os, which drags in os.path and stat and UserDict. -- A.M. Kuchling http://starship.python.net/crew/amk/ I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but otherwise I'm just peachy. -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks" From guido@CNRI.Reston.VA.US Tue Nov 16 14:43:07 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 09:43:07 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100." <383156DF.2209053F@lemburg.com> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> <383156DF.2209053F@lemburg.com> Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us> > FYI, the next version of the proposal now says "s#" gives you > UTF-16 and "t#" returns UTF-8. File objects opened in text mode > will use "t#" and binary ones use "s#". Good. > I'll just use explicit u.encode('utf-8') calls if I want to write > UTF-8 to binary files -- perhaps everyone else should too ;-) You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just to LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Tue Nov 16 14:50:55 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST) Subject: [Python-Dev] just say no... 
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us> Tim Peters writes: > Yet another use for a weak reference <0.5 wink>. Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fdrake@acm.org Tue Nov 16 15:12:43 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <3830595B.348E8CC7@lemburg.com> <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us> Greg Stein writes: > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ] And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From mal@lemburg.com Tue Nov 16 14:36:54 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:36:54 +0100 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <38316C06.8B0E1D7B@lemburg.com> Greg Ward wrote: > > > Go start Perl 100 times, then do the same with Python. Python is > > significantly slower. I've actually written a web app in PHP because > > another one that I did in Python had slow response time. > > [ yah: the Real Man Answer is to write a real/good mod_python. ] > > I don't think this is the only factor in startup overhead. Try looking > into the number of system calls for the trivial startup case of each > interpreter: > > $ truss perl -e 1 2> perl.log > $ truss python -c 1 2> python.log > > (This is on Solaris; I did the same thing on Linux with "strace", and on > IRIX with "par -s -SS". Dunno about other Unices.) The results are > interesting, and useful despite the platform and version disparities. > > (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on > Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, > using the Official CNRI Python Build by Barry, and the ditto Perl build > by me; the Linux system is starship, using whatever Perl and Python the > Starship Masters provide us with; the IRIX box is an elderly but > well-maintained SGI Challenge running IRIX 5.3.) > > Also, this is with an empty PYTHONPATH. The Solaris build of Python has > different prefix and exec_prefix, but on the Linux and IRIX builds, they > are the same. 
(I think this will reflect poorly on the Solaris > version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect > startup of the trivial "1" script, so I haven't paid attention to them. For kicks I've done a similar test with cgipython, the one file version of Python 1.5.2: > First, the size of log files (in lines), i.e. number of system calls: > > Solaris Linux IRIX[1] > Perl 88 85 70 > Python 425 316 257 cgipython 182 > [1] after chopping off the summary counts from the "par" output -- ie. > these really are the number of system calls, not the number of > lines in the log files > > Next, the number of "open" calls: > > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 cgipython 33 > (It looks as though *all* of the Perl 'open' calls are due to the > dynamic linker going through /usr/lib and/or /lib.) > > And the number of unsuccessful "open" calls: > > Solaris Linux IRIX > Perl 6 1 3 > Python 77 49 32 cgipython 28 Note that cgipython does search for sitecutomize.py. > > Number of "mmap" calls: > > Solaris Linux IRIX > Perl 25 25 1 > Python 36 24 1 cgipython 13 > > ...nope, guess we can't blame mmap for any Perl/Python startup > disparity. > > How about "brk": > > Solaris Linux IRIX > Perl 6 11 12 > Python 47 39 25 cgipython 41 (?) So at least in theory, using cgipython for the intended purpose should gain some performance. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 16:00:58 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:00:58 +0100 Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <38317FBA.4F3D6B1F@lemburg.com> Here is a new proposal for the codec interface: class Codec: def encode(self,u,slice=None): """ Return the Unicode object u encoded as Python string. If slice is given (as slice object), only the sliced part of the Unicode object is encoded. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... def decode(self,s,slice=None): """ Return an equivalent Unicode object for the encoded Python string s. If slice is given (as slice object), only the sliced part of the Python string is decoded and returned as Unicode object. Note that this can cause the decoding algorithm to fail due to truncations in the encoding. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... class StreamCodec(Codec): def __init__(self,stream=None,errors='strict'): """ Creates a StreamCodec instance. stream must be a file-like object open for reading and/or writing binary data depending on the intended codec action or None. The StreamCodec may implement different error handling schemes by providing the errors argument. These parameters are known (they need not all be supported by StreamCodec subclasses): 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def write(self,u,slice=None): """ Writes the Unicode object's contents encoded to self.stream. stream must be a file-like object open for writing binary data. 
If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def read(self,length=None): """ Reads an encoded string from the stream and returns an equivalent Unicode object. If length is given, only length Unicode characters are returned (the StreamCodec instance reads as many raw bytes as needed to fulfill this requirement). Otherwise, all available data is read and decoded. """ ... the base class should provide a default implementation of this method using self.decode ... It is not required by the unicodec.register() API to provide a subclass of these base class, only the given methods must be present; this allows writing Codecs as extensions types. All Codecs must provide the .encode()/.decode() methods. Codecs having the .read() and/or .write() methods are considered to be StreamCodecs. The Unicode implementation will by itself only use the stateless .encode() and .decode() methods. All other conversion have to be done by explicitly instantiating the appropriate [Stream]Codec. -- Feel free to beat on this one ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Nov 16 16:08:49 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:08:49 +0100 Subject: [Python-Dev] just say no... References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> Message-ID: <38318191.11D93903@lemburg.com> "Fred L. Drake, Jr." wrote: > > Tim Peters writes: > > Yet another use for a weak reference <0.5 wink>. > > Those just keep popping up! I seem to recall Diane Hackborne > actually implemented these under the name "vref" long ago; perhaps > that's worth revisiting after all? (Not the implementation so much as > the idea.) I think to make it general would cost one PyObject* in > each object's structure, and some code in some constructors (maybe), > and all destructors, but not much. > Is this worth pursuing, or is it locked out of the core because of > the added space for the PyObject*? (Note that the concept isn't > necessarily useful for all object types -- numbers in particular -- > but it only makes sense to bother if it works for everything, even if > it's not very useful in some cases.) FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Tue Nov 16 16:14:06 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <38318191.11D93903@lemburg.com> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> <38318191.11D93903@lemburg.com> Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > FYI, there's mxProxy which implements a flavor of them. Look > in the standard places for mx stuff ;-) Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). 
I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Tue Nov 16 16:14:55 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes: AMK> I don't see how this is fixable, unless we strip down AMK> site.py, which drags in os, which drags in os.path and stat AMK> and UserDict. One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar). -Barry From guido@CNRI.Reston.VA.US Tue Nov 16 16:20:28 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:20:28 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100." <38317FBA.4F3D6B1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us> > It is not required by the unicodec.register() API to provide a > subclass of these base class, only the given methods must be present; > this allows writing Codecs as extensions types. All Codecs must > provide the .encode()/.decode() methods. Codecs having the .read() > and/or .write() methods are considered to be StreamCodecs. > > The Unicode implementation will by itself only use the > stateless .encode() and .decode() methods. > > All other conversion have to be done by explicitly instantiating > the appropriate [Stream]Codec. Looks okay, although I'd like someone to implement a simple shift-state-based stream codec to check this out further. I have some questions about the constructor. You seem to imply that instantiating the class without arguments creates a codec without state. That's fine. When given a stream argument, shouldn't the direction of the stream be given as an additional argument, so the proper state for encoding or decoding can be set up? I can see that for an implementation it might be more convenient to have separate classes for encoders and decoders -- certainly the state being kept is very different. Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. 
This has some advantages (although we might want to define some compatibility so it can also feed directly into a file). Perhaps someone should go ahead and implement prototype codecs using either paradigm and then write some simple apps, so we can make a better decision. In any case I think the specs codec registry API aren't on the critical path, integration of /F's basic unicode object is the first thing we need. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Tue Nov 16 16:27:53 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:27:53 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST." <14385.33535.23316.286575@anthem.cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us> > >>>>> "AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes: > > AMK> I don't see how this is fixable, unless we strip down > AMK> site.py, which drags in os, which drags in os.path and stat > AMK> and UserDict. > > One approach might be to support loading modules out of jar files (or > whatever) using Greg imputils. We could put the bootstrap .pyc files > in this jar and teach Python to import from it first. Python > installations could even craft their own modules.jar file to include > whatever modules they are willing to "hard code". This, with -S might > make Python start up much faster, at the small cost of some > flexibility (which could be regained with a c.l. switch or other > mechanism to bypass modules.jar). A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference. The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users... But if those can be solved, this could greatly reduce the number of stats and unsuccessful opens, without having to resort to jar files. --Guido van Rossum (home page: http://www.python.org/~guido/) From gmcm@hypernet.com Tue Nov 16 16:56:19 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 11:56:19 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <1269351119-9152905@hypernet.com> Barry A. Warsaw writes: > One approach might be to support loading modules out of jar files > (or whatever) using Greg imputils. We could put the bootstrap > .pyc files in this jar and teach Python to import from it first. > Python installations could even craft their own modules.jar file > to include whatever modules they are willing to "hard code". > This, with -S might make Python start up much faster, at the > small cost of some flexibility (which could be regained with a > c.l. switch or other mechanism to bypass modules.jar). Couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix, (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over. but-since-its-WIndows-it-must-be-tainted-ly y'rs - Gordon From guido@CNRI.Reston.VA.US Tue Nov 16 17:00:15 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 12:00:15 -0500 Subject: [Python-Dev] Python 1.6 status Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us> Greg Stein recently reminded me that he was holding off on 1.6 patches because he was under the impression that I wasn't accepting them yet. The situation is rather more complicated than that. There are a great deal of things that need to be done, and for many of them I'd be most happy to receive patches! For other things, however, I'm still in the requirements analysis phase, and patches might be premature (e.g., I want to redesign the import mechanisms, and while I like some of the prototypes that have been posted, I'm not ready to commit to any specific implementation). How do you know for which things I'm ready for patches? Ask me. I've tried to make lists before, and there are probably some hints in the TODO FAQ wizard as well as in the "requests" section of the Python Bugs List. Greg also suggested that I might receive more patches if I opened up the CVS tree for checkins by certain valued contributors. On the one hand I'm reluctant to do that (I feel I have a pretty good track record of checking in patches that are mailed to me, assuming I agree with them) but on the other hand there might be something to say for this, because it gives contributors more of a sense of belonging to the inner core. Of course, checkin privileges don't mean you can check in anything you like -- as in the Apache world, changes must be discussed and approved by the group, and I would like to have a veto. However once a change is approved, it's much easier if the contributor can check the code in without having to go through me all the time. A drawback may be that some people will make very forceful requests to be given checkin privileges, only to never use them; just like there are some members of python-dev who have never contributed. I definitely want to limit the number of privileged contributors to a very small number (e.g. 10-15). One additional detail is the legal side -- contributors will have to sign some kind of legal document similar to the current (wetsign.html) release form, but guiding all future contributions. I'll have to discuss this with CNRI's legal team. Greg, I understand you have checkin privileges for Apache. What is the procedure there for handing out those privileges? What is the procedure for using them? (E.g. if you made a bogus change to part of Apache you're not supposed to work on, what happens?) 
I'm hoping for several kind of responses to this email: - uncontroversial patches - questions about whether specific issues are sufficiently settled to start coding a patch - discussion threads opening up some issues that haven't been settled yet (like the current, very productive, thread in i18n) - posts summarizing issues that were settled long ago in the past, requesting reverification that the issue is still settled - suggestions for new issues that maybe ought to be settled in 1.6 - requests for checkin privileges, preferably with a specific issue or area of expertise for which the requestor will take responsibility --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin@mems-exchange.org Tue Nov 16 17:11:48 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST) Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us> Guido van Rossum writes: >I'm hoping for several kind of responses to this email: My list of things to do for 1.6 is: * Translate re.py to C and switch to the latest PCRE 2 codebase (mostly done, perhaps ready for public review in a week or so). * Go through the O'Reilly POSIX book and draw up a list of missing POSIX functions that aren't available in the posix module. This was sparked by Greg Ward showing me a Perl daemonize() function he'd written, and I realized that some of the functions it used weren't available in Python at all. (setsid() was one of them, I think.) * A while back I got approval to add the mmapfile module to the core. The outstanding issue there is that the constructor has a different interface on Unix and Windows platforms. On Windows: mm = mmapfile.mmapfile("filename", "tag name", <mapsize>) On Unix, it looks like the mmap() function: mm = mmapfile.mmapfile(<filedesc>, <mapsize>, <flags> (like MAP_SHARED), <prot> (like PROT_READ, PROT_READWRITE) ) Can we reconcile these interfaces, have two different function names, or what? >- suggestions for new issues that maybe ought to be settled in 1.6 Perhaps we should figure out what new capabilities, if any, should be added in 1.6. Fred has mentioned weak references, and there are other possibilities such as ExtensionClass. -- A.M. Kuchling http://starship.python.net/crew/amk/ Society, my dear, is like salt water, good to swim in but hard to swallow. -- Arthur Stringer, _The Silver Poppy_ From beazley@cs.uchicago.edu Tue Nov 16 17:24:24 1999 From: beazley@cs.uchicago.edu (David Beazley) Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST) Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu> Andrew M. Kuchling writes: > Guido van Rossum writes: > >I'm hoping for several kind of responses to this email: > > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) > I second this! This was one of the things I noticed when doing the Essential Reference Book. Assuming no one has done it already, I wouldn't mind volunteering to take a crack at it. 
Cheers, Dave From fdrake@acm.org Tue Nov 16 17:25:02 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us> Guido van Rossum writes: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). I think one or the other can be used, and then a wrapper that converts to the other interface. Perhaps the encoders should provide feed(), and a file-like wrapper can convert write() to feed(). It could also be done the other way; I'm not sure if it matters which is "normal." (Or perhaps feed() was badly named and should be write()? The general intent was a little different, I think, but an output file is very much a stream consumer.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From akuchlin@mems-exchange.org Tue Nov 16 17:32:41 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST) Subject: [Python-Dev] mmapfile module In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us> Guido van Rossum writes: >Hm, this seems to require a higher-level Python module to hide the >differences. Maybe the Unix version could also use a filename? I >would think that mmap'ed files should always be backed by a file (not >by a pipe, socket etc.). Or is there an issue with secure creation of >temp files? This is a question for a separate thread. Hmm... I don't know of any way to use mmap() on non-file things, either; there are odd special cases, like using MAP_ANONYMOUS on /dev/zero to allocate memory, but that's still using a file. On the other hand, there may be some special case where you need to do that. We could add a fileno() method to get the file descriptor, but I don't know if that's useful to Windows. (Is Sam Rushing, the original author of the Win32 mmapfile, on this list?) What do we do about the tagname, which is a Win32 argument that has no Unix counterpart -- I'm not even sure what its function is. -- A.M. Kuchling http://starship.python.net/crew/amk/ I had it in me to be the Pierce Brosnan of my generation. -- Vincent Me's past career plans in EGYPT #1 From mal@lemburg.com Tue Nov 16 17:53:46 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 18:53:46 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <38319A2A.4385D2E7@lemburg.com> Guido van Rossum wrote: > > > It is not required by the unicodec.register() API to provide a > > subclass of these base class, only the given methods must be present; > > this allows writing Codecs as extensions types. All Codecs must > > provide the .encode()/.decode() methods. Codecs having the .read() > > and/or .write() methods are considered to be StreamCodecs. 
> > > > The Unicode implementation will by itself only use the > > stateless .encode() and .decode() methods. > > > > All other conversion have to be done by explicitly instantiating > > the appropriate [Stream]Codec. > > Looks okay, although I'd like someone to implement a simple > shift-state-based stream codec to check this out further. > > I have some questions about the constructor. You seem to imply > that instantiating the class without arguments creates a codec without > state. That's fine. When given a stream argument, shouldn't the > direction of the stream be given as an additional argument, so the > proper state for encoding or decoding can be set up? I can see that > for an implementation it might be more convenient to have separate > classes for encoders and decoders -- certainly the state being kept is > very different. Wouldn't it be possible to have the read/write methods set up the state when called for the first time ? Note that I wrote ".read() and/or .write() methods" in the proposal on purpose: you can of course implement Codecs which only implement one of them, i.e. Readers and Writers. The registry doesn't care about them anyway :-) Then, if you use a Reader for writing, it will result in an AttributeError... > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). AFAIK, .feed() and .finalize() (or .close() etc.) have a different backgound: you add data in chunks and then process it at some final stage rather than for each feed. This is often more efficient. With respest to codecs this would mean, that you buffer the output in memory, first doing only preliminary operations on the feeds and then apply some final logic to the buffer at the time .finalize() is called. We could define a StreamCodec subclass for this kind of operation. > Perhaps someone should go ahead and implement prototype codecs using > either paradigm and then write some simple apps, so we can make a > better decision. > > In any case I think the specs codec registry API aren't on the > critical path, integration of /F's basic unicode object is the first > thing we need. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward@cnri.reston.va.us Tue Nov 16 17:54:06 1999 From: gward@cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 12:54:06 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> Message-ID: <19991116125405.B4063@cnri.reston.va.us> On 16 November 1999, Guido van Rossum said: > A completely different approach (which, incidentally, HP has lobbied > for before; and which has been implemented by Sjoerd Mullender for one > particular application) would be to cache a mapping from module names > to filenames in a dbm file. 
For Sjoerd's app (which imported hundreds > of modules) this made a huge difference. Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening? > The problem is that it's > hard to deal with issues like updating the cache while sharing it with > other processes and even other users... Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm... Greg From petrilli@amber.org Tue Nov 16 18:04:46 1999 From: petrilli@amber.org (Christopher Petrilli) Date: Tue, 16 Nov 1999 13:04:46 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us> Message-ID: <19991116130446.A3068@trump.amber.org> Greg Ward [gward@cnri.reston.va.us] wrote: > On 16 November 1999, Guido van Rossum said: > > A completely different approach (which, incidentally, HP has lobbied > > for before; and which has been implemented by Sjoerd Mullender for one > > particular application) would be to cache a mapping from module names > > to filenames in a dbm file. For Sjoerd's app (which imported hundreds > > of modules) this made a huge difference. > > Hey, this could be a big win for Zope startup. Dunno how much of that > 20-30 sec startup overhead is due to loading modules, but I'm sure it's > a sizeable percentage. Any Zope-heads listening? Wow, that's a huge start up that I've personally never seen. I can't imagine... even loading the Oracle libraries dynamically, which are HUGE (2Mb or so), it's only a couple seconds. > > The problem is that it's > > hard to deal with issues like updating the cache while sharing it with > > other processes and even other users... > > Probably not a concern in the case of Zope: one installation, one > process, only gets started when it's explicitly shut down and > restarted. HmmmMMMMmmm... This doesn't reslve a lot of other users of Python howver... and Zope would always benefit, especially when you're running multiple instances on th same machine... would perhaps share more code. Chris -- | Christopher Petrilli | petrilli@amber.org From gmcm@hypernet.com Tue Nov 16 18:04:41 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 13:04:41 -0500 Subject: [Python-Dev] mmapfile module In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us> References: <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <1269347016-9399681@hypernet.com> Andrew M. Kuchling wrote: > Hmm... I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. On > the other hand, there may be some special case where you need to > do that. We could add a fileno() method to get the file > descriptor, but I don't know if that's useful to Windows. (Is > Sam Rushing, the original author of the Win32 mmapfile, on this > list?) 
> > What do we do about the tagname, which is a Win32 argument that > has no Unix counterpart -- I'm not even sure what its function > is. On Windows, a mmap is always backed by disk (swap space), but is not necessarily associated with a (user-land) file. The tagname is like the "name" associated with a semaphore; two processes opening the same tagname get shared memory. Fileno (in the c runtime sense) would be useless on Windows. As with all Win32 resources, there's a "handle", which is analagous. But different enough, it seems to me, to confound any attempts at a common API. Another fundamental difference (IIRC) is that Windows mmap's can be resized on the fly. - Gordon From guido@CNRI.Reston.VA.US Tue Nov 16 18:09:43 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 13:09:43 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100." <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us> > > I have some questions about the constructor. You seem to imply > > that instantiating the class without arguments creates a codec without > > state. That's fine. When given a stream argument, shouldn't the > > direction of the stream be given as an additional argument, so the > > proper state for encoding or decoding can be set up? I can see that > > for an implementation it might be more convenient to have separate > > classes for encoders and decoders -- certainly the state being kept is > > very different. > > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? Hm, I'd rather be explicit. We don't do this for files either. > Note that I wrote ".read() and/or .write() methods" in the proposal > on purpose: you can of course implement Codecs which only implement > one of them, i.e. Readers and Writers. The registry doesn't care > about them anyway :-) > > Then, if you use a Reader for writing, it will result in an > AttributeError... > > > Also, I don't want to ignore the alternative interface that was > > suggested by /F. It uses feed() similar to htmllib c.s. This has > > some advantages (although we might want to define some compatibility > > so it can also feed directly into a file). > > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more > efficient. > > With respest to codecs this would mean, that you buffer the > output in memory, first doing only preliminary operations on > the feeds and then apply some final logic to the buffer at > the time .finalize() is called. This is part of the purpose, yes. > We could define a StreamCodec subclass for this kind of operation. The difference is that to decode from a file, your proposed interface is to call read() on the codec which will in turn call read() on the stream. In /F's version, I call read() on the stream (geting multibyte encoded data), feed() that to the codec, which in turn calls feed() to some other back end -- perhaps another codec which in turn feed()s its converted data to another file, perhaps an XML parser. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Tue Nov 16 18:16:42 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? That slows things down; the constructor should handle initialization. Perhaps what gets registered should be: encoding function, decoding function, stream encoder factory (can be a class), stream decoder factory (again, can be a class). These can be encapsulated either before or after hitting the registry, and can be None. The registry can provide default implementations from what is provided (stream handlers from the functions, or functions from the stream handlers) as required. Ideally, I should be able to write a module with four well-known entry points and then provide the module object itself as the registration entry. Or I could construct a new object that has the right interface and register that if it made more sense for the encoding. > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more Many of the classes that provide feed() do as much work as possible as data is fed into them (see htmllib.HTMLParser); this structure is commonly used to support asynchronous operation. > With respest to codecs this would mean, that you buffer the > output in memory, first doing only preliminary operations on > the feeds and then apply some final logic to the buffer at > the time .finalize() is called. That depends on the encoding. I'd expect it to feed encoded data to a sink as quickly as it could and let the target decide what needs to happen. If buffering is needed, the target could be a StringIO or whatever. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fredrik@pythonware.com Tue Nov 16 19:32:21 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:32:21 +0100 Subject: [Python-Dev] mmapfile module References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us> Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com> > Hmm... I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. but that's not always the case -- OSF/1 supports truly anonymous mappings, for example. in fact, it bombs if you use ANONYMOUS with a file handle: $ man mmap ... If MAP_ANONYMOUS is set in the flags parameter: + A new memory region is created and initialized to all zeros. This memory region can be shared only with descendents of the current process. + If the filedes parameter is not -1, the mmap() function fails. ... (btw, doing anonymous maps isn't exactly an odd special case under this operating system; it's the only memory-allocation mechanism provided by the kernel...)
</F> From fredrik@pythonware.com Tue Nov 16 19:33:52 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:33:52 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some > compatibility so it can also feed directly into a file). seeing this made me switch on my brain for a moment, and recall how things are done in PIL (which is, as I've bragged about before, another library with an internal format, and many possible external encodings). among other things, PIL lets you read and write images to both ordinary files and arbitrary file objects, but it also lets you incrementally decode images by feeding it chunks of data (through ImageFile.Parser). and it's fast -- it has to be, since images tends to contain lots of pixels... anyway, here's what I came up with (code will follow, if someone's interested). -------------------------------------------------------------------- A PIL-like Unicode Codec Proposal -------------------------------------------------------------------- In the PIL model, the codecs are called with a piece of data, and returns the result to the caller. The codecs maintain internal state when needed. class decoder: def decode(self, s, offset=0): # decode as much data as we possibly can from the # given string. if there's not enough data in the # input string to form a full character, return # what we've got this far (this might be an empty # string). def flush(self): # flush the decoding buffers. this should usually # return None, unless the fact that knowing that the # input stream has ended means that the state can be # interpreted in a meaningful way. however, if the # state indicates that there last character was not # finished, this method should raise a UnicodeError # exception. class encoder: def encode(self, u, offset=0, buffersize=0): # encode data from the given offset in the input # unicode string into a buffer of the given size # (or slightly larger, if required to proceed). # if the buffer size is 0, the decoder is free # to pick a suitable size itself (if at all # possible, it should make it large enough to # encode the entire input string). returns a # 2-tuple containing the encoded data, and the # number of characters consumed by this call. def flush(self): # flush the encoding buffers. returns an ordinary # string (which may be empty), or None. Note that a codec instance can be used for a single string; the codec registry should hold codec factories, not codec instances. In addition, you may use a single type or class to implement both interfaces at once. 
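To make the decoder half of this interface concrete, here is a toy sketch (not taken from the original post: the shift encoding and every name in it are invented, and it uses present-day Python rather than 1.5.2 syntax). It models an ISO-2022-style encoding in which byte 0x0E shifts into a double-byte mode and 0x0F shifts back to single-byte ASCII -- roughly the kind of shift-state codec Guido asked to see prototyped:

    # Illustrative sketch only; not code from the original proposal.
    SO, SI = 0x0E, 0x0F   # shift-out (enter double-byte mode), shift-in (leave it)

    class toy_shift_decoder:
        def __init__(self):
            self.wide = 0        # currently in double-byte mode?
            self.pending = None  # first byte of a pair split across decode() calls

        def decode(self, s, offset=0):
            out = []
            for b in bytearray(s[offset:]):
                if self.wide:
                    if self.pending is not None:
                        out.append(chr(self.pending * 256 + b))
                        self.pending = None
                    elif b == SI:
                        self.wide = 0
                    else:
                        self.pending = b
                elif b == SO:
                    self.wide = 1
                else:
                    out.append(chr(b))
            return "".join(out)

        def flush(self):
            # per the interface above: complain if the last character was unfinished
            if self.pending is not None:
                raise UnicodeError("input ended in the middle of a double-byte character")
            return None

    # State survives between calls, so chunk boundaries do not matter:
    # d = toy_shift_decoder()
    # d.decode(b"AB\x0e\x04") + d.decode(b"\x10\x0fC")  ->  "AB" + chr(0x0410) + "C"
    # d.flush()
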
-------------------------------------------------------------------- Use Cases -------------------------------------------------------------------- A null decoder: class decoder: def decode(self, s, offset=0): return s[offset:] def flush(self): pass A null encoder: class encoder: def encode(self, s, offset=0, buffersize=0): if buffersize: s = s[offset:offset+buffersize] else: s = s[offset:] return s, len(s) def flush(self): pass Decoding a string: def decode(s, encoding) c = registry.getdecoder(encoding) u = c.decode(s) t = c.flush() if not t: return u return u + t # not very common Encoding a string: def encode(u, encoding) c = registry.getencoder(encoding) p = [] o = 0 while o < len(u): s, n = c.encode(u, o) p.append(s) o = o + n if len(p) == 1: return p[0] return string.join(p, "") # not very common Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example). --- end of proposal From fredrik@pythonware.com Tue Nov 16 19:37:40 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:37:40 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com> > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) $ python Python 1.5.2 (#1, Aug 23 1999, 14:42:39) [GCC 2.7.2.3] on linux2 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> import os >>> os.setsid <built-in function setsid> </F> From mhammond@skippinet.com.au Tue Nov 16 21:54:15 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 08:54:15 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat> [Andy writes:] > Leave JISXXX and the CJK stuff out. If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there [Then Marc relpies:] > 2. give more information to the unicodec registry: > one could register classes instead of instances which the Unicode [Jack chimes in with:] > I would suggest adding the Dos, Windows and Macintosh > standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these > encoding are pretty > ubiquitous. But maybe these should only be added on the > respective platforms. [And the conversation twisted around to Greg noting:] > Next, the number of "open" calls: > > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 This is leading me to conclude that our "codec registry" should be the file system, and Python modules. Would it be possible to define a "standard package" called "encodings", and when we need an encoding, we simply attempt to load a module from that package? The key benefits I see are: * No need to load modules simply to register a codec (which would make the number of open calls even higher, and the startup time even slower.) This makes it truly demand-loading of the codecs, rather than explicit load-and-register. * Making language specific distributions becomes simple - simply select a different set of modules from the "encodings" directory. 
The Python source distribution has them all, but (say) the Windows binary installer selects only a few. The Japanese binary installer for Windows installs a few more. * Installing new codecs becomes trivial - no need to hack site.py etc - simply copy the new "codec module" to the encodings directory and you are done. * No serious problem for GMcM's installer nor for freeze We would probably need to assume that certain codes exist for _all_ platforms and language - but this is no different to assuming that "exceptions.py" also exists for all platforms. Is this worthy of consideration? Mark. From andy@robanal.demon.co.uk Wed Nov 17 00:14:06 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:14:06 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Message-ID: <3836f28c.4929177@post.demon.co.uk> On Tue, 16 Nov 1999 09:39:20 +0100, you wrote: >1) codes written according to the "data > consumer model", instead of the "stream" > model. > > class myDecoder: > def __init__(self, target): > self.target = target > self.state = ... > def feed(self, data): > ... extract as much data as possible ... > self.target.feed(extracted data) > def close(self): > ... extract what's left ... > self.target.feed(additional data) > self.target.close() > Apart from feed() instead of write(), how is that different from a Java-like Stream writer as Guido suggested? He said: >Andy's file translation example could then be written as follows: > ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(f) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > >f2.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) - Andy From gstein@lyra.org Wed Nov 17 02:03:21 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST) Subject: [Python-Dev] shared data In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <Pine.LNX.4.10.9911161756290.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Gordon McMillan wrote: > Barry A. Warsaw writes: > > One approach might be to support loading modules out of jar files > > (or whatever) using Greg imputils. We could put the bootstrap > > .pyc files in this jar and teach Python to import from it first. > > Python installations could even craft their own modules.jar file > > to include whatever modules they are willing to "hard code". > > This, with -S might make Python start up much faster, at the > > small cost of some flexibility (which could be regained with a > > c.l. switch or other mechanism to bypass modules.jar). > > Couple hundred Windows users have been doing this for > months (http://starship.python.net/crew/gmcm/install.html). 
> The .pyz files are cross-platform, although the "embedding" > app would have to be redone for *nix, (and all the embedding > really does is keep Python from hunting all over your disk). > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. I've got a patch from Jim Ahlstrom to provide a "standardized" library file. I've got to review and fold that thing in (I'll post here when that is done). As Gordon states: yes, the startup time is considerably improved. The DBM approach is interesting. That could definitely be used thru an imputils Importer; it would be quite interesting to try that out. (Note that the library style approach would be even harder to deal with updates, relative to what Sjoerd saw with the DBM approach; I would guess that the "right" approach is to rebuild the library from scratch and atomically replace the thing (but that would bust people with open references...)) Certainly something to look at. Cheers, -g p.s. I also want to try mmap'ing a library and creating code objects that use PyBufferObjects (rather than PyStringObjects) that refer to portions of the mmap. Presuming the mmap is shared, there "should" be a large reduction in heap usage. Question is that I don't know the proportion of code bytes to other heap usage caused by loading a .pyc. p.p.s. I also want to try the buffer approach for frozen code. -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed Nov 17 02:29:42 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911161821230.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). These can be encapsulated either > before or after hitting the registry, and can be None. The registry I'm with Fred here; he beat me to the punch (and his email is better than what I'd write anyhow :-). I'd like to see the API be *functions* rather than a particular class specification. If the spec is going to say "do not alter/store state", then a function makes much more sense than a method on an object. Of course, bound method objects could be registered. This might occur if you have a general JIS encode/decoder but need to instantiate it a little differently for each JIS variant. (Andy also mentioned something about "options" in JIS encoding/decoding) > and provide default implementations from what is provided (stream > handlers from the functions, or functions from the stream handlers) as > required. Excellent idea... "I'll provide the encode/decode functions, but I don't have a spiffy algorithm for streaming -- please provide a stream wrapper for my functions." > Ideally, I should be able to write a module with four well-known > entry points and then provide the module object itself as the > registration entry. Or I could construct a new object that has the > right interface and register that if it made more sense for the > encoding. 
Mark's idea about throwing these things into a package for on-demand registrations is much better than a "register-beforehand" model. When the module is loaded from the package, it calls a registration function to insert its 4-tuple of registration data. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed Nov 17 02:40:07 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Mark Hammond wrote: >... > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: >... > Is this worthy of consideration? Absolutely! You will need to provide a way for a module (in the "codec" package) to state *beforehand* that it should be loaded for the X, Y, and Z encodings. This might be in terms of little "info" files that get dropped into the package. The __init__.py module scans the directory for the info files and loads them to build an encoding => module-name mapping. The alternative would be to have stub modules like: iso-8859-1.py: import unicodec def encode_1(...) ... def encode_2(...) ... ... unicodec.register('iso-8859-1', encode_1, decode_1) unicodec.register('iso-8859-2', encode_2, decode_2) ... iso-8859-2.py: import iso-8859-1 I believe that encoding names are legitimate file names, but they aren't necessarily Python identifiers. That kind of bungs up "import codec.iso-8859-1". The codec package would need to programmatically import the modules. Clients should not be directly importing the modules, so I don't see a difficult here. [ if we do decide to allow clients access to the modules, then maybe they have to arrive through a "helper" module that has a nice name, or the codec package provides a "module = code.load('iso-8859-1')" idiom. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From mhammond@skippinet.com.au Wed Nov 17 02:57:48 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 13:57:48 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat> > You will need to provide a way for a module (in the "codec" > package) to > state *beforehand* that it should be loaded for the X, Y, and ... > The alternative would be to have stub modules like: Actually, I was thinking even more radically - drop the codec registry all together, and use modules with "well-known" names (a slight precedent, but Python isnt adverse to well-known names in general) eg: iso-8859-1.py: import unicodec def encode(...): ... def decode(...): ... iso-8859-2.py: from iso-8859-1 import * The codec registry then is trivial, and effectively does not exist (cant get much more trivial than something that doesnt exist :-): def getencoder(encoding): mod = __import__( "encodings." + encoding ) return getattr(mod, "encode") > I believe that encoding names are legitimate file names, but > they aren't > necessarily Python identifiers. That kind of bungs up "import > codec.iso-8859-1". 
Agreed - clients should never need to import them, and codecs that wish to import other codes could use "__import__" Of course, I am not adverse to the idea of a registry as well and having the modules manually register themselves - but it doesnt seem to buy much, and the logic for getting a codec becomes more complex - ie, it needs to determine the module to import, then look in the registry - if it needs to determine the module anyway, why not just get it from the module and be done with it? Mark. From andy@robanal.demon.co.uk Wed Nov 17 00:18:22 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:18:22 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <3837f379.5166829@post.demon.co.uk> On Wed, 17 Nov 1999 08:54:15 +1100, you wrote: >This is leading me to conclude that our "codec registry" should be the >file system, and Python modules. > >Would it be possible to define a "standard package" called >"encodings", and when we need an encoding, we simply attempt to load a >module from that package? The key benefits I see are: [snip] >Is this worthy of consideration? Exactly what I am aiming for. The real icing on the cake would be a small state machine or some helper functions in C which made it possible to write fast codecs in pure Python, but that can come a bit later when we have examples up and running. - Andy From andy@robanal.demon.co.uk Wed Nov 17 00:08:01 1999 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:08:01 GMT Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim> References: <000601bf2ff7$4d8a4c80$042d153f@tim> Message-ID: <3834f142.4599884@post.demon.co.uk> On Tue, 16 Nov 1999 00:56:18 -0500, you wrote: >[Andy Robinson] >> ... >> I presume no one is actually advocating dropping >> ordinary Python strings, or the ability to do >> rawdata = open('myfile.txt', 'rb').read() >> without any transformations? > >If anyone has advocated either, they've successfully hidden it from me. >Anyone? Well, I hear statements looking forward to when all string-handling is done in Unicode internally. This scares the hell out of me - it is what VB does and that bit us badly on simple stream operations. For encoding work, you will always need raw strings, and often need Unicode ones. - Andy From tim_one@email.msn.com Wed Nov 17 07:33:06 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 02:33:06 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <383134AA.4B49D178@lemburg.com> Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim> [MAL] > ... > This means a new PyUnicode_Format() implementation mapping > Unicode format objects to Unicode objects. It's a bitch, isn't it <0.5 wink>? I hope they're paying you a lot for this! > ... hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? Anything other than taking the Unicode characters as-is would be incomprehensible. I mean, it's a Unicode format string sucking up Unicode strings -- what else could possibly make *sense*? > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") That u"abc" gets substituted as-is seems screamingly necessary to me. I'm more baffled about what "abc" should do. I didn't understand the t#/s# etc arguments, and how those do or don't relate to what str() does. 
On the face of it, the idea that a gazillion and one distinct encodings all get lumped into "a string object" without remembering their nature makes about as much sense as if Python were to treat all instances of all user-defined classes as being of a single InstanceType type <wink> -- except in the latter case you at least get a __class__ attribute to find your way home again. As an ignorant user, I would hope that u"%s" % string had enough sense to know what string's encoding is all on its own, and promote it correctly to Unicode by magic. > Perhaps we need a new marker for "insert Unicode object here". %s means string, and at this level a Unicode object *is* "a string". If this isn't obvious, it's likely because we're too clever about what non-Unicode string objects do in this context. From andy@robanal.demon.co.uk Wed Nov 17 07:53:53 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> --- Mark Hammond <mhammond@skippinet.com.au> wrote: > Actually, I was thinking even more radically - drop > the codec registry > all together, and use modules with "well-known" > names (a slight > precedent, but Python isnt adverse to well-known > names in general) > > eg: > iso-8859-1.py: > > import unicodec > def encode(...): > ... > def decode(...): > ... > > iso-8859-2.py: > from iso-8859-1 import * > This is the simplest if each codec really is likely to be implemented in a separate module. But just look at the data! All the iso-8859 encodings need identical functionality, and just have a different mapping table with 256 elements. It would be trivial to implement these in one module. And the wide variety of Japanese encodings (mostly corporate or historical variants of the same character set) are again best treated from one code base with a bunch of mapping tables and routines to generate the variants - basically one can store the deltas. So the choice is between possibly having a lot of almost-dummy modules, or having Python modules which generate and register a logical family of encodings. I may have some time next week and will try to code up a few so we can pound on something. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From andy@robanal.demon.co.uk Wed Nov 17 07:58:23 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST) Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com> --- Tim Peters <tim_one@email.msn.com> wrote: > I'm more baffled about what "abc" should do. I > didn't understand the t#/s# > etc arguments, and how those do or don't relate to > what str() does. On the > face of it, the idea that a gazillion and one > distinct encodings all get > lumped into "a string object" without remembering > their nature makes about > as much sense as if Python were to treat all > instances of all user-defined > classes as being of a single InstanceType type > <wink> -- except in the > latter case you at least get a __class__ attribute > to find your way home > again. Well said. 
When the core stuff is done, I'm going to implement a set of "TypedString" helper routines which will remember what they are encoded in and won't let you abuse them by concatenating or otherwise mixing different encodings. If you are consciously working with multi-encoding data, this higher level of abstraction is really useful. But I reckon that can be done in pure Python (just overload '%;, '+' etc. with some encoding checks). - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal@lemburg.com Wed Nov 17 10:03:59 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:03:59 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf30d3$cb2cb240$a42d153f@tim> Message-ID: <38327D8F.7A5352E6@lemburg.com> Tim Peters wrote: > > [MAL] > > ...demo script... > > It looks like > > r'\\u0000' > > will get translated into a 2-character Unicode string. Right... > That's probably not > good, if for no other reason than that Java would not do this (it would > create the obvious 7-character Unicode string), and having something that > looks like a Java escape that doesn't *work* like the Java escape will be > confusing as heck for JPython users. Keeping track of even-vs-odd number of > backslashes can't be done with a regexp search, but is easy if the code is > simple <wink>: > ...Tim's version of the demo... Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme: With the 'unicode-escape' encoding being defined as: · all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061). · all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF. · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u. Examples: u'abc' -> U+0061 U+0062 U+0063 u'\u1234' -> U+1234 u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c Now how should we define ur"abc\u1234\n" ... ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one@email.msn.com Wed Nov 17 09:31:27 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:31:27 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <000801bf30de$85bea500$a42d153f@tim> [Guido] > ... > I'm hoping for several kind of responses to this email: > ... > - requests for checkin privileges, preferably with a specific issue > or area of expertise for which the requestor will take responsibility. I'm specifically requesting not to have checkin privileges. So there. I see two problems: 1. When patches go thru you, you at least eyeball them. This catches bugs and design errors early. 2. For a multi-platform app, few people have adequate resources for testing; e.g., I can test under an obsolete version of Win95, and NT if I have to, but that's it. 
You may not actually do better testing than that, but having patches go thru you allows me the comfort of believing you do <wink>. From mal@lemburg.com Wed Nov 17 10:11:05 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:11:05 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <38327F39.AA381647@lemburg.com> Mark Hammond wrote: > > This is leading me to conclude that our "codec registry" should be the > file system, and Python modules. > > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: > > * No need to load modules simply to register a codec (which would make > the number of open calls even higher, and the startup time even > slower.) This makes it truly demand-loading of the codecs, rather > than explicit load-and-register. > > * Making language specific distributions becomes simple - simply > select a different set of modules from the "encodings" directory. The > Python source distribution has them all, but (say) the Windows binary > installer selects only a few. The Japanese binary installer for > Windows installs a few more. > > * Installing new codecs becomes trivial - no need to hack site.py > etc - simply copy the new "codec module" to the encodings directory > and you are done. > > * No serious problem for GMcM's installer nor for freeze > > We would probably need to assume that certain codes exist for _all_ > platforms and language - but this is no different to assuming that > "exceptions.py" also exists for all platforms. > > Is this worthy of consideration? Why not... using the new registry scheme I proposed in the thread "Codecs and StreamCodecs" you could implement this via factory_functions and lazy imports (with the encoding name folded to make up a proper Python identifier, e.g. hyphens get converted to '' and spaces to '_'). I'd suggest grouping encodings: [encodings] [iso} [iso88591] [iso88592] [jis] ... [cyrillic] ... [misc] The unicodec registry could then query encodings.get(encoding,action) and the package would take care of the rest. Note that the "walk-me-up-scotty" import patch would probably be nice in this situation too, e.g. to reach the modules in [misc] or in higher levels such the ones in [iso] from [iso88591]. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 17 09:29:34 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:29:34 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Message-ID: <3832757E.B9503606@lemburg.com> Fredrik Lundh wrote: > > -------------------------------------------------------------------- > A PIL-like Unicode Codec Proposal > -------------------------------------------------------------------- > > In the PIL model, the codecs are called with a piece of data, and > returns the result to the caller. The codecs maintain internal state > when needed. > > class decoder: > > def decode(self, s, offset=0): > # decode as much data as we possibly can from the > # given string. 
if there's not enough data in the > # input string to form a full character, return > # what we've got this far (this might be an empty > # string). > > def flush(self): > # flush the decoding buffers. this should usually > # return None, unless the fact that knowing that the > # input stream has ended means that the state can be > # interpreted in a meaningful way. however, if the > # state indicates that there last character was not > # finished, this method should raise a UnicodeError > # exception. Could you explain for reason for having a .flush() method and what it should return. Note that the .decode method is not so much different from my Codec.decode method except that it uses a single offset where my version uses a slice (the offset is probably the better variant, because it avoids data truncation). > class encoder: > > def encode(self, u, offset=0, buffersize=0): > # encode data from the given offset in the input > # unicode string into a buffer of the given size > # (or slightly larger, if required to proceed). > # if the buffer size is 0, the decoder is free > # to pick a suitable size itself (if at all > # possible, it should make it large enough to > # encode the entire input string). returns a > # 2-tuple containing the encoded data, and the > # number of characters consumed by this call. Dito. > def flush(self): > # flush the encoding buffers. returns an ordinary > # string (which may be empty), or None. > > Note that a codec instance can be used for a single string; the codec > registry should hold codec factories, not codec instances. In > addition, you may use a single type or class to implement both > interfaces at once. Perhaps I'm missing something, but how would you define stream codecs using this interface ? > Implementing stream codecs is left as an exercise (see the zlib > material in the eff-bot guide for a decoder example). ...? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 17 09:55:05 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:55:05 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <38327B79.2415786B@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). Guido proposed the factory approach too, though not seperated into these 4 APIs (note that your proposal looks very much like what I had in the early version of my proposal). Anyway, I think that factory functions are the way to go, because they offer more flexibility w/r to reusing already instantiated codecs, importing modules on-the-fly as was suggested in another thread (thereby making codec module import lazy) or mapping encoder and decoder requests all to one class. So here's a new registry approach: unicodec.register(encoding,factory_function,action) with encoding - name of the supported encoding, e.g. 
Shift_JIS factory_function - a function that returns an object or function ready to be used for action action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read' The factory_function API depends on the implementation of the codec. The returned object's interface on the value of action: Codecs: ------- obj = factory_function_for_<action>(errors='strict') 'encode': obj(u,slice=None) -> Python string 'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed) factory_functions are free to return simple function objects for stateless encodings. StreamCodecs: ------------- obj = factory_function_for_<action>(stream,errors='strict') obj should provide access to all methods defined for the stream object, overriding these: 'stream write': obj.write(u,slice=None) -> bytes written to stream obj.flush() -> ??? 'stream read': obj.read(chunksize=0) -> (Unicode object, bytes read) obj.flush() -> ??? errors is defined like in my Codec spec. The codecs are expected to use this argument to handle error conditions. I'm not sure what Fredrik intended with the .flush() methods, so the definition is still open. I would expect it to do some finalization of state. Perhaps we need another set of actions for the .feed()/.close() approach... As in earlier version of the proposal: The registry should provide default implementations for missing action factory_functions using the other registered functions, e.g. 'stream write' can be emulated using 'encode' and 'stream read' using 'decode'. The same probably holds for feed approach. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one@email.msn.com Wed Nov 17 08:14:38 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:14:38 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <3831350B.8F69CB6D@lemburg.com> Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim> [MAL] > ... > Here is a sample implementation of what I had in mind: > > """ Demo for 'unicode-escape' encoding. > """ > import struct,string,re > > pack_format = '>H' > > def convert_string(s): > > l = map(None,s) > for i in range(len(l)): > l[i] = struct.pack(pack_format,ord(l[i])) > return l > > u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') > > def unicode_unescape(s): > > l = [] > start = 0 > while start < len(s): > m = u_escape.search(s,start) > if not m: > l[len(l):] = convert_string(s[start:]) > break > m_start,m_end = m.span() > if m_start > start: > l[len(l):] = convert_string(s[start:m_start]) > hexcode = m.group(1) > #print hexcode,start,m_start > if len(hexcode) != 4: > raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode > ordinal = string.atoi(hexcode,16) > l.append(struct.pack(pack_format,ordinal)) > start = m_end > #print l > return string.join(l,'') > > def hexstr(s,sep=''): > > return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % > ord(x),s),sep) It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. 
Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>: def unicode_unescape(s): from string import atoi import array i, n = 0, len(s) result = array.array('H') # unsigned short, native order while i < n: ch = s[i] i = i+1 if ch != "\\": result.append(ord(ch)) continue if i == n: raise ValueError("string ends with lone backslash") ch = s[i] i = i+1 if ch != "u": result.append(ord("\\")) result.append(ord(ch)) continue hexchars = s[i:i+4] if len(hexchars) != 4: raise ValueError("\\u escape at end not followed by " "at least 4 characters") i = i+4 for ch in hexchars: if ch not in "01234567890abcdefABCDEF": raise ValueError("\\u" + hexchars + " contains " "non-hex characters") result.append(atoi(hexchars, 16)) # print result return result.tostring() From tim_one@email.msn.com Wed Nov 17 08:47:48 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:48 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <383156DF.2209053F@lemburg.com> Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim> [MAL] > FYI, the next version of the proposal ... > File objects opened in text mode will use "t#" and binary ones use "s#". Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>. From tim_one@email.msn.com Wed Nov 17 08:47:46 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:46 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <383140F3.EDDB307A@lemburg.com> Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim> [Jack Jansen] > I would suggest adding the Dos, Windows and Macintosh standard > 8-bit charsets (their equivalents of latin-1) too, as documents > in these encoding are pretty ubiquitous. But maybe these should > only be added on the respective platforms. [MAL] > Good idea. What code pages would that be ? I'm not clear on what's being suggested; e.g., Windows supports *many* different "code pages". CP 1252 is default in the U.S., and is an extension of Latin-1. See e.g. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT which appears to be up-to-date (has 0x80 as the euro symbol, Unicode U+20AC -- although whether your version of U.S. Windows actually has this depends on whether you installed the service pack that added it!). See ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT for the closest DOS got. From tim_one@email.msn.com Wed Nov 17 09:05:21 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:05:21 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us> Message-ID: <000601bf30da$e069d820$a42d153f@tim> [Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred] > Yes, but still not in the core. So we have two general examples > (vrefs and mxProxy) and there's WeakDict (or something like that). I > think there really needs to be a core facility for this. This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. 
Buy him lunch tomorrow and get him excited <wink>. From tim_one@email.msn.com Wed Nov 17 09:10:24 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:10:24 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <000701bf30db$94d4ac40$a42d153f@tim> [Gordon McMillan] > ... > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. That's truly remarkable (he says while waiting for the Inbox Repair Tool to finish repairing his 50Mb Outlook mail file ...)! > but-since-its-WIndows-it-must-be-tainted-ly y'rs Indeed -- if it runs on Windows, it's a worthless piece o' crap <wink>. From fredrik@pythonware.com Wed Nov 17 11:00:10 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:00:10 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> M.-A. Lemburg <mal@lemburg.com> wrote: > > def flush(self): > > # flush the decoding buffers. this should usually > > # return None, unless the fact that knowing that the > > # input stream has ended means that the state can be > > # interpreted in a meaningful way. however, if the > > # state indicates that there last character was not > > # finished, this method should raise a UnicodeError > > # exception. > > Could you explain for reason for having a .flush() method > and what it should return. in most cases, it should either return None, or raise a UnicodeError exception: >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> # yes, that's a valid Swedish sentence ;-) >>> s = u.encode("utf-8") >>> d = decoder("utf-8") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() UnicodeError: last character not complete on the other hand, there are situations where it might actually return a string. consider a "HTML entity decoder" which uses the following pattern to match a character entity: "&\w+;?" (note that the trailing semicolon is optional). >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> s = u.encode("html-entities") >>> d = decoder("html-entities") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() "ö" > Perhaps I'm missing something, but how would you define > stream codecs using this interface ? input: read chunks of data, decode, and keep extra data in a local buffer. output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...) > > Implementing stream codecs is left as an exercise (see the zlib > > material in the eff-bot guide for a decoder example). everybody should have a copy of the eff-bot guide ;-) (but alright, I plan to post a complete utf-8 implementation in a not too distant future). </F> From gstein@lyra.org Wed Nov 17 10:57:36 1999 From: gstein@lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38327F39.AA381647@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >... > I'd suggest grouping encodings: > > [encodings] > [iso} > [iso88591] > [iso88592] > [jis] > ... 
> [cyrillic] > ... > [misc] WHY?!?! This is taking a simple solution and making it complicated. I see no benefit to the creating yet-another-level-of-hierarchy. Why should they be grouped? Leave the modules just under "encodings" and be done with it. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed Nov 17 11:14:01 1999 From: gstein@lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38327B79.2415786B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >... > Anyway, I think that factory functions are the way to go, > because they offer more flexibility w/r to reusing already > instantiated codecs, importing modules on-the-fly as was > suggested in another thread (thereby making codec module > import lazy) or mapping encoder and decoder requests all > to one class. Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO). > So here's a new registry approach: > > unicodec.register(encoding,factory_function,action) > > with > encoding - name of the supported encoding, e.g. Shift_JIS > factory_function - a function that returns an object > or function ready to be used for action > action - a string stating the supported action: > 'encode' > 'decode' > 'stream write' > 'stream read' This action thing is subject to error. *if* you're wanting to go this route, then have: unicodec.register_encode(...) unicodec.register_decode(...) unicodec.register_stream_write(...) unicodec.register_stream_read(...) They are equivalent. Guido has also told me in the past that he dislikes parameters that alter semantics -- preferring different functions instead. (this is why there are a good number of PyBufferObject interfaces; I had fewer to start with) This suggested approach is also quite a bit more wordy/annoying than Fred's alternative: unicode.register('iso-8859-1', encoder, decoder, None, None) And don't say "future compatibility allows us to add new actions." Well, those same future changes can add new registration functions or additional parameters to the single register() function. Not that I'm advocating it, but register() could also take a single parameter: if a class, then instantiate it and call methods for each action; if an instance, then just call methods for each action. [ and the third/original variety: a function object as the first param is the actual hook, and params 2 thru 4 (each are optional, or just the stream funcs?) are the other hook functions ] > The factory_function API depends on the implementation of > the codec. The returned object's interface on the value of action: > > Codecs: > ------- > > obj = factory_function_for_<action>(errors='strict') Where does this "errors" value come from? How does a user alter that value? Without an ability to change this, I see no reason for a factory. [ and no: don't tell me it is a thread-state value :-) ] On the other hand: presuming the "errors" thing is valid, *then* I see a need for a factory. Truly... I dislike factories. IMO, they just add code/complexity in many cases where the functionality isn't needed. 
But that's just me :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From andy@robanal.demon.co.uk Wed Nov 17 11:17:00 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST) Subject: [Python-Dev] Rosette i18n API Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com> There is a very capable C++ library at http://rosette.basistech.com/ It is well worth looking at the things this API actually lets you do for ideas on patterns. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From gstein@lyra.org Wed Nov 17 11:21:18 1999 From: gstein@lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: > [MAL] > > FYI, the next version of the proposal ... > > File objects opened in text mode will use "t#" and binary ones use "s#". > > Am I the only one who sees magical distinctions between text and binary mode > as a Really Bad Idea? I wouldn't have guessed the Unix natives here would > quietly acquiesce to importing a bit of Windows madness <wink>. It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik@pythonware.com Wed Nov 17 11:36:32 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:36:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Greg Stein <gstein@lyra.org> wrote: > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). so where do you put the state? how do you reset the state between strings? how do you handle incremental decoding/encoding? etc. (I suggest taking another look at PIL's codec design. it solves all these problems with a minimum of code, and it works -- people have been hammering on PIL for years...) </F> From gstein@lyra.org Wed Nov 17 11:34:30 1999 From: gstein@lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Fredrik Lundh wrote: > Greg Stein <gstein@lyra.org> wrote: > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > so where do you put the state? encode() is not supposed to retain state. 
It is supposed to do a complete translation. It is not a stream thingy, which may have received partial characters. > how do you reset the state between > strings? There is none :-) > how do you handle incremental > decoding/encoding? Streams. -g -- Greg Stein, http://www.lyra.org/ From fredrik@pythonware.com Wed Nov 17 11:46:01 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:46:01 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > - suggestions for new issues that maybe ought to be settled in 1.6 three things: imputil, imputil, imputil </F> From fredrik@pythonware.com Wed Nov 17 11:51:33 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:51:33 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com> Greg Stein <gstein@lyra.org> wrote: > > so where do you put the state? > > encode() is not supposed to retain state. It is supposed to do a complete > translation. It is not a stream thingy, which may have received partial > characters. > > > how do you handle incremental > > decoding/encoding? > > Streams. hmm. why have two different mechanisms when you can do the same thing with one? </F> From gstein@lyra.org Wed Nov 17 13:01:47 1999 From: gstein@lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST) Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Guido van Rossum wrote: >... > Greg, I understand you have checkin privileges for Apache. What is > the procedure there for handing out those privileges? What is the > procedure for using them? (E.g. if you made a bogus change to part of > Apache you're not supposed to work on, what happens?) Somebody proposes that a person is added to the list of people with checkin privileges. If nobody else in the group vetoes that, then they're in (their system doesn't require continual participation by each member, so it can only operate at a veto level, rather than a unanimous assent). It is basically determined on the basis of merit -- has the person been active (on the Apache developer's mailing list) and has the person contributed something significant? Further, by providing commit access, will they further the goals of Apache? And, of course, does their temperament seem to fit in with the other group members? I can make any change that I'd like. However, there are about 20 other people who can easily revert or alter my changes if they're bogus. There are no programmatic restrictions.... You could say it is based on mutual respect and a social contract of behavior. Large changes should be discussed before committing to CVS. Bug fixes, doc enhancements, minor functional improvements, etc, all follow a commit-then-review process. I just check the thing in. Others see the diff (emailed to the checkins mailing list (this is different from Python-checkins which only says what files are changed, rather than providing the diff)) and can comment on the change, make their own changes, etc. To be concrete: I added the Expat code that now appears in Apache 1.3.9. Before doing so, I queried the group. 
There were some issues that I dealt with before finally commiting Expat to the CVS repository. On another occasion, I added a new API to Apache; again, I proposed it first, got an "all OK" and committed it. I've done a couple bug fixes which I just checked in. [ "all OK" means three +1 votes and no vetoes. everybody has veto ability (but the responsibility to explain why and to remove their veto when their concerns are addressed). ] On many occasions, I've reviewed the diffs that were posted to the checkins list, and made comments back to the author. I've caught a few problems this way. For Apache 2.0, even large changes are commit-then-review at this point. At some point, it will switch over to review-then-commit and the project will start moving towards stabilization/release. (bug fixes and stuff will always remain commit-then-review) I'll note that the process works very well given that diffs are emailed. I doubt that it would be effective if people had to fetch CVS diffs themselves. Your note also implies "areas of ownership". This doesn't really exist within Apache. There aren't even "primary authors" or things like that. I have the ability/rights to change any portions: from the low-level networking, to the documentation, to the server-side include processing. Of coures, if I'm going to make a big change, then I'll be posting a patch for review first, and whoever has worked in that area in the past may/will/should comment. Cheers, -g -- Greg Stein, http://www.lyra.org/ From guido@CNRI.Reston.VA.US Wed Nov 17 13:32:05 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:32:05 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST." <000801bf30de$85bea500$a42d153f@tim> References: <000801bf30de$85bea500$a42d153f@tim> Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us> > I'm specifically requesting not to have checkin privileges. So there. I will force nobody to use checkin privileges. However I see that for some contributors, checkin privileges will save me and them time. > I see two problems: > > 1. When patches go thru you, you at least eyeball them. This catches bugs > and design errors early. I will still eyeball them -- only after the fact. Since checkins are pretty public, being slapped on the wrist for a bad checkin is a pretty big embarrassment, so few contributors will check in buggy code more than once. Moreover, there will be more eyeballs. > 2. For a multi-platform app, few people have adequate resources for testing; > e.g., I can test under an obsolete version of Win95, and NT if I have to, > but that's it. You may not actually do better testing than that, but having > patches go thru you allows me the comfort of believing you do <wink>. I expect that the same mechanisms will apply. I have access to Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to check portability after things have been checked in. And again, there will be more testers. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Wed Nov 17 13:34:23 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:34:23 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST." 
<19991117075353.16046.rocketmail@web606.mail.yahoo.com> References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us> > This is the simplest if each codec really is likely to > be implemented in a separate module. But just look at > the data! All the iso-8859 encodings need identical > functionality, and just have a different mapping table > with 256 elements. It would be trivial to implement > these in one module. And the wide variety of Japanese > encodings (mostly corporate or historical variants of > the same character set) are again best treated from > one code base with a bunch of mapping tables and > routines to generate the variants - basically one can > store the deltas. > > So the choice is between possibly having a lot of > almost-dummy modules, or having Python modules which > generate and register a logical family of encodings. > > I may have some time next week and will try to code up > a few so we can pound on something. I see no problem with having a lot of near-dummy modules if it simplifies the architecture. You can still do code sharing. Files are cheap; APIs are expensive. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Wed Nov 17 13:38:35 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:38:35 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST." <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us> > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Agreed. Tim Peters once remarked that Python likes shallow encodings (or perhaps that *I* like them :-). This is one such case where I would strongly urge for the simplicity of a shallow hierarchy. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Wed Nov 17 13:43:44 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:43:44 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST." <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us> > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). Unless there are certain cases where factories are useful. But let's read on... > > action - a string stating the supported action: > > 'encode' > > 'decode' > > 'stream write' > > 'stream read' > > This action thing is subject to error. *if* you're wanting to go this > route, then have: > > unicodec.register_encode(...) > unicodec.register_decode(...) > unicodec.register_stream_write(...) > unicodec.register_stream_read(...) > > They are equivalent. Guido has also told me in the past that he dislikes > parameters that alter semantics -- preferring different functions instead. Yes, indeed! (But weren't we going to do away with the whole registry idea in favor of an encodings package?) 
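For concreteness, that on-demand lookup could be as small as something like the following -- the name folding rules and the helper names are just placeholders, not a spec:

    import string

    def _fold(encoding):
        # fold an encoding name into a module name, e.g.
        # 'ISO-8859-1' -> 'iso88591' (exact rules still to be decided)
        encoding = string.lower(encoding)
        encoding = string.join(string.split(encoding, '-'), '')
        encoding = string.join(string.split(encoding, ' '), '_')
        return encoding

    def getencoder(encoding):
        # import encodings.<folded name> on demand; the fromlist
        # argument makes __import__ return the submodule itself
        mod = __import__('encodings.' + _fold(encoding),
                         globals(), locals(), ['encode'])
        return getattr(mod, 'encode')

getencoder('ISO-8859-1') would then trigger the import of encodings/iso88591.py the first time that encoding is actually used, and installers or freeze can simply ship whatever subset of the package they want.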
> Not that I'm advocating it, but register() could also take a single > parameter: if a class, then instantiate it and call methods for each > action; if an instance, then just call methods for each action. Nah, that's bad -- a class is just a factory, and once you are allowing classes it's really good to also allowing factory functions. > [ and the third/original variety: a function object as the first param is > the actual hook, and params 2 thru 4 (each are optional, or just the > stream funcs?) are the other hook functions ] Fine too. They should all be optional. > > obj = factory_function_for_<action>(errors='strict') > > Where does this "errors" value come from? How does a user alter that > value? Without an ability to change this, I see no reason for a factory. > [ and no: don't tell me it is a thread-state value :-) ] > > On the other hand: presuming the "errors" thing is valid, *then* I see a > need for a factory. The idea is that various places that take an encoding name can also take a codec instance. So the user can call the factory function / class constructor. > Truly... I dislike factories. IMO, they just add code/complexity in many > cases where the functionality isn't needed. But that's just me :-) Get over it... In a sense, every Python class is a factory for its own instances! I think you must be confusing Python with Java or C++. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Wed Nov 17 13:56:56 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:56:56 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST." <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us> > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then they're > in (their system doesn't require continual participation by each member, > so it can only operate at a veto level, rather than a unanimous assent). > It is basically determined on the basis of merit -- has the person been > active (on the Apache developer's mailing list) and has the person > contributed something significant? Further, by providing commit access, > will they further the goals of Apache? And, of course, does their > temperament seem to fit in with the other group members? This makes sense, but I have one concern: if somebody who isn't liked very much (say a capable hacker who is a real troublemaker) asks for privileges, would people veto this? I'd be reluctant to go on record as veto'ing a particular person. (E.g. there are a few troublemakers in c.l.py, and I would never want them to join python-dev let alone give them commit privileges, but I'm not sure if I would want to discuss this on a publicly archived mailing list -- or even on a privately archived mailing list, given that the number of members might be in the hundreds. [...stuff I like...] > I'll note that the process works very well given that diffs are emailed. I > doubt that it would be effective if people had to fetch CVS diffs > themselves. That's a great idea; I'll see if we can do that to our checkin email, regardless of whether we hand out commit privileges. > Your note also implies "areas of ownership". This doesn't really exist > within Apache. There aren't even "primary authors" or things like that. 
I > have the ability/rights to change any portions: from the low-level > networking, to the documentation, to the server-side include processing. But that's Apache, which is explicitly run as a collective. In Python, I definitely want to have ownership of certain sections of the code. But I agree that this doesn't need to be formalized by access control lists; the social process you describe sounds like it will work just fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Wed Nov 17 14:44:25 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST) Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <000601bf30da$e069d820$a42d153f@tim> References: <14385.33486.855802.187739@weyr.cnri.reston.va.us> <000601bf30da$e069d820$a42d153f@tim> Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Tim Peters writes: > about it. Guido hasn't shown visible interest, and nobody has been willing > to fight him to the death over it. So it languishes. Buy him lunch > tomorrow and get him excited <wink>. Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From tim_one@email.msn.com Thu Nov 18 03:21:16 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:16 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <38327D8F.7A5352E6@lemburg.com> Message-ID: <000101bf3173$f9805340$c0a0143f@tim> [MAL] > Guido and I have decided to turn \uXXXX into a standard > escape sequence with no further magic applied. \uXXXX will > only be expanded in u"" strings. Does that exclude ur"" strings? Not arguing either way, just don't know what all this means. > Here's the new scheme: > > With the 'unicode-escape' encoding being defined as: > > · all non-escape characters represent themselves as a Unicode ordinal > (e.g. 'a' -> U+0061). Same as before (scream if that's wrong). > · all existing defined Python escape sequences are interpreted as > Unicode ordinals; Same as before (ditto). > note that \xXXXX can represent all Unicode ordinals, This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition? > and \OOO (octal) can represent Unicode ordinals up to U+01FF. Same as before (ditto). > · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > error to have fewer than 4 digits after \u. Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes. > Examples: > > u'abc' -> U+0061 U+0062 U+0063 > u'\u1234' -> U+1234 > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c The last example is damaged (U+05c isn't legit). Other than that, these look the same as before. > Now how should we define ur"abc\u1234\n" ... ? If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. 
If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it). From tim_one@email.msn.com Thu Nov 18 03:21:19 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:19 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim> [MAL] > File objects opened in text mode will use "t#" and binary > ones use "s#". [Greg Stein] > ... > The real annoying thing would be to assume that opening a file as 'r' > means that I *meant* text mode and to start using "t#". Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode. > In actuality, I typically open files that way since I do most of my > coding on Linux. If I now have to pay attention to things and open it > as 'rb', then I'll be pissed. > > And the change in behavior and bugs that interpreting 'r' as text would > introduce? Ack! 'r' is already intepreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be. From tim_one@email.msn.com Thu Nov 18 03:23:00 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:23:00 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us> Message-ID: <000301bf3174$37b465c0$c0a0143f@tim> [Guido] > I will force nobody to use checkin privileges. That almost went without saying <wink>. > However I see that for some contributors, checkin privileges will > save me and them time. Then it's Good! Provided it doesn't hurt language stability. I agree that changing the system to mail out diffs addresses what I was worried about there. From tim_one@email.msn.com Thu Nov 18 03:31:38 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:31:38 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us> Message-ID: <000401bf3175$6c089660$c0a0143f@tim> [Greg] > ... > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then ? they're in ... [Guido] > This makes sense, but I have one concern: if somebody who isn't liked > very much (say a capable hacker who is a real troublemaker) asks for > privileges, would people veto this? It seems that a key point in Greg's description is that people don't propose *themselves* for checkin. They have to talk someone else into proposing them. That should keep Endang out of the running for a few years <wink>. After that, I care more about their code than their personalities. If the stuff they check in is good, fine; if it's not, lock 'em out for direct cause. > I'd be reluctant to go on record as veto'ing a particular person. 
Secret Ballot run off a web page -- although not so secret you can't see who voted for what <wink>. From tim_one@email.msn.com Thu Nov 18 03:37:18 1999 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:37:18 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim> [Fred L. Drake, Jr.] > Guido has asked me to pursue this topic [weak refs], so I'll be > checking out available implementations and seeing if any are > adoptable or if something different is needed to be fully general > and well-integrated. Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate). From gstein@lyra.org Thu Nov 18 08:09:24 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <Pine.LNX.4.10.9911180008020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: >... > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Thu Nov 18 08:15:04 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:15:04 +0100 Subject: [Python-Dev] just say no... References: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <3833B588.1E31F01B@lemburg.com> Tim Peters wrote: > > [MAL] > > File objects opened in text mode will use "t#" and binary > > ones use "s#". > > [Greg Stein] > > ... > > The real annoying thing would be to assume that opening a file as 'r' > > means that I *meant* text mode and to start using "t#". > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > either; a lone "r" has always meant text mode. Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardeless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance. > > In actuality, I typically open files that way since I do most of my > > coding on Linux. If I now have to pay attention to things and open it > > as 'rb', then I'll be pissed. > > > > And the change in behavior and bugs that interpreting 'r' as text would > > introduce? Ack! > > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. 
Only other buffer compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself argueing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 08:59:21 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:59:21 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Message-ID: <3833BFE9.6FD118B1@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > - suggestions for new issues that maybe ought to be settled in 1.6 > > three things: imputil, imputil, imputil But please don't add the current version as default importer... its strategy is way too slow for real life apps (yes, I've tested this: imports typically take twice as long as with the builtin importer). I'd opt for an import manager which provides a useful API for import hooks to register themselves with. What we really need is not yet another complete reimplementation of what the builtin importer does, but rather a more detailed exposure of the various import aspects: finding modules and loading modules. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 08:50:36 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:50:36 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> Message-ID: <3833BDDC.7CD2CC1F@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg <mal@lemburg.com> wrote: > > > def flush(self): > > > # flush the decoding buffers. this should usually > > > # return None, unless the fact that knowing that the > > > # input stream has ended means that the state can be > > > # interpreted in a meaningful way. however, if the > > > # state indicates that there last character was not > > > # finished, this method should raise a UnicodeError > > > # exception. > > > > Could you explain for reason for having a .flush() method > > and what it should return. > > in most cases, it should either return None, or > raise a UnicodeError exception: > > >>> u = unicode("å i åa ä e ö", "iso-latin-1") > >>> # yes, that's a valid Swedish sentence ;-) > >>> s = u.encode("utf-8") > >>> d = decoder("utf-8") > >>> d.decode(s[:-1]) > "å i åa ä e " > >>> d.flush() > UnicodeError: last character not complete > > on the other hand, there are situations where it > might actually return a string. consider a "HTML > entity decoder" which uses the following pattern > to match a character entity: "&\w+;?" (note that > the trailing semicolon is optional). > > >>> u = unicode("å i åa ä e ö", "iso-latin-1") > >>> s = u.encode("html-entities") > >>> d = decoder("html-entities") > >>> d.decode(s[:-1]) > "å i åa ä e " > >>> d.flush() > "ö" Ah, ok. 
So the .flush() method checks for proper string endings and then either returns the remaining input or raises an error. > > Perhaps I'm missing something, but how would you define > > stream codecs using this interface ? > > input: read chunks of data, decode, and > keep extra data in a local buffer. > > output: encode data into suitable chunks, > and write to the output stream (that's why > there's a buffersize argument to encode -- > if someone writes a 10mb unicode string to > an encoded stream, python shouldn't allocate > an extra 10-30 megabytes just to be able to > encode the darn thing...) So the stream codecs would be wrappers around the string codecs. Have you read my latest version of the Codec interface ? Wouldn't that be a reasonable approach ? Note that I have integrated your ideas into the new API -- it's basically only missing the .flush() methods, which I can add now that I know what you meant. > > > Implementing stream codecs is left as an exercise (see the zlib > > > material in the eff-bot guide for a decoder example). > > everybody should have a copy of the eff-bot guide ;-) Sure, but the format, the format... make it printed and add a CD and you would probably have a good selling book there ;-) > (but alright, I plan to post a complete utf-8 implementation > in a not too distant future). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 08:16:48 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:16:48 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <3833B5F0.FA4620AD@lemburg.com> Greg Stein wrote: > > On Wed, 17 Nov 1999, M.-A. Lemburg wrote: > >... > > I'd suggest grouping encodings: > > > > [encodings] > > [iso} > > [iso88591] > > [iso88592] > > [jis] > > ... > > [cyrillic] > > ... > > [misc] > > WHY?!?! > > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Nevermind, was just an idea... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 08:43:31 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:43:31 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> <199911171343.IAA03636@kaluha.cnri.reston.va.us> Message-ID: <3833BC33.66E134F@lemburg.com> Guido van Rossum wrote: > > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > Unless there are certain cases where factories are useful. But let's > read on... > > > > action - a string stating the supported action: > > > 'encode' > > > 'decode' > > > 'stream write' > > > 'stream read' > > > > This action thing is subject to error. *if* you're wanting to go this > > route, then have: > > > > unicodec.register_encode(...) > > unicodec.register_decode(...) > > unicodec.register_stream_write(...) > > unicodec.register_stream_read(...) > > > > They are equivalent. 
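Returning to the flush() question for a moment: the buffering pattern is easy to show with a toy stateful decoder. The encoding used here (two hex digits per character) is made up purely for illustration and is not part of any proposal:

    class HexStreamDecoder:
        # Keeps an incomplete trailing digit in a local buffer across
        # decode() calls; flush() complains if the input ended in the
        # middle of a character.
        def __init__(self):
            self.buffer = ''

        def decode(self, data):
            data = self.buffer + data
            cut = len(data) - (len(data) % 2)   # only complete pairs are decoded
            self.buffer = data[cut:]
            out = ''
            for i in range(0, cut, 2):
                out = out + chr(int(data[i:i+2], 16))
            return out

        def flush(self):
            if self.buffer:
                raise ValueError('last character not complete')

    d = HexStreamDecoder()
    d.decode('616')   # -> 'a', with the lone '6' kept in the buffer
    d.decode('2')     # -> 'b' (buffered '6' plus '2')
    d.flush()         # fine; would raise if a digit were still pending

A real stream decoder for UTF-8 or a shift-state encoding would follow the same shape, only with more interesting buffering rules.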
Guido has also told me in the past that he dislikes > > parameters that alter semantics -- preferring different functions instead. > > Yes, indeed! Ok. > (But weren't we going to do away with the whole registry > idea in favor of an encodings package?) One way or another, the Unicode implementation will have to access a dictionary containing references to the codecs for a particular encoding. You won't get around registering these at some point... be it in a lazy way, on-the-fly or by some other means. What we could do is implement the lookup like this: 1. call encodings.lookup_<action>(encoding) and use the return value for the conversion 2. if all fails, cop out with an error Step 1. would do all the import magic and then register the found codecs in some dictionary for faster access (perhaps this could be done in a way that is directly available to the Unicode implementation, e.g. in a global internal dictionary -- the one I originally had in mind for the unicodec registry). > > Not that I'm advocating it, but register() could also take a single > > parameter: if a class, then instantiate it and call methods for each > > action; if an instance, then just call methods for each action. > > Nah, that's bad -- a class is just a factory, and once you are > allowing classes it's really good to also allowing factory functions. > > > [ and the third/original variety: a function object as the first param is > > the actual hook, and params 2 thru 4 (each are optional, or just the > > stream funcs?) are the other hook functions ] > > Fine too. They should all be optional. Ok. > > > obj = factory_function_for_<action>(errors='strict') > > > > Where does this "errors" value come from? How does a user alter that > > value? Without an ability to change this, I see no reason for a factory. > > [ and no: don't tell me it is a thread-state value :-) ] > > > > On the other hand: presuming the "errors" thing is valid, *then* I see a > > need for a factory. > > The idea is that various places that take an encoding name can also > take a codec instance. So the user can call the factory function / > class constructor. Right. The argument is reachable via: Codec = encodings.lookup_encode('utf-8') codec = Codec(errors='?') s = codec(u"abcäöäü") s would then equal 'abc??'. -- Should I go ahead then and change the registry business to the new strategy (via the encodings package in the above sense) ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Thu Nov 18 10:57:44 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 18 Nov 1999 21:57:44 +1100 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833BC33.66E134F@lemburg.com> Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat> [Guido] > > (But weren't we going to do away with the whole registry > > idea in favor of an encodings package?) > [MAL] > One way or another, the Unicode implementation will have to > access a dictionary containing references to the codecs for > a particular encoding. You won't get around registering these > at some point... be it in a lazy way, on-the-fly or by some > other means. What is wrong with my idea of using well-known-names from the encoding module? The dict then is "encodings.<encoding-name>.__dict__". All encodings "just work" because the leverage from the Python module system. 
Unless Im missing something, there is no need for any extra registry at all. I guess it would actually resolve to 2 dict lookups, but thats OK surely? Mark. From mal@lemburg.com Thu Nov 18 09:39:30 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 10:39:30 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> Message-ID: <3833C952.C6F154B1@lemburg.com> Tim Peters wrote: > > [MAL] > > Guido and I have decided to turn \uXXXX into a standard > > escape sequence with no further magic applied. \uXXXX will > > only be expanded in u"" strings. > > Does that exclude ur"" strings? Not arguing either way, just don't know > what all this means. > > > Here's the new scheme: > > > > With the 'unicode-escape' encoding being defined as: > > > > · all non-escape characters represent themselves as a Unicode ordinal > > (e.g. 'a' -> U+0061). > > Same as before (scream if that's wrong). > > > · all existing defined Python escape sequences are interpreted as > > Unicode ordinals; > > Same as before (ditto). > > > note that \xXXXX can represent all Unicode ordinals, > > This means that the definition of \xXXXX has changed, then -- as you pointed > out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new > \x definition apply only in u"" strings, or in "" strings too? What is the > new \x definition? Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX). > > and \OOO (octal) can represent Unicode ordinals up to U+01FF. > > Same as before (ditto). > > > · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > > error to have fewer than 4 digits after \u. > > Same as before (ditto). > > IOW, I don't see anything that's changed other than an unspecified new > treatment of \x escapes, and possibly that ur"" strings don't expand \u > escapes. The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan). > > Examples: > > > > u'abc' -> U+0061 U+0062 U+0063 > > u'\u1234' -> U+1234 > > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c > > The last example is damaged (U+05c isn't legit). Other than that, these > look the same as before. Corrected; thanks. > > Now how should we define ur"abc\u1234\n" ... ? > > If strings carried an encoding tag with them, the obvious answer is that > this acts exactly like r"abc\u1234\n" acts today except gets a > "unicode-escaped" encoding tag instead of a "[whatever the default is > today]" encoding tag. > > If strings don't carry an encoding tag with them, you're in a bit of a > pickle: you'll have to convert it to a regular string or a Unicode string, > but in either case have no way to communicate that it may need further > processing; i.e., no way to distinguish it from a regular or Unicode string > produced by any other mechanism. The code I posted yesterday remains my > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > fiddling with backslashes just enough to get the \u escapes expanded, in the > same way Java's (conceptual) preprocessor does it). They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. 
scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 11:41:32 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 12:41:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> Message-ID: <3833E5EC.AAFE5016@lemburg.com> Mark Hammond wrote: > > [Guido] > > > (But weren't we going to do away with the whole registry > > > idea in favor of an encodings package?) > > > [MAL] > > One way or another, the Unicode implementation will have to > > access a dictionary containing references to the codecs for > > a particular encoding. You won't get around registering these > > at some point... be it in a lazy way, on-the-fly or by some > > other means. > > What is wrong with my idea of using well-known-names from the encoding > module? The dict then is "encodings.<encoding-name>.__dict__". All > encodings "just work" because the leverage from the Python module > system. Unless Im missing something, there is no need for any extra > registry at all. I guess it would actually resolve to 2 dict lookups, > but thats OK surely? The problem is that the encoding names are not Python identifiers, e.g. iso-8859-1 is not allowed as an identifier. This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary. I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package... BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good. PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gmcm@hypernet.com Thu Nov 18 14:19:48 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 09:19:48 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <3833BFE9.6FD118B1@lemburg.com> Message-ID: <1269187709-18981857@hypernet.com> Marc-Andre wrote: > Fredrik Lundh wrote: > > > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > three things: imputil, imputil, imputil > > But please don't add the current version as default importer... > its strategy is way too slow for real life apps (yes, I've tested > this: imports typically take twice as long as with the builtin > importer). 
I think imputil's emulation of the builtin importer is more of a demonstration than a serious implementation. As for speed, it depends on the test. > I'd opt for an import manager which provides a useful API for > import hooks to register themselves with. I think that rather than blindly chain themselves together, there should be a simple minded manager. This could let the programmer prioritize them. > What we really need > is not yet another complete reimplementation of what the > builtin importer does, but rather a more detailed exposure of > the various import aspects: finding modules and loading modules. The first clause I sort of agree with - the current implementation is a fine implementation of a filesystem directory based importer. I strongly disagree with the second clause. The current import hooks are just such a detailed exposure; and they are incomprehensible and unmanagable. I guess you want to tweak the "finding" part of the builtin import mechanism. But that's no reason to ask all importers to break themselves up into "find" and "load" pieces. It's a reason to ask that the standard importer be, in some sense, "subclassable" (ie, expose hooks, or perhaps be an extension class like thingie). - Gordon From jim@interet.com Thu Nov 18 14:39:20 1999 From: jim@interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 09:39:20 -0500 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> Message-ID: <38340F98.212F61@interet.com> Gordon McMillan wrote: > > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. IMHO the current import mechanism is good for developers who must work on the library code in the directory tree, but a disaster for sysadmins who must distribute Python applications either internally to a number of machines or commercially. What we need is a standard Python library file like a Java "Jar" file. Imputil can support this as 130 lines of Python. I have also written one in C. I like the imputil approach, but if we want to add a library importer to import.c, I volunteer to write it. I don't want to just add more complicated and unmanageable hooks which people will all use different ways and just add to the confusion. It is easy to install packages by just making them into a library file and throwing it into a directory. So why aren't we doing it? Jim Ahlstrom From guido@CNRI.Reston.VA.US Thu Nov 18 15:30:28 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:30:28 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST." 
<1269187709-18981857@hypernet.com> References: <1269187709-18981857@hypernet.com> Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us> Gordon McMillan wrote: > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > Guido van Rossum <guido@CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. Agreed. I like some of imputil's features, but I think the API need to be redesigned. > > I'd opt for an import manager which provides a useful API for > > import hooks to register themselves with. > > I think that rather than blindly chain themselves together, there > should be a simple minded manager. This could let the > programmer prioritize them. Indeed. (A list of importers has been suggested, to replace the list of directories currently used.) > > What we really need > > is not yet another complete reimplementation of what the > > builtin importer does, but rather a more detailed exposure of > > the various import aspects: finding modules and loading modules. > > The first clause I sort of agree with - the current > implementation is a fine implementation of a filesystem > directory based importer. > > I strongly disagree with the second clause. The current import > hooks are just such a detailed exposure; and they are > incomprehensible and unmanagable. Based on how many people have successfully written import hooks, I have to agree. :-( > I guess you want to tweak the "finding" part of the builtin > import mechanism. But that's no reason to ask all importers > to break themselves up into "find" and "load" pieces. It's a > reason to ask that the standard importer be, in some sense, > "subclassable" (ie, expose hooks, or perhaps be an extension > class like thingie). Agreed. Subclassing is a good way towards flexibility. And Jim Ahlstrom writes: > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. Unfortunately, you're right. :-( > What we need is a standard Python library file like a Java "Jar" > file. Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want to > add a library importer to import.c, I volunteer to write it. Please volunteer to design or at least review the grand architecture -- see below. > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. You're so right! > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Rhetorical question. :-) So here's a challenge: redesign the import API from scratch. Let me start with some requirements. 
Compatibility issues: --------------------- - the core API may be incompatible, as long as compatibility layers can be provided in pure Python - support for rexec functionality - support for freeze functionality - load .py/.pyc/.pyo files and shared libraries from files - support for packages - sys.path and sys.modules should still exist; sys.path might have a slightly different meaning - $PYTHONPATH and $PYTHONHOME should still be supported (I wouldn't mind a splitting up of importdl.c into several platform-specific files, one of which is chosen by the configure script; but that's a bit of a separate issue.) New features: ------------- - Integrated support for Greg Ward's distribution utilities (i.e. a module prepared by the distutil tools should install painlessly) - Good support for prospective authors of "all-in-one" packaging tool authors like Gordon McMillan's win32 installer or /F's squish. (But I *don't* require backwards compatibility for existing tools.) - Standard import from zip or jar files, in two ways: (1) an entry on sys.path can be a zip/jar file instead of a directory; its contents will be searched for modules or packages (2) a file in a directory that's on sys.path can be a zip/jar file; its contents will be considered as a package (note that this is different from (1)!) I don't particularly care about supporting all zip compression schemes; if Java gets away with only supporting gzip compression in jar files, so can we. - Easy ways to subclass or augment the import mechanism along different dimensions. For example, while none of the following features should be part of the core implementation, it should be easy to add any or all: - support for a new compression scheme to the zip importer - support for a new archive format, e.g. tar - a hook to import from URLs or other data sources (e.g. a "module server" imported in CORBA) (this needn't be supported through $PYTHONPATH though) - a hook that imports from compressed .py or .pyc/.pyo files - a hook to auto-generate .py files from other filename extensions (as currently implemented by ILU) - a cache for file locations in directories/archives, to improve startup time - a completely different source of imported modules, e.g. for an embedded system or PalmOS (which has no traditional filesystem) - Note that different kinds of hooks should (ideally, and within reason) properly combine, as follows: if I write a hook to recognize .spam files and automatically translate them into .py files, and you write a hook to support a new archive format, then if both hooks are installed together, it should be possible to find a .spam file in an archive and do the right thing, without any extra action. Right? - It should be possible to write hooks in C/C++ as well as Python - Applications embedding Python may supply their own implementations, default search path, etc., but don't have to if they want to piggyback on an existing Python installation (even though the latter is fraught with risk, it's cheaper and easier to understand). Implementation: --------------- - There must clearly be some code in C that can import certain essential modules (to solve the chicken-or-egg problem), but I don't mind if the majority of the implementation is written in Python. Using Python makes it easy to subclass. - In order to support importing from zip/jar files using compression, we'd at least need the zlib extension module and hence libz itself, which may not be available everywhere. 
- I suppose that the bootstrap is solved using a mechanism very similar to what freeze currently used (other solutions seem to be platform dependent). - I also want to still support importing *everything* from the filesystem, if only for development. (It's hard enough to deal with the fact that exceptions.py is needed during Py_Initialize(); I want to be able to hack on the import code written in Python without having to rebuild the executable all the time. Let's first complete the requirements gathering. Are these requirements reasonable? Will they make an implementation too complex? Am I missing anything? Finally, to what extent does this impact the desire for dealing differently with the Python bytecode compiler (e.g. supporting optimizers written in Python)? And does it affect the desire to implement the read-eval-print loop (the >>> prompt) in Python? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Thu Nov 18 15:37:49 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:37:49 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100." <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us> > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.) > This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. But it could be enough to register a package where to look for encodings (in addition to the system package). Or there could be a registry for encoding search functions. (See the import discussion.) > I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any > user interaction. There would only have to be an API for > looking up an encoding published by the encodings package for > the Unicode implementation to use. The magic behind that API > is left to the encodings package... I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do. > BTW, nothing's wrong with your idea :-) In fact, I like it > a lot because it keeps the encoding modules out of the > top-level scope which is good. Yes. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, > e.g. stream ciphers or pickle mechanisms. The step in that > direction is not a big one: we'd only have to drop the specification > of the Unicode object in the spec and replace it with an arbitrary > object. Of course, this will still have to be a Unicode object > for use by the Unicode implementation. This is a step towards Java's architecture of stackable streams. 
But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Thu Nov 18 15:52:26 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 16:52:26 +0100 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com> Message-ID: <383420BA.EF8A6AC5@lemburg.com> [imputil and friends] "James C. Ahlstrom" wrote: > > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. What we > need is a standard Python library file like a Java "Jar" file. > Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want > to add a library importer to import.c, I volunteer to write it. > > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. > > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Perhaps we ought to rethink the strategy under a different light: what are the real requirement we have for Python imports ? Perhaps the outcome is only the addition of say one or two features and those can probably easily be added to the builtin system... then we can just forget about the whole import hook dilema for quite a while (AFAIK, this is how we got packages into the core -- people weren't happy with the import hook). Well, just an idea... I have other threads to follow :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Thu Nov 18 16:01:47 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. This isn't a substantial problem. Try this on for size (probably not too different from what everyone is already thinking, but let's make it clear). This could be in encodings/__init__.py; I've tried to be really clear on the names. (No testing, only partially complete.) 
------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode, make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]
    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError(encoding, "unknown encoding")
    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]
    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode, make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for make_stream_decoder()...

    def _update(self, encode, decode, make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------
> I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any No problem at all; we just need to make sure the right magic is there for the "normal" case. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, File formats are different from text encodings, so let's keep them separate. Yes, a registry can be a good approach whenever the various things being registered are sufficiently similar semantically, but the behavior of the registry/lookup can be very different for each type of thing. Let's not over-generalize. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From fdrake@acm.org Thu Nov 18 16:02:45 1999 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us> Er, I should note that the sample code I just sent makes use of string methods. ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives From mal@lemburg.com Thu Nov 18 16:23:09 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 17:23:09 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> Message-ID: <383427ED.45A01BBB@lemburg.com> Guido van Rossum wrote: > > > The problem is that the encoding names are not Python identifiers, > > e.g. iso-8859-1 is allowed as identifier. > > This is easily taken care of by translating each string of consecutive > non-identifier-characters to an underscore, so this would import the > iso_8859_1.py module. (I also noticed in an earlier post that the > official name for Shift_JIS has an underscore, while most other > encodings use hyphens.) Right. That's one way of doing it. > > This and > > the fact that applications may want to ship their own codecs (which > > do not get installed under the system wide encodings package) > > make the registry necessary. > > But it could be enough to register a package where to look for > encodings (in addition to the system package). > > Or there could be a registry for encoding search functions. (See the > import discussion.) Like a path of search functions ? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be, but even a few hundred function call will slow down the Unicode implementation quite a bit. The implementation could proceed as follows: def lookup(encoding): codecs = _internal_dict.get(encoding,None) if codecs: return codecs for query in sys.encoders: codecs = query(encoding) if codecs: break else: raise UnicodeError,'unkown encoding: %s' % encoding _internal_dict[encoding] = codecs return codecs For simplicity, codecs should be a tuple (encoder,decoder, stream_writer,stream_reader) of factory functions. ...that is if we can agree on these 4 APIs :-) Here are my current versions: ----------------------------------------------------------------------- class Codec: """ Defines the interface for stateless encoders/decoders. """ def __init__(self,errors='strict'): """ Creates a Codec instance. The Codec may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.errors = errors def encode(self,u,slice=None): """ Return the Unicode object u encoded as Python string. If slice is given (as slice object), only the sliced part of the Unicode object is encoded. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... def decode(self,s,offset=0): """ Decodes data from the Python string s and returns a tuple (Unicode object, bytes consumed). If offset is given, the decoding process starts at s[offset]. 
It defaults to 0. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... StreamWriter and StreamReader define the interface for stateful encoders/decoders: class StreamWriter(Codec): def __init__(self,stream,errors='strict'): """ Creates a StreamWriter instance. stream must be a file-like object open for writing (binary) data. The StreamWriter may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def write(self,u,slice=None): """ Writes the Unicode object's contents encoded to self.stream and returns the number of bytes written. If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def flush(self): """ Flushed the codec buffers used for keeping state. Returns values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used. """ pass class StreamReader(Codec): def __init__(self,stream,errors='strict'): """ Creates a StreamReader instance. stream must be a file-like object open for reading (binary) data. The StreamReader may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def read(self,chunksize=0): """ Decodes data from the stream self.stream and returns a tuple (Unicode object, bytes consumed). chunksize indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value 0 indicates to read and decode as much as possible. The chunksize is intended to prevent having to decode huge files in one step. """ ... the base class should provide a default implementation of this method using self.decode ... def flush(self): """ Flushed the codec buffers used for keeping state. Returns values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used. """ In addition to the above methods, the StreamWriter and StreamReader instances should also provide access to all other methods defined for the stream object. Stream codecs are free to combine the StreamWriter and StreamReader interfaces into one class. ----------------------------------------------------------------------- > > I don't see a problem with the registry though -- the encodings > > package can take care of the registration process without any > > user interaction. There would only have to be an API for > > looking up an encoding published by the encodings package for > > the Unicode implementation to use. The magic behind that API > > is left to the encodings package... 
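To make the interface above concrete, here is what a trivial stateless codec could look like. It is only a sketch: plain strings stand in for Unicode objects (which don't exist yet), ValueError stands in for the proposed UnicodeError, and only the stateless Codec half is shown:

    class AsciiCodec:

        def __init__(self, errors='strict'):
            self.errors = errors

        def encode(self, u, slice=None):
            if slice is not None:
                u = u[slice]
            out = ''
            for ch in u:
                if ord(ch) < 128:
                    out = out + ch
                elif self.errors == 'strict':
                    raise ValueError('ASCII encoding error: ordinal not in range(128)')
                elif self.errors != 'ignore':
                    out = out + self.errors      # single replacement character
            return out

        def decode(self, s, offset=0):
            data = s[offset:]
            for ch in data:
                if ord(ch) >= 128 and self.errors == 'strict':
                    raise ValueError('ASCII decoding error: ordinal not in range(128)')
            return data, len(data)

A StreamWriter/StreamReader pair for a stateless encoding like this could simply wrap encode()/decode() around the underlying stream's write()/read().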
> > I think that the collection of encodings will eventually grow large > enough to make it a requirement to avoid doing work proportional to > the number of supported encodings at startup (or even when an encoding > is referenced for the first time). Any "lazy" mechanism (of which > module search is an example) will do. Right. The list of search functions should provide this kind of lazyness. It also provides ways to implement other strategies to look for codecs, e.g. PIL could provide such a search function for its codecs, mxCrypto for the included ciphers, etc. > > BTW, nothing's wrong with your idea :-) In fact, I like it > > a lot because it keeps the encoding modules out of the > > top-level scope which is good. > > Yes. > > > PS: we could probably even take the whole codec idea one step > > further and also allow other input/output formats to be registered, > > e.g. stream ciphers or pickle mechanisms. The step in that > > direction is not a big one: we'd only have to drop the specification > > of the Unicode object in the spec and replace it with an arbitrary > > object. Of course, this will still have to be a Unicode object > > for use by the Unicode implementation. > > This is a step towards Java's architecture of stackable streams. > > But I'm always in favor of tackling what we know we need before > tackling the most generalized version of the problem. Well, I just wanted to mention the possibility... might be something to look into next year. I find it rather thrilling to be able to create encrypted streams by just hooking together a few stream codecs... f = open('myfile.txt','w') CipherWriter = sys.codec('rc5-cipher')[3] sf = StreamWriter(f,key='xxxxxxxx') UTF8Writer = sys.codec('utf-8')[3] sfx = UTF8Writer(sf) sfx.write('asdfasdfasdfasdf') sfx.close() Hmm, we should probably define the additional constructor arguments to be keyword arguments... writers/readers other than Unicode ones will probably need different kinds of parameters (such as the key in the above example). Ahem, ...I'm getting distracted here :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Thu Nov 18 16:23:41 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <14388.8997.703108.401808@weyr.cnri.reston.va.us> Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us> >>>>> "Fred" == Fred L Drake, Jr <fdrake@acm.org> writes: Fred> Er, I should note that the sample code I just sent makes Fred> use of string methods. ;) Yay! From guido@CNRI.Reston.VA.US Thu Nov 18 16:37:08 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:37:08 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100." <383427ED.45A01BBB@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us> > Like a path of search functions ? Not a bad idea... I will still > want the internal dict for caching purposes though. 
I'm not sure > how often these encodings will be, but even a few hundred function > call will slow down the Unicode implementation quite a bit. Of course. (It's like sys.modules caching the results of an import). [...] > def flush(self): > > """ Flushed the codec buffers used for keeping state. > > Returns values are not defined. Implementations are free to > return None, raise an exception (in case there is pending > data in the buffers which could not be decoded) or > return any remaining data from the state buffers used. > > """ I don't know where this came from, but a flush() should work like flush() on a file. It doesn't return a value, it just sends any remaining data to the underlying stream (for output). For input it shouldn't be supported at all. The idea is that flush() should do the same to the encoder state that close() followed by a reopen() would do. Well, more or less. But if the process were to be killed right after a flush(), the data written to disk should be a complete encoding, and not have a lingering shift state. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Thu Nov 18 16:59:06 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:59:06 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100." <3833BDDC.7CD2CC1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> <3833BDDC.7CD2CC1F@lemburg.com> Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us> [Responding to some lingering mails] [/F] > > >>> u = unicode("å i åa ä e ö", "iso-latin-1") > > >>> s = u.encode("html-entities") > > >>> d = decoder("html-entities") > > >>> d.decode(s[:-1]) > > "å i åa ä e " > > >>> d.flush() > > "ö" [MAL] > Ah, ok. So the .flush() method checks for proper > string endings and then either returns the remaining > input or raises an error. No, please. See my previous post on flush(). > > input: read chunks of data, decode, and > > keep extra data in a local buffer. > > > > output: encode data into suitable chunks, > > and write to the output stream (that's why > > there's a buffersize argument to encode -- > > if someone writes a 10mb unicode string to > > an encoded stream, python shouldn't allocate > > an extra 10-30 megabytes just to be able to > > encode the darn thing...) > > So the stream codecs would be wrappers around the > string codecs. No -- the other way around. Think of the stream encoder as a little FSM engine that you feed with unicode characters and which sends bytes to the backend stream. When a unicode character comes in that requires a particular shift state, and the FSM isn't in that shift state, it emits the escape sequence to enter that shift state first. It should use standard buffered writes to the output stream; i.e. one call to feed the encoder could cause several calls to write() on the output stream, or vice versa (if you fed the encoder a single character it might keep it in its own buffer). That's all up to the codec implementation. The flush() forces the FSM into the "neutral" shift state, possibly writing an escape sequence to leave the current shift state, and empties the internal buffer. The string codec CONCEPTUALLY uses the stream codec to a cStringIO object, using flush() to force the final output. 
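In rough code, that conceptual path looks like this (stream_writer_factory is only a placeholder name for whatever factory the lookup machinery hands back, not a proposed API):

    from cStringIO import StringIO

    def encode_string(u, stream_writer_factory):
        # String encoding built on top of a (possibly stateful) stream
        # codec: write everything into an in-memory file, flush to force
        # out any pending shift state, and return the accumulated bytes.
        sio = StringIO()
        writer = stream_writer_factory(sio)
        writer.write(u)
        writer.flush()
        return sio.getvalue()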
However the implementation may take a shortcut. For stateless encodings the stream codec may call on the string codec, but that's all an implementation issue. For input, things are slightly different (you don't know how much encoded data you must read to give you N Unicode characters, so you may have to make a guess and hold on to some data that you read unnecessarily -- either in encoded form or in Unicode form, at the discretion of the implementation. Using seek() on the input stream is forbidden (it could be a pipe or socket). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Thu Nov 18 17:11:51 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 12:11:51 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100." <3833C952.C6F154B1@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us> > > > Now how should we define ur"abc\u1234\n" ... ? > > > > If strings carried an encoding tag with them, the obvious answer is that > > this acts exactly like r"abc\u1234\n" acts today except gets a > > "unicode-escaped" encoding tag instead of a "[whatever the default is > > today]" encoding tag. > > > > If strings don't carry an encoding tag with them, you're in a bit of a > > pickle: you'll have to convert it to a regular string or a Unicode string, > > but in either case have no way to communicate that it may need further > > processing; i.e., no way to distinguish it from a regular or Unicode string > > produced by any other mechanism. The code I posted yesterday remains my > > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > > fiddling with backslashes just enough to get the \u escapes expanded, in the > > same way Java's (conceptual) preprocessor does it). > > They don't have such tags... so I guess we're in trouble ;-) > > I guess to make ur"" have a meaning at all, we'd need to go > the Java preprocessor way here, i.e. scan the string *only* > for \uXXXX sequences, decode these and convert the rest as-is > to Unicode ordinals. > > Would that be ok ? Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@mojam.com (Skip Montanaro) Thu Nov 18 17:34:51 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <14388.14523.158050.594595@dolphin.mojam.com> >> FYI, the next version of the proposal ... File objects opened in >> text mode will use "t#" and binary ones use "s#". Tim> Am I the only one who sees magical distinctions between text and Tim> binary mode as a Really Bad Idea? No. Tim> I wouldn't have guessed the Unix natives here would quietly Tim> acquiesce to importing a bit of Windows madness <wink>. We figured you and Guido would come to our rescue... 
;-) Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mal@lemburg.com Thu Nov 18 18:15:54 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:15:54 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> Message-ID: <3834425A.8E9C3B7E@lemburg.com> FYI, I've uploaded a new version of the proposal which includes new codec APIs, a new codec search mechanism and some minor fixes here and there. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: · Unicode objects support for %-formatting · Design of the internal C API and the Python API for the Unicode character properties database -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 18:32:49 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:32:49 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <38344651.960878A2@lemburg.com> Guido van Rossum wrote: > > > I guess to make ur"" have a meaning at all, we'd need to go > > the Java preprocessor way here, i.e. scan the string *only* > > for \uXXXX sequences, decode these and convert the rest as-is > > to Unicode ordinals. > > > > Would that be ok ? > > Read Tim's code (posted about 40 messages ago in this list). I did, but wasn't sure whether he was argueing for going the Java way... > Like Java, it interprets \u.... when the number of backslashes is odd, > but not when it's even. So \\u.... returns exactly that, while > \\\u.... returns two backslashes and a unicode character. > > This is nice and can be done regardless of whether we are going to > interpret other \ escapes or not. So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 18:38:41 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:38:41 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <383447B1.1B7B594C@lemburg.com> Would this definition be fine ? """ u = ur'<raw-unicode-escape encoded Python string>' The 'raw-unicode-escape' encoding is defined as follows: · \uXXXX sequence represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd · all other characters represent themselves as Unicode ordinal (e.g. 
'b' -> U+0062) """ -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@CNRI.Reston.VA.US Thu Nov 18 18:46:35 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:46:35 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST." <14388.14523.158050.594595@dolphin.mojam.com> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> <14388.14523.158050.594595@dolphin.mojam.com> Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us> > >> FYI, the next version of the proposal ... File objects opened in > >> text mode will use "t#" and binary ones use "s#". > > Tim> Am I the only one who sees magical distinctions between text and > Tim> binary mode as a Really Bad Idea? > > No. > > Tim> I wouldn't have guessed the Unix natives here would quietly > Tim> acquiesce to importing a bit of Windows madness <wink>. > > We figured you and Guido would come to our rescue... ;-) Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Thu Nov 18 18:50:20 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:50:20 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100." <38344651.960878A2@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <38344651.960878A2@lemburg.com> Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us> > > Like Java, it interprets \u.... when the number of backslashes is odd, > > but not when it's even. So \\u.... returns exactly that, while > > \\\u.... returns two backslashes and a unicode character. > > > > This is nice and can be done regardless of whether we are going to > > interpret other \ escapes or not. > > So I'll take that as: this is what we want in Python too :-) I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@CNRI.Reston.VA.US Thu Nov 18 18:57:36 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:57:36 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100." <383447B1.1B7B594C@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <383447B1.1B7B594C@lemburg.com> Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us> > Would this definition be fine ? 
> """ > > u = ur'<raw-unicode-escape encoded Python string>' > > The 'raw-unicode-escape' encoding is defined as follows: > > · \uXXXX sequence represent the U+XXXX Unicode character if and > only if the number of leading backslashes is odd > > · all other characters represent themselves as Unicode ordinal > (e.g. 'b' -> U+0062) > > """ Yes. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@mojam.com (Skip Montanaro) Thu Nov 18 19:09:46 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST) Subject: [Python-Dev] Unicode Proposal: Version 0.7 In-Reply-To: <3834425A.8E9C3B7E@lemburg.com> References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <14388.20218.294814.234327@dolphin.mojam.com> I haven't been following this discussion closely at all, and have no previous experience with Unicode, so please pardon a couple stupid questions from the peanut gallery: 1. What does U+0061 mean (other than 'a')? That is, what is U? 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter description. Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2? Seems like I would do something like u1 = unicode(s, encoding=e1) f = open("somefile", "wb") u2 = unicode(u1, encoding=e2) f.write(u2) Is that how it would be done? Does this question even make sense? 3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)? Thx, Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim@interet.com Thu Nov 18 19:23:53 1999 From: jim@interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 14:23:53 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <38345249.4AFD91DA@interet.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Yes. > Are these > requirements reasonable? Will they make an implementation too > complex? I think you can get 90% of where you want to be with something much simpler. And the simpler implementation will be useful in the 100% solution, so it is not wasted time. How about if we just design a Python archive file format; provide code in the core (in Python or C) to import from it; provide a Python program to create archive files; and provide a Standard Directory to put archives in so they can be found quickly. For extensibility and control, we add functions to the imp module. Detailed comments follow: > Compatibility issues: > --------------------- > [list of current features...] Easily met by keeping the current C code. > > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) > > - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) These tools go well beyond just an archive file format, but hopefully a file format will help. Greg and Gordon should be able to control the format so it meets their needs. We need a standard format. 
> - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages > > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) I don't like sys.path at all. It is currently part of the problem. I suggest that archive files MUST be put into a known directory. On Windows this is the directory of the executable, sys.executable. On Unix this $PREFIX plus version, namely "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). Other platforms can have different rules. We should also have the ability to append archive files to the executable or a shared library assuming the OS allows this (Windows and Linux do allow it). This is the first location searched, nails the archive to the interpreter, insulates us from an erroneous sys.path, and enables single-file Python programs. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. We don't need compression. The whole ./Lib is 1.2 Meg, and if we compress it to zero we save a Meg. Irrelevant. Installers provide compression anyway so when Python programs are shipped, they will be compressed then. Problems are that Python does not ship with compression, we will have to add it, we will have to support it and its current method of compression forever, and it adds complexity. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > [ List of new features including hooks...] Sigh, this proposal does not provide for this. It seems like a job for imputil. But if the file format and import code is available from the imp module, it can be used as part of the solution. > - support for a new compression scheme to the zip importer I guess compression should be easy to add if Python ships with a compression module. > - a cache for file locations in directories/archives, to improve > startup time If the Python library is available as an archive, I think startup will be greatly improved anyway. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. Yes. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. That's a good reason to omit compression. At least for now. > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). Yes, except that we need to be careful to preserve the freeze feature for users. We don't want to take it over. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. 
Yes, we need a function in imp to turn archives off: import imp imp.archiveEnable(0) > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I don't think it impacts these at all. Jim Ahlstrom From guido@CNRI.Reston.VA.US Thu Nov 18 19:55:02 1999 From: guido@CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 14:55:02 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST." <38345249.4AFD91DA@interet.com> References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us> > I think you can get 90% of where you want to be with something > much simpler. And the simpler implementation will be useful in > the 100% solution, so it is not wasted time. Agreed, but I'm not sure that it addresses the problems that started this thread. I can't really tell, since the message starting the thread just requested imputil, without saying which parts of it were needed. A followup claimed that imputil was a fine prototype but too slow for real work. I inferred that flexibility was requested. But maybe that was projection since that was on my own list. (I'm happy with the performance and find manipulating zip or jar files clumsy, so I'm not too concerned about all the nice things you can *do* with that flexibility. :-) > How about if we just design a Python archive file format; provide > code in the core (in Python or C) to import from it; provide a > Python program to create archive files; and provide a Standard > Directory to put archives in so they can be found quickly. For > extensibility and control, we add functions to the imp module. > Detailed comments follow: > These tools go well beyond just an archive file format, but hopefully > a file format will help. Greg and Gordon should be able to control the > format so it meets their needs. We need a standard format. I think the standard format should be a subclass of zip or jar (which is itself a subclass of zip). We have already written (at CNRI, as yet unreleased) the necessary Python tools to manipulate zip archives; moreover 3rd party tools are abundantly available, both on Unix and on Windows (as well as in Java). Zip files also lend themselves to self-extracting archives and similar things, because the file index is at the end, so I think that Greg & Gordon should be happy. > I don't like sys.path at all. It is currently part of the problem. Eh? That's the first thing I hear something bad about it. Maybe that's because you live on Windows -- on Unix, search paths are ubiquitous. > I suggest that archive files MUST be put into a known directory. Why? Maybe this works on Windows; on Unix this is asking for trouble because it prevents users from augmenting the installation provided by the sysadmin. Even on newer Windows versions, users without admin perms may not be allowed to add files to that privileged directory. > On Windows this is the directory of the executable, sys.executable. > On Unix this $PREFIX plus version, namely > "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). > Other platforms can have different rules. 
> > We should also have the ability to append archive files to the > executable or a shared library assuming the OS allows this > (Windows and Linux do allow it). This is the first location > searched, nails the archive to the interpreter, insulates us > from an erroneous sys.path, and enables single-file Python programs. OK for the executable. I'm not sure what the point is of appending an archive to the shared library? Anyway, does it matter (on Windows) if you add it to python16.dll or to python.exe? > We don't need compression. The whole ./Lib is 1.2 Meg, and if we > compress > it to zero we save a Meg. Irrelevant. Installers provide compression > anyway so when Python programs are shipped, they will be compressed > then. > > Problems are that Python does not ship with compression, we will > have to add it, we will have to support it and its current method > of compression forever, and it adds complexity. OK, OK. I think most zip tools have a way to turn off the compression. (Anyway, it's a matter of more I/O time vs. more CPU time; hardare for both is getting better faster than we can tweak the code :-) > Sigh, this proposal does not provide for this. It seems > like a job for imputil. But if the file format and import code > is available from the imp module, it can be used as part of the > solution. Well, the question is really if we want flexibility or archive files. I care more about the flexibility. If we get a clear vote for archive files, I see no problem with implementing that first. > If the Python library is available as an archive, I think > startup will be greatly improved anyway. Really? I know about all the system calls it makes, but I don't really see much of a delay -- I have a prompt in well under 0.1 second. --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein@lyra.org Thu Nov 18 22:03:55 1999 From: gstein@lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <3833B588.1E31F01B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > [MAL] > > > File objects opened in text mode will use "t#" and binary > > > ones use "s#". > > > > [Greg Stein] > > > ... > > > The real annoying thing would be to assume that opening a file as 'r' > > > means that I *meant* text mode and to start using "t#". > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > either; a lone "r" has always meant text mode. > > Em, I think you've got something wrong here: "t#" refers to the > parsing marker used for writing data to files opened in text mode. Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker. >... > I guess you won't notice any difference: strings define both > interfaces ("s#" and "t#") to mean the same thing. Only other > buffer compatible types may now fail to write to text files > -- which is not so bad, because it forces the programmer to > rethink what he really intended when opening the file in text > mode. It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down. > Besides, if you are writing portable scripts you should pay > close attention to "r" vs. "rb" anyway. I'm not writing portable scripts. I mentioned that once before. 
I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that repond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introduce text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip@mojam.com (Skip Montanaro) Thu Nov 18 22:15:43 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> References: <3833B588.1E31F01B@lemburg.com> <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <14388.31375.296388.973848@dolphin.mojam.com> Greg> I'm not writing portable scripts. I mentioned that once before. I Greg> don't want a difference between 'r' and 'rb' on my Linux box. It Greg> was never there before, I'm lazy, and I don't want to see it added Greg> :-). ... Greg> I just don't like introduce text/binary to *nix platforms where it Greg> didn't exist before. I'll vote with Greg, Guido's cross-platform conversion not withstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other harder to identify and correct portability issues than binary file access anyway. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability. porta-nanny.py-anyone?-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim@interet.com Thu Nov 18 22:40:05 1999 From: jim@interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 17:40:05 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us> Message-ID: <38348045.BB95F783@interet.com> Guido van Rossum wrote: > I think the standard format should be a subclass of zip or jar (which > is itself a subclass of zip). We have already written (at CNRI, as > yet unreleased) the necessary Python tools to manipulate zip archives; > moreover 3rd party tools are abundantly available, both on Unix and on > Windows (as well as in Java). Zip files also lend themselves to > self-extracting archives and similar things, because the file index is > at the end, so I think that Greg & Gordon should be happy. Think about multiple packages in multiple zip files. The zip files store file directories. That means we would need a sys.zippath to search the zip files. I don't want another PYTHONPATH phenomenon. Greg Stein and I once discussed this (and Gordon I think). They argued that the directories should be flattened. That is, think of all directories which can be reached on PYTHONPATH. Throw away all initial paths. The resultant archive has *.pyc at the top level, as well as package directories only. The search path is "." in every archive file. No directory information is stored, only module names, some with dots. > > I don't like sys.path at all. 
It is currently part of the problem. > > Eh? That's the first thing I hear something bad about it. Maybe > that's because you live on Windows -- on Unix, search paths are > ubiquitous. On windows, just print sys.path. It is junk. A commercial distribution has to "just work", and it fails if a second installation (by someone else) changes PYTHONPATH to suit their app. I am trying to get to "just works", no excuses, no complications. > > I suggest that archive files MUST be put into a known directory. > > Why? Maybe this works on Windows; on Unix this is asking for trouble > because it prevents users from augmenting the installation provided by > the sysadmin. Even on newer Windows versions, users without admin > perms may not be allowed to add files to that privileged directory. It works on Windows because programs install themselves in their own subdirectories, and can put files there instead of /windows/system32. This holds true for Windows 2000 also. A Unix-style installation to /windows/system32 would (may?) require "administrator" privilege. On Unix you are right. I didn't think of that because I am the Unix sysadmin here, so I can put things where I want. The Windows solution doesn't fit with Unix, because executables go in a ./bin directory and putting library files there is a no-no. Hmmmm... This needs more thought. Anyone else have ideas?? > > We should also have the ability to append archive files to the > > executable or a shared library assuming the OS allows this > > OK for the executable. I'm not sure what the point is of appending an > archive to the shared library? Anyway, does it matter (on Windows) if > you add it to python16.dll or to python.exe? The point of using python16.dll is to append the Python library to it, and append to python.exe (or use files) for everything else. That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading to 1.7 means replacing only one file, and there is no wasted storage in multiple Lib's. I am thinking of multiple Python programs in different directories. But maybe you are right. On Windows, if python.exe can be put in /windows/system32 then it really doesn't matter. > OK, OK. I think most zip tools have a way to turn off the > compression. (Anyway, it's a matter of more I/O time vs. more CPU > time; hardare for both is getting better faster than we can tweak the > code :-) Well, if Python now has its own compression that is built in and comes with it, then that is different. Maybe compression is OK. > Well, the question is really if we want flexibility or archive files. > I care more about the flexibility. If we get a clear vote for archive > files, I see no problem with implementing that first. I don't like flexibility, I like standardization and simplicity. Flexibility just encourages users to do the wrong thing. Everyone vote please. I don't have a solid feeling about what people want, only what they don't like. > > If the Python library is available as an archive, I think > > startup will be greatly improved anyway. > > Really? I know about all the system calls it makes, but I don't > really see much of a delay -- I have a prompt in well under 0.1 > second. So do I. I guess I was just echoing someone else's complaint. JimA From mal@lemburg.com Thu Nov 18 23:28:31 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 00:28:31 +0100 Subject: [Python-Dev] file modes (was: just say no...) 
References: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <38348B9F.A31B09C4@lemburg.com> Greg Stein wrote: > > On Thu, 18 Nov 1999, M.-A. Lemburg wrote: > > Tim Peters wrote: > > > [MAL] > > > > File objects opened in text mode will use "t#" and binary > > > > ones use "s#". > > > > > > [Greg Stein] > > > > ... > > > > The real annoying thing would be to assume that opening a file as 'r' > > > > means that I *meant* text mode and to start using "t#". > > > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > > either; a lone "r" has always meant text mode. > > > > Em, I think you've got something wrong here: "t#" refers to the > > parsing marker used for writing data to files opened in text mode. > > Nope. We've got it right :-) > > Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to > refer to the parse marker. Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-) > >... > > I guess you won't notice any difference: strings define both > > interfaces ("s#" and "t#") to mean the same thing. Only other > > buffer compatible types may now fail to write to text files > > -- which is not so bad, because it forces the programmer to > > rethink what he really intended when opening the file in text > > mode. > > It *is* bad if it breaks my existing programs in subtle ways that are a > bitch to track down. > > > Besides, if you are writing portable scripts you should pay > > close attention to "r" vs. "rb" anyway. > > I'm not writing portable scripts. I mentioned that once before. I don't > want a difference between 'r' and 'rb' on my Linux box. It was never there > before, I'm lazy, and I don't want to see it added :-). > > Honestly, I don't know offhand of any Python types that repond to "s#" and > "t#" in different ways, such that changing file.write would end up writing > something different (and thereby breaking existing code). > > I just don't like introduce text/binary to *nix platforms where it didn't > exist before. Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#" so you wont see any change in function or find subtle errors being introduced. If you are already using the buffer feature for e.g. array which also implement "s#" but don't support "t#" for obvious reasons you'll run into trouble, but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't published yet, there are probably very few users out there who would actually run into any problems. And even if they do, its a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Nov 18 23:41:32 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Fri, 19 Nov 1999 00:41:32 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com> Message-ID: <38348EAC.82B41A4D@lemburg.com> Skip Montanaro wrote: > > I haven't been following this discussion closely at all, and have no > previous experience with Unicode, so please pardon a couple stupid questions > from the peanut gallery: > > 1. What does U+0061 mean (other than 'a')? That is, what is U? U+XXXX means Unicode character with ordinal hex number XXXX. It is basically just another way to say, hey I want the Unicode character at position 0xXXXX in the Unicode spec. > 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter > description. Given a Unicode object with encoding e1, how do I write > it to a file that is to be encoded with encoding e2? Seems like I > would do something like > > u1 = unicode(s, encoding=e1) > f = open("somefile", "wb") > u2 = unicode(u1, encoding=e2) > f.write(u2) > > Is that how it would be done? Does this question even make sense? The unicode() constructor converts all input to Unicode as basis for other conversions. In the above example, s would be converted to Unicode using the assumption that the bytes in s represent characters encoded using the encoding given in e1. The line with u2 would raise a TypeError, because u1 is not a string. To convert a Unicode object u1 to another encoding, you would have to call the .encode() method with the intended new encoding. The Unicode object will then take care of the conversion of its internal Unicode data into a string using the given encoding, e.g. you'd write: f.write(u1.encode(e2)) > 3. What will the impact be on programmers such as myself currently > living with blinders on (that is, writing in plain old 7-bit ASCII)? If you don't want your scripts to know about Unicode, nothing will really change. In case you do use e.g. Latin-1 characters in your scripts for strings, you are asked to include a pragma in the comment lines at the beginning of the script (so that programmers viewing your code using other encoding have a chance to figure out what you've written). Here's the text from the proposal: """ Note that you should provide some hint to the encoding you used to write your programs as pragma line in one the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read you source strings too. """ Other than that you can continue to use normal strings like you always have. Hope that clarifies things at least a bit, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Fri Nov 19 00:27:09 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 19 Nov 1999 11:27:09 +1100 Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <38348B9F.A31B09C4@lemburg.com> Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat> [MAL] > If you are already using the buffer feature for e.g. 
array which > also implement "s#" but don't support "t#" for obvious reasons > you'll run into trouble, but then: arrays are binary data, > so changing from text mode to binary mode is well worth the > effort even if you just consider it a nuisance. Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. eg, the logic for processing "t#" could check if the text buffer is supported, and if not move back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users dont bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from simply "portable among all platforms" to "portable among all platforms and objects". Mark. From gmcm@hypernet.com Fri Nov 19 02:23:44 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 21:23:44 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <38348045.BB95F783@interet.com> Message-ID: <1269144272-21594530@hypernet.com> [Guido] > > I think the standard format should be a subclass of zip or jar > > (which is itself a subclass of zip). We have already written > > (at CNRI, as yet unreleased) the necessary Python tools to > > manipulate zip archives; moreover 3rd party tools are > > abundantly available, both on Unix and on Windows (as well as > > in Java). Zip files also lend themselves to self-extracting > > archives and similar things, because the file index is at the > > end, so I think that Greg & Gordon should be happy. No problem (I created my own formats for relatively minor reasons). [JimA] > Think about multiple packages in multiple zip files. The zip > files store file directories. That means we would need a > sys.zippath to search the zip files. I don't want another > PYTHONPATH phenomenon. What if sys.path looked like: [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] > Greg Stein and I once discussed this (and Gordon I think). They > argued that the directories should be flattened. That is, think > of all directories which can be reached on PYTHONPATH. Throw > away all initial paths. The resultant archive has *.pyc at the > top level, as well as package directories only. The search path > is "." in every archive file. No directory information is > stored, only module names, some with dots. While I do flat archives (no dots, but that's a different story), there's no reason the archive couldn't be structured. Flat archives are definitely simpler. [JimA] > > > I don't like sys.path at all. It is currently part of the > > > problem. [Guido] > > Eh? That's the first thing I hear something bad about it. > > Maybe that's because you live on Windows -- on Unix, search > > paths are ubiquitous. > > On windows, just print sys.path. It is junk. A commercial > distribution has to "just work", and it fails if a second > installation (by someone else) changes PYTHONPATH to suit their > app. I am trying to get to "just works", no excuses, no > complications. 
Py_Initialize (); PyRun_SimpleString ("import sys; del sys.path[1:]"); Yeah, there's a hole there. Fixable if you could do a little pre- Py_Initialize twiddling. > > > I suggest that archive files MUST be put into a known > > > directory. No way. Hard code a directory? Overwrite someone else's Python "standalone"? Write to a C: partition that is deliberately sized to hold nothing but Windows? Make network installations impossible? > > Why? Maybe this works on Windows; on Unix this is asking for > > trouble because it prevents users from augmenting the > > installation provided by the sysadmin. Even on newer Windows > > versions, users without admin perms may not be allowed to add > > files to that privileged directory. > > It works on Windows because programs install themselves in their > own subdirectories, and can put files there instead of > /windows/system32. This holds true for Windows 2000 also. A > Unix-style installation to /windows/system32 would (may?) require > "administrator" privilege. There's nothing Unix-style about installing to /Windows/system32. 'Course *they* have symbolic links that actually work... > On Unix you are right. I didn't think of that because I am the > Unix sysadmin here, so I can put things where I want. The > Windows solution doesn't fit with Unix, because executables go in > a ./bin directory and putting library files there is a no-no. > Hmmmm... This needs more thought. Anyone else have ideas?? The official Windows solution is stuff in registry about app paths and such. Putting the dlls in the exe's directory is a workaround which works and is more managable than the official solution. > > > We should also have the ability to append archive files to > > > the executable or a shared library assuming the OS allows > > > this That's a handy trick on Windows, but it's got nothing to do with Python. > > Well, the question is really if we want flexibility or archive > > files. I care more about the flexibility. If we get a clear > > vote for archive files, I see no problem with implementing that > > first. > > I don't like flexibility, I like standardization and simplicity. > Flexibility just encourages users to do the wrong thing. I've noticed that the people who think there should only be one way to do things never agree on what it is. > Everyone vote please. I don't have a solid feeling about > what people want, only what they don't like. Flexibility. You can put Christian's favorite Einstein quote here too. > > > If the Python library is available as an archive, I think > > > startup will be greatly improved anyway. > > > > Really? I know about all the system calls it makes, but I > > don't really see much of a delay -- I have a prompt in well > > under 0.1 second. > > So do I. I guess I was just echoing someone else's complaint. Install some stuff. Deinstall some of it. Repeat (mixing up the order) until your registry and hard drive are shattered into tiny little fragments. It doesn't take long (there's lots of stuff a defragmenter can't touch once it's there). - Gordon From mal@lemburg.com Fri Nov 19 09:08:44 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:08:44 +0100 Subject: [Python-Dev] file modes (was: just say no...) References: <003401bf3224$d231be30$0501a8c0@bobcat> Message-ID: <3835139C.344F3EEE@lemburg.com> Mark Hammond wrote: > > [MAL] > > > If you are already using the buffer feature for e.g. 
array which > > also implement "s#" but don't support "t#" for obvious reasons > > you'll run into trouble, but then: arrays are binary data, > > so changing from text mode to binary mode is well worth the > > effort even if you just consider it a nuisance. > > Breaking existing code that works should be considered more than a > nuisance. Its an error that pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only effect platforms that don't differ between text and binary mode, e.g. Unix ones. > However, one answer would be to have "t#" _prefer_ to use the text > buffer, but not insist on it. eg, the logic for processing "t#" could > check if the text buffer is supported, and if not move back to the > blob buffer. I doubt that this is conform to what the buffer interface want's to reflect: if the getcharbuf slot is not implemented this means "I am not text". If you would write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'. > This should mean that all existing code still works, except for > objects that support both buffers to mean different things. AFAIK > there are no objects that qualify today, so it should work fine. Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this. > Unix users _will_ need to revisit their thinking about "text mode" vs > "binary mode" when writing these new objects (such as Unicode), but > IMO that is more than reasonable - Unix users dont bother qualifying > the open mode of their files, simply because it has no effect on their > files. If for certain objects or requirements there _is_ a > distinction, then new code can start to think these issues through. > "Portable File IO" will simply be extended from simply "portable among > all platforms" to "portable among all platforms and objects". Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 19 09:56:03 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:56:03 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us> Message-ID: <38351EB3.153FCDFC@lemburg.com> Guido van Rossum wrote: > > > Like a path of search functions ? Not a bad idea... I will still > > want the internal dict for caching purposes though. I'm not sure > > how often these encodings will be, but even a few hundred function > > call will slow down the Unicode implementation quite a bit. > > Of course. (It's like sys.modules caching the results of an import). I've fixed the "path of search functions" approach in the latest version of the spec. > [...] > > def flush(self): > > > > """ Flushed the codec buffers used for keeping state. > > > > Returns values are not defined. 
Implementations are free to > > return None, raise an exception (in case there is pending > > data in the buffers which could not be decoded) or > > return any remaining data from the state buffers used. > > > > """ > > I don't know where this came from, but a flush() should work like > flush() on a file. It came from Fredrik's proposal. > It doesn't return a value, it just sends any > remaining data to the underlying stream (for output). For input it > shouldn't be supported at all. > > The idea is that flush() should do the same to the encoder state that > close() followed by a reopen() would do. Well, more or less. But if > the process were to be killed right after a flush(), the data written > to disk should be a complete encoding, and not have a lingering shift > state. Ok. I've modified the API as follows: StreamWriter: def flush(self): """ Flushes and resets the codec buffers used for keeping state. Calling this method should ensure that the data on the output is put into a clean state, that allows appending of new fresh data without having to rescan the whole stream to recover state. """ pass StreamReader: def read(self,chunksize=0): """ Decodes data from the stream self.stream and returns a tuple (Unicode object, bytes consumed). chunksize indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value 0 indicates to read and decode as much as possible. The chunksize is intended to prevent having to decode huge files in one step. The method should use a greedy read strategy meaning that it should read as much data as is allowed within the definition of the encoding and the given chunksize, e.g. if optional encoding endings or state markers are available on the stream, these should be read too. """ ... the base class should provide a default implementation of this method using self.decode ... def reset(self): """ Resets the codec buffers used for keeping state. Note that no stream repositioning should take place. This method is primarely intended to recover from decoding errors. """ pass The .reset() method replaces the .flush() method on StreamReaders. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Nov 19 09:22:48 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:22:48 +0100 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <383516E8.EE66B527@lemburg.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? Since you were asking: I would like functionality equivalent to my latest import patch for a slightly different lookup scheme for module import inside packages to become a core feature. 
If it becomes a core feature I promise to never again start threads about relative imports :-) Here's the summary again: """ [The patch] changes the default import mechanism to work like this: >>> import d # from directory a/b/c/ try a.b.c.d try a.b.d try a.d try d fail instead of just doing the current two-level lookup: >>> import d # from directory a/b/c/ try a.b.c.d try d fail As a result, relative imports referring to higher level packages work out of the box without any ugly underscores in the import name. Plus the whole scheme is pretty simple to explain and straightforward. """ You can find the patch attached to the message "Walking up the package hierarchy" in the python-dev mailing list archive. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Fri Nov 19 13:01:04 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <19991119130104.21726.rocketmail@ web605.yahoomail.com> --- "M.-A. Lemburg" <mal@lemburg.com> wrote: > Guido van Rossum wrote: > > I don't know where this came from, but a flush() > should work like > > flush() on a file. > > It came from Fredrik's proposal. > > > It doesn't return a value, it just sends any > > remaining data to the underlying stream (for > output). For input it > > shouldn't be supported at all. > > > > The idea is that flush() should do the same to the > encoder state that > > close() followed by a reopen() would do. Well, > more or less. But if > > the process were to be killed right after a > flush(), the data written > > to disk should be a complete encoding, and not > have a lingering shift > > state. > This could be useful in real life. For example, iso-2022-jp has a 'single-byte-mode' and a 'double-byte-mode' with shift-sequences to separate them. The rule is that each line in the text file or email message or whatever must begin and end in single-byte mode. So I would take flush() to mean 'shift back to ASCII now'. Calling flush and reopen would thus "almost" get the same data across. I'm trying to think if it would be dangerous. Do web and ftp servers often call flush() in the middle of transmitting a block of text? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik@pythonware.com Fri Nov 19 13:33:50 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 19 Nov 1999 14:33:50 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <19991119130104.21726.rocketmail@ web605.yahoomail.com> Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com> Andy Robinson <captainrobbo@yahoo.com> wrote: > So I would take flush() to mean 'shift back to > ASCII now'. if we're still talking about my "just one codec, please" proposal, that's exactly what encoder.flush should do. while decoder.flush should raise an ex- ception if you're still in double byte mode (at least if running in 'strict' mode). > Calling flush and reopen would thus "almost" get the > same data across. > > I'm trying to think if it would be dangerous. Do web > and ftp servers often call flush() in the middle of > transmitting a block of text? 
again, if we're talking about my proposal, these flush methods are only called by the string or stream wrappers, never by the applications. see the original post for de- tails. </F> From gstein@lyra.org Fri Nov 19 13:29:50 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, Guido van Rossum wrote: > Gordon McMillan wrote: >... > > I think imputil's emulation of the builtin importer is more of a > > demonstration than a serious implementation. As for speed, it > > depends on the test. > > Agreed. I like some of imputil's features, but I think the API > need to be redesigned. It what ways? It sounds like you've applied some thought. Do you have any concrete ideas yet, or "just a feeling" :-) I'm working through some changes from JimA right now, and would welcome other suggestions. I think there may be some outstanding stuff from MAL, but I'm not sure (Marc?) >... > So here's a challenge: redesign the import API from scratch. I would suggest starting with imputil and altering as necessary. I'll use that viewpoint below. > Let me start with some requirements. > > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility layers > can be provided in pure Python Which APIs are you referring to? The "imp" module? The C functions? The __import__ and reload builtins? I'm guessing some of imp, the two builtins, and only one or two C functions. > - support for rexec functionality No problem. I can think of a number of ways to do this. > - support for freeze functionality No problem. A function in "imp" must be exposed to Python to support this within the imputil framework. > - load .py/.pyc/.pyo files and shared libraries from files No problem. Again, a function is needed for platform-specific loading of shared libraries. > - support for packages No problem. Demo's in current imputil. > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning I would suggest that both retain their *exact* meaning. We introduce sys.importers -- a list of importers to check, in sequence. The first importer on that list uses sys.path to look for and load modules. The second importer loads builtins and frozen code (i.e. modules not on sys.path). Users can insert/append new importers or alter sys.path as before. sys.modules continues to record name:module mappings. > - $PYTHONPATH and $PYTHONHOME should still be supported No problem. > (I wouldn't mind a splitting up of importdl.c into several > platform-specific files, one of which is chosen by the configure > script; but that's a bit of a separate issue.) Easy enough. The standard importer can select the appropriate platform-specific module/function to perform the load. i.e. these can move to Modules/ and be split into a module-per-platform. > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) I don't know the specific requirements/functionality that would be required here (does Greg? :-), but I can't imagine any problem with this. > - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) Um. 
*No* problem. :-) > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages While this could easily be done, I might argue against it. Old apps/modules that process sys.path might get confused. If compatibility is not an issue, then "No problem." An alternative would be an Importer instance added to sys.importers that is configured for a specific archive (in other words, don't add the zip file to sys.path, add ZipImporter(file) to sys.importers). Another alternative is an Importer that looks at a "sys.py_archives" list. Or an Importer that has a py_archives instance attribute. > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) No problem. This will slow things down, as a stat() for *.zip and/or *.jar must be done, in addition to *.py, *.pyc, and *.pyo. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. I presume we would support whatever zlib gives us, and no more. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > - support for a new compression scheme to the zip importer Presuming ZipImporter is a class (derived from Importer), then this ability is wholly dependent upon the author of ZipImporter providing the hook. The Importer class is already designed for subclassing (and its interface is very narrow, which means delegation is also *very* easy; see imputil.FuncImporter). > - support for a new archive format, e.g. tar A cakewalk. Gordon, JimA, and myself each have archive formats. :-) > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) No problem at all. > - a hook that imports from compressed .py or .pyc/.pyo files No problem at all. > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) No problem at all. > - a cache for file locations in directories/archives, to improve > startup time No problem at all. > - a completely different source of imported modules, e.g. for an > embedded system or PalmOS (which has no traditional filesystem) No problem at all. In each of the above cases, the Importer.get_code() method just needs to grab the byte codes from the XYZ data source. That data source can be cmopressed, across a network, on-the-fly generated, or whatever. Each importer can certainly create a cache based on its concept of "location". In some cases, that would be a mapping from module name to filesystem path, or to a URL, or to a compiled-in, frozen module. > - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to recognize > .spam files and automatically translate them into .py files, and you > write a hook to support a new archive format, then if both hooks are > installed together, it should be possible to find a .spam file in an > archive and do the right thing, without any extra action. Right? Ack. Very, very difficult. The imputil scheme combines the concept of locating/loading into one step. There is only one "hook" in the imputil system. 
Its semantic is "map this name to a code/module object and return it; if you don't have it, then return None." Your compositing example is based on the capabilities of the find-then-load paradigm of the existing "ihooks.py". One module finds something (foo.spam) and the other module loads it (by generating a .py). All is not lost, however. I can easily envision the get_code() hook as allowing any kind of return type. If it isn't a code or module object, then another hook is called to transform it. [ actually, I'd design it similarly: a *series* of hooks would be called until somebody transforms the foo.spam into a code/module object. ] The compositing would be limited ony by the (Python-based) Importer classes. For example, my ZipImporter might expect to zip up .pyc files *only*. Obviously, you would want to alter this to support zipping any file, then use the suffic to determine what to do at unzip time. > - It should be possible to write hooks in C/C++ as well as Python Use FuncImporter to delegate to an extension module. This is one of the benefits of imputil's single/narrow interface. > - Applications embedding Python may supply their own implementations, > default search path, etc., but don't have to if they want to piggyback > on an existing Python installation (even though the latter is > fraught with risk, it's cheaper and easier to understand). An application would have full control over the contents of sys.importers. For a restricted execution app, it might install an Importer that loads files from *one* directory only which is configured from a specific Win32 Registry entry. That importer could also refuse to load shared modules. The BuiltinImporter would still be present (although the app would certainly omit all but the necessary builtins from the build). Frozen modules could be excluded. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. I posited once before that the cost of import is mostly I/O rather than CPU, so using Python should not be an issue. MAL demonstrated that a good design for the Importer classes is also required. Based on this, I'm a *strong* advocate of moving as much as possible into Python (to get Python's ease-of-coding with little relative cost). The (core) C code should be able to search a path for a module and import it. It does not require dynamic loading or packages. This will be used to import exceptions.py, then imputil.py, then site.py. The platform-specific module that perform dynamic-loading must be a statically linked module (in Modules/ ... it doesn't have to be in the Python/ directory). site.py can complete the bootstrap by setting up sys.importers with the appropriate Importer instances (this is where an application can define its own policy). sys.path was initially set by the import.c bootstrap code (from the compiled-in path and environment variables). Note that imputil.py would not install any hooks when it is loaded. That is up to site.py. This implies the core C code will import a total of three modules using its builtin system. After that, the imputil mechanism would be importing everything (site.py would .install() an Importer which then takes over the __import__ hook). Further note that the "import" Python statement could be simplified to use only the hook. 
However, this would require the core importer to inject some module names into the imputil module's namespace (since it couldn't use an import statement until a hook was installed). While this simplification is "neat", it complicates the run-time system (the import statement is broken until a hook is installed). Therefore, the core C code must also support importing builtins. "sys" and "imp" are needed by imputil to bootstrap. The core importer should not need to deal with dynamic-load modules. To support frozen apps, the core importer would need to support loading the three modules as frozen modules. The builtin/frozen importing would be exposed thru "imp" for use by imputil for future imports. imputil would load and use the (builtin) platform-specific module to do dynamic-load imports. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. Yes. I don't see this as a requirement, though. We wouldn't start to use these by default, would we? Or insist on zlib being present? I see this as more along the lines of "we have provided a standardized Importer to do this, *provided* you have zlib support." > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). The bootstrap that I outlined above could be done in C code. The import code would be stripped down dramatically because you'll drop package support and dynamic loading. Alternatively, you could probably do the path-scanning in Python and freeze that into the interpreter. Personally, I don't like this idea as it would not buy you much at all (it would still need to return to C for accessing a number of scanning functions and module importing funcs). > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. My outline above does not freeze anything. Everything resides in the filesystem. The C code merely needs a path-scanning loop and functions to import .py*, builtin, and frozen types of modules. If somebody nukes their imputil.py or site.py, then they return to Python 1.4 behavior where the core interpreter uses a path for importing (i.e. no packages). They lose dynamically-loaded module support. > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'm not a fan of the compositing due to it requiring a change to semantics that I believe are very useful and very clean. However, I outlined a possible, clean solution to do that (a secondary set of hooks for transforming get_code() return values). The requirements are otherwise reasonable to me, as I see that they can all be readily solved (i.e. they aren't burdensome). While this email may be long, I do not believe the resulting system would be complex. From the user-visible side of things, nothing would be changed. sys.path is still present and operates as before. They *do* have new functionality they can grow into, though (sys.importers). The underlying C code is simplified, and the platform-specific dynamic-load stuff can be distributed to distinct modules, as needed (e.g. 
BeOS/dynloadmodule.c and PC/dynloadmodule.c). > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? If the three startup files require byte-compilation, then you could have some issues (i.e. the byte-compiler must be present). Once you hit site.py, you have a "full" environment and can easily detect and import a read-eval-print loop module (i.e. why return to Python? just start things up right there). site.py can also install new optimizers as desired, a new Python-based parser or compiler, or whatever... If Python is built without a parser or compiler (I hope that's an option!), then the three startup modules would simply be frozen into the executable. Cheers, -g -- Greg Stein, http://www.lyra.org/ From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Fri Nov 19 16:30:15 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST) Subject: [Python-Dev] CVS log messages with diffs References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us> There was a suggestion to start augmenting the checkin emails to include the diffs of the checkin. This would let you keep a current snapshot of the tree without having to do a direct `cvs update'. I think I can add this without a ton of pain. It would not be optional however, and the emails would get larger (and some checkins could be very large). There's also the question of whether to generate unified or context diffs. Personally, I find context diffs easier to read; unified diffs are smaller but not by enough to really matter. So here's an informal poll. If you don't care either way, you don't need to respond. Otherwise please just respond to me and not to the list. 1. Would you like to start receiving diffs in the checkin messages? 2. If you answer `yes' to #1 above, would you prefer unified or context diffs? -Barry From bwarsaw@cnri.reston.va.us (Barry A. Warsaw) Fri Nov 19 17:04:51 1999 From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw) Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us> We had some discussion a while back about enabling thread support by default, if the underlying OS supports it obviously. I'd like to see that happen for 1.6. IIRC, this shouldn't be too hard -- just a few tweaks of the configure script (and who knows what for those minority platforms that don't use configure :). -Barry From akuchlin@mems-exchange.org Fri Nov 19 17:07:07 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us> References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us> Barry A. Warsaw writes: >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. I'd like to see That reminds me... what about the free threading patches? Perhaps they should be added to the list of issues to consider for 1.6. -- A.M. Kuchling http://starship.python.net/crew/amk/ Oh, my fingers! My arms! My legs! My everything! Argh... 
-- The Doctor, in "Nightmare of Eden" From petrilli@amber.org Fri Nov 19 17:23:02 1999 From: petrilli@amber.org (Christopher Petrilli) Date: Fri, 19 Nov 1999 12:23:02 -0500 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500 References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us> Message-ID: <19991119122302.B23400@trump.amber.org> Andrew M. Kuchling [akuchlin@mems-exchange.org] wrote: > Barry A. Warsaw writes: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. I'd like to see Yes pretty please! One of the biggest problems we have in the Zope world is that for some unknown reason, most of the Linux RPMs don't have threading on in them, so people end up having to compile it anyway... while this is a silly thing, it does create problems, and means that we deal with a lot of "dumb" problems. > That reminds me... what about the free threading patches? Perhaps > they should be added to the list of issues to consider for 1.6. My recollection was that unfortunately MOST of the time, they actually slowed down things because of the number of locks involved... Guido can no doubt shed more light onto this, but... there was a reason. Chris -- | Christopher Petrilli | petrilli@amber.org From gmcm@hypernet.com Fri Nov 19 18:22:37 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 13:22:37 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> References: Your message of "Thu, 18 Nov 1999 09:19:48 EST." <1269187709-18981857@hypernet.com> Message-ID: <1269086690-25057991@hypernet.com> [Guido] > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility > layers can be provided in pure Python Good idea. Question: we have keyword import, __import__, imp and PyImport_*. Which of those (if any) define the "core API"? [rexec, freeze: yes] > - load .py/.pyc/.pyo files and shared libraries from files Shared libraries? Might that not involve some rather shady platform-specific magic? If it can be kept kosher, I'm all for it; but I'd say no if it involved, um, undocumented features. > support for packages Absolutely. I'll just comment that the concept of package.__path__ is also affected by the next point. > > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning > > - $PYTHONPATH and $PYTHONHOME should still be supported If sys.path changes meaning, should not $PYTHONPATH also? > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. > a > module prepared by the distutil tools should install > painlessly) I assume that this is mostly a matter of $PYTHONPATH and other path manipulation mechanisms? > - Good support for prospective authors of "all-in-one" packaging > tool > authors like Gordon McMillan's win32 installer or /F's squish. > (But I *don't* require backwards compatibility for existing > tools.) I guess you've forgotten: I'm that *really* tall guy <wink>. > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a > directory; > its contents will be searched for modules or packages I don't mind this, but it depends on whether sys.path changes meaning.
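For reference, the two flavors being weighed would look roughly like this at the user level (ZipImporter is hypothetical -- it stands in for the sys.importers-style alternative mentioned earlier, not an existing class):

    import sys

    # (1) an archive directly on sys.path: path entries are no longer all
    #     directories, so old code that walks sys.path may be surprised
    sys.path.append('/usr/lib/python1.5/stdlib.zip')

    # the alternative: leave sys.path alone and register an importer object
    # on a separate list instead
    # sys.importers.append(ZipImporter('/usr/lib/python1.5/stdlib.zip'))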
> (2) a file in a directory that's on sys.path can be a zip/jar > file; > its contents will be considered as a package (note that > this is different from (1)!) But it's affected by the same considerations (e.g., do we start with filesystem names and wrap them in importers, or do we just start with importer instances / specifications for importer instances). > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip > compression in jar files, so can we. I think this is a matter of what zip compression is officially blessed. I don't mind if it's none; providing / creating zipped versions for platforms that support it is nearly trivial. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should > be easy to add any or all: > > - support for a new compression scheme to the zip importer > > - support for a new archive format, e.g. tar > > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) Which begs the question of the meaning of sys.path; and if it's still filesystem names, how do you get one of these in there? > - a hook that imports from compressed .py or .pyc/.pyo files > > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) > > - a cache for file locations in directories/archives, to > improve > startup time > > - a completely different source of imported modules, e.g. for > an > embedded system or PalmOS (which has no traditional > filesystem) > > - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to > recognize .spam files and automatically translate them into .py > files, and you write a hook to support a new archive format, > then if both hooks are installed together, it should be > possible to find a .spam file in an archive and do the right > thing, without any extra action. Right? A bit of discussion: I've got 2 kinds of archives. One can contain anything & is much like a zip (and probably should be a zip). The other contains only compressed .pyc or .pyo. The latter keys contents by logical name, not filesystem name. No extensions, and when a package is imported, the code object returned is the __init__ code object (vs returning None and letting the import mechanism come back and ask for package.__init__). When you're building an archive, you have to go thru the .py / .pyc / .pyo / is it a package / maybe compile logic anyway. Why not get it all over with, so that at runtime there are no choices to be made. Which means (for this kind of archive) that including somebody's .spam in your archive isn't a matter of a hook, but a matter of adding to the archive's build smarts (a toy sketch of that build step appears below). > - It should be possible to write hooks in C/C++ as well as Python > > - Applications embedding Python may supply their own > implementations, > default search path, etc., but don't have to if they want to > piggyback on an existing Python installation (even though the > latter is fraught with risk, it's cheaper and easier to > understand). A way of tweaking that which will become sys.path before Py_Initialize would be *most* welcome.
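The toy sketch of that "do it all at build time" step -- the archive format here (a marshalled dictionary keyed by logical module name) is invented purely for illustration and is not the actual format of anyone's tool:

    import marshal, os

    def build_archive(root_dir, out_file):
        table = {}                       # logical name -> (is_package, code object)
        for name in os.listdir(root_dir):
            path = os.path.join(root_dir, name)
            init = os.path.join(path, '__init__.py')
            if os.path.isdir(path) and os.path.exists(init):
                # a package: store its __init__ code directly under the package name
                table[name] = (1, compile(open(init).read(), init, 'exec'))
            elif name[-3:] == '.py':
                table[name[:-3]] = (0, compile(open(path).read(), path, 'exec'))
        f = open(out_file, 'wb')
        marshal.dump(table, f)           # code objects marshal cleanly
        f.close()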
> Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I > don't mind if the majority of the implementation is written in > Python. Using Python makes it easy to subclass. > > - In order to support importing from zip/jar files using > compression, > we'd at least need the zlib extension module and hence libz > itself, which may not be available everywhere. > > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to > be platform dependent). There are other possibilities here, but I have only half-formulated ideas at the moment. The critical part for embedding is to be able to *completely* control all path-related logic. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal > with the fact that exceptions.py is needed during > Py_Initialize(); I want to be able to hack on the import code > written in Python without having to rebuild the executable all > the time. > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'll summarize as follows: 1) What "sys.path" means (and how its construction can be manipulated) is critical. 2) See 1. > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I can assure you that code.py runs fine out of an archive :-). - Gordon From gstein@lyra.org Fri Nov 19 21:06:14 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> [ taking the liberty to CC: this back to python-dev ] On Fri, 19 Nov 1999, David Ascher wrote: > > > (2) a file in a directory that's on sys.path can be a zip/jar file; > > > its contents will be considered as a package (note that this is > > > different from (1)!) > > > > No problem. This will slow things down, as a stat() for *.zip and/or *.jar > > must be done, in addition to *.py, *.pyc, and *.pyo. > > Aside: it strikes me that for Python programs which import lots of files, > 'front-loading' the stat calls could make sense. When you first look at a > directory in sys.path, you read the entire directory in memory, and > successive imports do a stat on the directory to see if it's changed, and > if not use the in-memory data. Or am I completely off my rocker here? Not at all. I thought of this last night after my email. Since the Importer can easily retain state, it can hold a cache of the directory listings. If it doesn't find the file in its cached state, then it can reload the information from disk. If it finds it in the cache, but not on disk, then it can remove the item from its cache. The problem occurs when your path is [A, B], the file is in B, and you add something to A on-the-fly. The cache might direct the importer at B, missing your file. Of course, with the appropriate caveats/warnings, the system would work quite well.
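Roughly the sort of state an Importer could carry for this (a sketch only, not code from imputil):

    import os

    class DirListingCache:
        def __init__(self):
            self.listings = {}           # directory -> list of filenames

        def lookup(self, dir, filename):
            try:
                names = self.listings[dir]
            except KeyError:
                names = self.listings[dir] = os.listdir(dir)
            if filename in names:
                if os.path.exists(os.path.join(dir, filename)):
                    return 1             # cache and disk agree
                names.remove(filename)   # it vanished; drop the stale entry
                return 0
            # cache miss: re-read the directory in case the file just appeared
            names = self.listings[dir] = os.listdir(dir)
            return filename in names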
It really only breaks during development (which is one reason why I didn't accept some caching changes to imputil from MAL; but that was for the Importer in there; Python's new Importer could have a cache). I'm also not quite sure what the cost of reading a directory is, compared to issuing a bunch of stat() calls. Each directory read is an opendir/readdir(s)/closedir. Note that the DBM approach is kind of similar, but will amortize this cost over many processes. Cheers, -g -- Greg Stein, http://www.lyra.org/ From Jasbahr@origin.EA.com Fri Nov 19 20:59:11 1999 From: Jasbahr@origin.EA.com (Asbahr, Jason) Date: Fri, 19 Nov 1999 14:59:11 -0600 Subject: [Python-Dev] Another 1.6 wish Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> My first Python-Dev post. :-) >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. What's the consensus about Python microthreads -- a likely candidate for incorporation in 1.6 (or later)? Also, we have a couple minor convenience functions for Python in an MSDEV environment, an exposure of OutputDebugString for writing to the DevStudio log window and a means of tripping DevStudio C/C++ layer breakpoints from Python code (currently experimental). The msvcrt module seems like a likely candidate for these, would these be welcome additions? Thanks, Jason Asbahr Origin Systems, Inc. jasbahr@origin.ea.com From gstein@lyra.org Fri Nov 19 21:35:34 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. --1658348780-1256090628-943047334=:10639 Content-Type: TEXT/PLAIN; charset=US-ASCII On Fri, 19 Nov 1999, Barry A. Warsaw wrote: > There was a suggestion to start augmenting the checkin emails to > include the diffs of the checkin. This would let you keep a current > snapshot of the tree without having to do a direct `cvs update'. I've been using diffs-in-checkin for review, rather than to keep a local snapshot updated. I guess you use the email for this (procmail truly is frightening), but I think for most people it would be for purposes of review. >...context vs unified... > So here's an informal poll. If you don't care either way, you don't > need to respond. Otherwise please just respond to me and not to the > list. > > 1. Would you like to start receiving diffs in the checkin messages? Absolutely. > 2. If you answer `yes' to #1 above, would you prefer unified or > context diffs? Don't care. I've attached an archive of the files that I use in my CVS repository to do emailed diffs. These came from Ken Coar (an Apache guy) as an extraction from the Apache repository. Yes, they do use Perl. I'm not a Perl guy, so I probably would break things if I tried to "fix" the scripts by converting them to Python (in fact, Greg Ward helped to improve log_accum.pl for me!). I certainly would not be averse to Python versions of these files, or other cleanups. I trimmed down the "avail" file, leaving a few examples. It works with cvs_acls.pl to provide per-CVS-module read/write access control.
I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other small projects out of this repository. It has been working quite well. Cheers, -g -- Greg Stein, http://www.lyra.org/ [base64-encoded attachment "cvs-for-barry.tar.gz" omitted] From bwarsaw@python.org Fri Nov 19 21:45:14 1999 From: bwarsaw@python.org (Barry A.
Warsaw) Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs References: <14389.31511.706588.20840@anthem.cnri.reston.va.us> <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us> >>>>> "GS" == Greg Stein <gstein@lyra.org> writes: GS> I've been using diffs-in-checkin for review, rather than to GS> keep a local snapshot updated. Interesting; I hadn't though about this use for the diffs. GS> I've attached an archive of the files that I use in my CVS GS> repository to do emailed diffs. These came from Ken Coar (an GS> Apache guy) as an extraction from the Apache repository. Yes, GS> they do use Perl. I'm not a Perl guy, so I probably would GS> break things if I tried to "fix" the scripts by converting GS> them to Python (in fact, Greg Ward helped to improve GS> log_accum.pl for me!). I certainly would not be adverse to GS> Python versions of these files, or other cleanups. Well, we all know Greg Ward's one of those subversive types, but then again it's great to have (hopefully now-loyal) defectors in our camp, just to keep us honest :) Anyway, thanks for sending the code, it'll come in handy if I get stuck. Of course, my P**l skills are so rusted I don't think even an oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I can put them to. Besides, I already have a huge kludge that gets run on each commit, and I don't think it'll be too hard to add diff generation... IF the informal vote goes that way. -Barry From gmcm@hypernet.com Fri Nov 19 21:56:20 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 16:56:20 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <1269073918-25826188@hypernet.com> [David Ascher got involuntarily forwarded] > > Aside: it strikes me that for Python programs which import lots > > of files, 'front-loading' the stat calls could make sense. > > When you first look at a directory in sys.path, you read the > > entire directory in memory, and successive imports do a stat on > > the directory to see if it's changed, and if not use the > > in-memory data. Or am I completely off my rocker here? I posted something here about dircache not too long ago. Essentially, I found it completely unreliable on NT and on Linux to stat the directory. There was some test code attached. - Gordon From gstein@lyra.org Fri Nov 19 22:09:36 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <19991119122302.B23400@trump.amber.org> Message-ID: <Pine.LNX.4.10.9911191359370.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Christopher Petrilli wrote: > Andrew M. Kuchling [akuchlin@mems-exchange.org] wrote: > > Barry A. Warsaw writes: > > >We had some discussion a while back about enabling thread support by > > >default, if the underlying OS supports it obviously. I'd like to see Definitely. I think you still want a --disable-threads option, but the default really ought to include them. > Yes pretty please! One of the biggest problems we have in the Zope world > is that for some unknown reason, most of hte Linux RPMs don't have threading > on in them, so people end up having to compile it anyway... 
while this > is a silly thing, it does create problems, and means that we deal with > a lot of "dumb" problems. Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't actually had to build my own Python(!). Man... imagine that. After almost five years of using Linux/Python, I can actually rely on the OS getting it right! :-) > > That reminds me... what about the free threading patches? Perhaps > > they should be added to the list of issues to consider for 1.6. > > My recollection was that unfortunately MOST of the time, they actually > slowed down things because of the number of locks involved... Guido > can no doubt shed more light onto this, but... there was a reason. Yes, there were problems in the first round with locks and lock contention. The main issue is that a list must always use a lock to keep itself consistent. Always. There is no way for an application to say "hey, list object! I've got a higher-level construct here that guarantees there will be no cross-thread use of this list. Ignore the locking." Another issue that can't be avoided is using atomic increment/decrement for the object refcounts. Guido has already asked me about free threading patches for 1.6. I don't know if his intent was to include them, or simply to have them available for those who need them. Certainly, this time around they will be simpler since Guido folded in some of the support stuff (e.g. PyThreadState and per-thread exceptions). There are some other supporting changes that could definitely go into the core interpreter. The slow part comes when you start to add integrity locks to list, dict, etc. That is when the question on whether to include free threading comes up. Design-wise, there is a change or two that I would probably make. Note that shoving free-threading into the standard interpreter would get more eyeballs at the thing, and that people may have great ideas for reducing the overheads. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Fri Nov 19 22:11:02 1999 From: gstein@lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <Pine.LNX.4.10.9911191409570.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Asbahr, Jason wrote: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. > > What's the consensus about Python microthreads -- a likely candidate > for incorporation in 1.6 (or later)? microthreads? eh? > Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? Sure. I don't see why not. I know that I've used OutputDebugString a bazillion times from the Python layer. The breakpoint thingy... dunno, but I don't see a reason to exclude it.
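For what it's worth, the spelling available today through the win32 extensions looks roughly like this (from memory, so treat it as a sketch rather than documentation):

    import win32api
    win32api.OutputDebugString("spam.py: entering init\n")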
Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip@mojam.com (Skip Montanaro) Fri Nov 19 22:11:38 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> Message-ID: <14389.51994.809130.22062@dolphin.mojam.com> Greg> The problem occurs when you path is [A, B], the file is in B, and Greg> you add something to A on-the-fly. The cache might direct the Greg> importer at B, missing your file. Typically your path will be relatively short (< 20 directories), right? Just stat the directories before consulting the cache. If any changed since the last time the cache was built, then invalidate the entire cache (or that portion of the cached information that is downstream from the first modified directory). It's still going to be cheaper than performing listdir for each directory in the path, and like you said, only require flushes during development or installation actions. Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip@mojam.com (Skip Montanaro) Fri Nov 19 22:15:14 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <14389.52210.833368.249942@dolphin.mojam.com> Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. The modtime of the directory's stat info should only change if you add or delete entries in the directory. Were you perhaps expecting changes when other operations took place, like rewriting an existing file? Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip@mojam.com (Skip Montanaro) Fri Nov 19 22:34:42 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Fri, 19 Nov 1999 16:34:42 -0600 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <199911192234.QAA24710@dolphin.mojam.com> Gordon wrote: Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. to which I replied: Skip> The modtime of the directory's stat info should only change if you Skip> add or delete entries in the directory. Were you perhaps Skip> expecting changes when other operations took place, like rewriting Skip> an existing file? I took a couple minutes to write a simple script to check things. It created a file, changed its mode, then unlinked it. I was a bit surprised that deleting a file didn't appear to change the directory's mod time. Then I realized that since file times are only recorded with one-second precision, you might see no change to the directory's mtime in some circumstances. 
Adding a sleep to the script between directory operations resolved the apparent inconsistency. Still, as Gordon stated, you probably can't count on directory modtimes to tell you when to invalidate the cache. It's consistent, just not reliable... if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mhammond@skippinet.com.au Sat Nov 20 00:04:28 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 20 Nov 1999 11:04:28 +1100 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat> > Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? These are both available in the win32api module. They don't really fit in the "msvcrt" module, as they are not part of the C runtime library, but the win32 API itself. This is really a pointer to the fact that some or all of the win32api should be moved into the core - registry access is the thing people most want, but there are plenty of other useful things that people regularly use... Guido objects to the coding style, but hopefully that won't be a big issue. IMO, the coding style isn't "bad" - it is just more an "MS" flavour than a "Python" flavour - presumably people reading the code will have some experience with Windows, so it won't look completely foreign to them. The good thing about taking it "as-is" is that it has been fairly well bashed on over a few years, so is really quite stable. The final "coding style" issue is that there are no "doc strings" - all documentation is embedded in C comments, and extracted using a tool called "autoduck" (similar to "autodoc"). However, I'm sure we can arrange something there, too. Mark. From jcw@equi4.com Sat Nov 20 00:21:43 1999 From: jcw@equi4.com (Jean-Claude Wippler) Date: Sat, 20 Nov 1999 01:21:43 +0100 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com> Message-ID: <3835E997.8A4F5BC5@equi4.com> Skip Montanaro wrote: > [dir stat cache times] > I took a couple minutes to write a simple script to check things. It > created a file, changed its mode, then unlinked it. I was a bit > surprised that deleting a file didn't appear to change the directory's > mod time. Then I realized that since file times are only recorded > with one-second Or two, on Windows with older (FAT, as opposed to VFAT) file systems. > precision, you might see no change to the directory's mtime in some > circumstances. Adding a sleep to the script between directory > operations resolved the apparent inconsistency. Still, as Gordon > stated, you probably can't count on directory modtimes to tell you > when to invalidate the cache. It's consistent, just not reliable... > > if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, If the dir stat time is less than 2 seconds ago, flush - always.
If the dir stat time says it hasn't been changed for at least 2 seconds then you can cache all entries and trust that any change is detected. In other words: take the *current* time into account, then it can work. I think. Maybe. Until you get into network drives and clock skew... -- Jean-Claude From gmcm@hypernet.com Sat Nov 20 03:43:32 1999 From: gmcm@hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 22:43:32 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <3835E997.8A4F5BC5@equi4.com> Message-ID: <1269053086-27079185@hypernet.com> Jean-Claude wrote: > Skip Montanaro wrote: > > > [dir stat cache times] > > ... Then I realized that since > > file times are only recorded with one-second > > Or two, on Windows with older (FAT, as opposed to VFAT) file > systems. Oh lordy, it gets worse. With a time.sleep(1.0) between new files, Linux detects the change in the dir's mtime immediately. Cool. On NT, I get an average 2.0 sec delay. But sometimes it doesn't detect the change in 100 secs (and my script quits). Then I added a stat of some file in the directory before the stat of the directory (not the file I added). Now it acts just like Linux - no delay (on both FAT and NTFS partitions). OK... > I think. Maybe. Until you get into network drives and clock > skew... No success whatsoever in either direction across Samba. In fact the mtime of my Linux home directory as seen from NT is Jan 1, 1980. - Gordon From gstein@lyra.org Sat Nov 20 12:06:48 1999 From: gstein@lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST) Subject: [Python-Dev] updated imputil Message-ID: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> I've updated imputil... The main changes are that I added SysPathImporter and BuiltinImporter. I also did some restructuring to help with bootstrapping the module (remove dependence on os.py). For testing a revamped Python import system, you can import the thing and call imputil._test_revamp() to set it up. This will load normal, builtin, and frozen modules via imputil. Dynamic modules are still handled by Python, however. I ran a timing comparison of importing all modules in /usr/lib/python1.5 (using standard and imputil-based importing). The standard mechanism can do it in about 8.8 seconds. Through imputil, it does it in about 13.0 seconds. Note that I haven't profiled/optimized any of the Importer stuff (yet). The point about dynamic modules actually discovered a basic problem that I need to resolve now. The current imputil assumes that if a particular Importer loaded the top-level module in a package, then that Importer is responsible for loading all other modules within that package. In my particular test, I tried to import "xml.parsers.pyexpat". The two package modules were handled by SysPathImporter. The pyexpat module is a dynamic load module, so it is *not* handled by the Importer -- bam. Failure. Basically, each part of "xml.parsers.pyexpat" may need to use a different Importer... Off to ponder, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Sat Nov 20 12:11:37 1999 From: gstein@lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST) Subject: [Python-Dev] updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911200411060.10639-100000@nebula.lyra.org> oops... forgot: http://www.lyra.org/greg/python/imputil.py -g On Sat, 20 Nov 1999, Greg Stein wrote: > I've updated imputil... The main changes is that I added SysPathImporter > and BuiltinImporter.
I also did some restructing to help with > bootstrapping the module (remove dependence on os.py). > > For testing a revamped Python import system, you can importing the thing > and call imputil._test_revamp() to set it up. This will load normal, > builtin, and frozen modules via imputil. Dynamic modules are still > handled by Python, however. > > I ran a timing comparisons of importing all modules in /usr/lib/python1.5 > (using standard and imputil-based importing). The standard mechanism can > do it in about 8.8 seconds. Through imputil, it does it in about 13.0 > seconds. Note that I haven't profiled/optimized any of the Importer stuff > (yet). > > The point about dynamic modules actually discovered a basic problem that I > need to resolve now. The current imputil assumes that if a particular > Importer loaded the top-level module in a package, then that Importer is > responsible for loading all other modules within that package. In my > particular test, I tried to import "xml.parsers.pyexpat". The two package > modules were handled by SysPathImporter. The pyexpat module is a dynamic > load module, so it is *not* handled by the Importer -- bam. Failure. > > Basically, each part of "xml.parsers.pyexpat" may need to use a different > Importer... > > Off to ponder, > -g > > -- > Greg Stein, http://www.lyra.org/ > > > _______________________________________________ > Python-Dev maillist - Python-Dev@python.org > http://www.python.org/mailman/listinfo/python-dev > -- Greg Stein, http://www.lyra.org/ From skip@mojam.com (Skip Montanaro) Sat Nov 20 14:16:58 1999 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269053086-27079185@hypernet.com> References: <3835E997.8A4F5BC5@equi4.com> <1269053086-27079185@hypernet.com> Message-ID: <14390.44378.83128.546732@dolphin.mojam.com> Gordon> No success whatsoever in either direction across Samba. In fact Gordon> the mtime of my Linux home directory as seen from NT is Jan 1, Gordon> 1980. Ain't life grand? :-( Ah, well, it was a nice idea... S From jim@interet.com Mon Nov 22 16:43:39 1999 From: jim@interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 11:43:39 -0500 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> Message-ID: <383972BB.C65DEB26@interet.com> Greg Stein wrote: > > I would suggest that both retain their *exact* meaning. We introduce > sys.importers -- a list of importers to check, in sequence. The first > importer on that list uses sys.path to look for and load modules. The > second importer loads builtins and frozen code (i.e. modules not on > sys.path). We should retain the current order. I think it is: first builtin, next frozen, next sys.path. I really think frozen modules should be loaded in preference to sys.path. After all, they are compiled in. > Users can insert/append new importers or alter sys.path as before. I agree with Greg that sys.path should remain as it is. A list of importers can add the extra functionality. Users will probably want to adjust the order of the list. > > Implementation: > > --------------- > > > > - There must clearly be some code in C that can import certain > > essential modules (to solve the chicken-or-egg problem), but I don't > > mind if the majority of the implementation is written in Python. > > Using Python makes it easy to subclass.
> > I posited once before that the cost of import is mostly I/O rather than > CPU, so using Python should not be an issue. MAL demonstrated that a good > design for the Importer classes is also required. Based on this, I'm a > *strong* advocate of moving as much as possible into Python (to get > Python's ease-of-coding with little relative cost). Yes, I agree. And I think the main() should be written in Python. Lots of Python should be written in Python. > The (core) C code should be able to search a path for a module and import > it. It does not require dynamic loading or packages. This will be used to > import exceptions.py, then imputil.py, then site.py. But these can be frozen in (as you mention below). I dislike depending on sys.path to load essential modules. If they are not frozen in, then we need a command line argument to specify their path, with sys.path used otherwise. Jim Ahlstrom From jim@interet.com Mon Nov 22 17:25:46 1999 From: jim@interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 12:25:46 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269144272-21594530@hypernet.com> Message-ID: <38397C9A.DF6B7112@interet.com> Gordon McMillan wrote: > [JimA] > > Think about multiple packages in multiple zip files. The zip > > files store file directories. That means we would need a > > sys.zippath to search the zip files. I don't want another > > PYTHONPATH phenomenon. > > What if sys.path looked like: > [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] Well, that changes the current meaning of sys.path. > > > > I suggest that archive files MUST be put into a known > > > > directory. > > No way. Hard code a directory? Overwrite someone else's > Python "standalone"? Write to a C: partition that is > deliberately sized to hold nothing but Windows? Make > network installations impossible? Ooops. I didn't mean a known directory you couldn't change. But I did mean a directory you shouldn't change. But you are right. The directory should be configurable. But I would still like to see a highly encouraged directory. I don't yet have a good design for this. Anyone have ideas on an official way to find library files? I think a Python library file is a Good Thing, but it is not useful if the archive can't be found. I am thinking of a busy SysAdmin with someone nagging him/her to install Python. SysAdmin doesn't want another headache. What if Python becomes popular and users want it on Unix and PC's? More work! There should be a standard way to do this that just works and is dumb-stupid-simple. This is a Python promotion issue. Yes everyone here can make sys.path work, but that is not the point. > The official Windows solution is stuff in registry about app > paths and such. Putting the dlls in the exe's directory is a > workaround which works and is more managable than the > official solution. I agree completely. > > > > We should also have the ability to append archive files to > > > > the executable or a shared library assuming the OS allows > > > > this > > That's a handy trick on Windows, but it's got nothing to do > with Python. It also works on Linux. I don't know about other systems. > Flexibility. You can put Christian's favorite Einstein quote here > too. I hope we can still have ease of use with all this flexibility. As I said, we need to promote Python. Jim Ahlstrom From mal@lemburg.com Tue Nov 23 13:32:42 1999 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Tue, 23 Nov 1999 14:32:42 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <383A977A.C20E6518@lemburg.com> FYI, I've uploaded a new version of the proposal which includes the encodings package, definition of the 'raw unicode escape' encoding (available via e.g. ur""), Unicode format strings and a new method .breaklines(). The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: · Stream readers: What about .readline(), .readlines() ? These could be implemented using .read() as generic functions instead of requiring their implementation by all codecs. Also see Line Breaks. · Python interface for the Unicode property database · What other special Unicode formatting characters should be enhanced to work with Unicode input ? Currently only the following special semantics are defined: u"%s %s" % (u"abc", "abc") should return u"abc abc". Pretty quiet around here lately... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 38 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jcw@equi4.com Tue Nov 23 15:17:36 1999 From: jcw@equi4.com (Jean-Claude Wippler) Date: Tue, 23 Nov 1999 16:17:36 +0100 Subject: [Python-Dev] New thread ideas in Perl-land Message-ID: <383AB010.DD46A1FB@equi4.com> Just got a note about a paper on a new way of dealing with threads, as presented to the Perl-Porters list. The idea is described in: http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt I have no time to dive in, comment, or even judge the relevance of this, but perhaps someone else on this list wishes to check it out. The author of this is Greg London <bartels@pixelmagic.com>. -- Jean-Claude From mhammond@skippinet.com.au Tue Nov 23 22:45:14 1999 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 24 Nov 1999 09:45:14 +1100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 In-Reply-To: <383A977A.C20E6518@lemburg.com> Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat> > Pretty quiet around here lately... My guess is that most positions and opinions have been covered. It is now probably time for less talk, and more code! It is time to start an implementation plan? Do we start with /F's Unicode implementation (which /G *smirk* seemed to approve of)? Who does what? When can we start to play with it? And a key point that seems to have been thrust in our faces at the start and hardly mentioned recently - does the proposal as it stands meet our sponsor's (HP) requirements? Mark. From gstein@lyra.org Wed Nov 24 00:40:44 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST) Subject: [Python-Dev] Re: updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> <enable-ramble-mode> :-) On Sat, 20 Nov 1999, Greg Stein wrote: >... > The point about dynamic modules actually discovered a basic problem that I > need to resolve now. The current imputil assumes that if a particular > Importer loaded the top-level module in a package, then that Importer is > responsible for loading all other modules within that package. 
In my > particular test, I tried to import "xml.parsers.pyexpat". The two package > modules were handled by SysPathImporter. The pyexpat module is a dynamic > load module, so it is *not* handled by the Importer -- bam. Failure. > > Basically, each part of "xml.parsers.pyexpat" may need to use a different > Importer... I've thought about this and decided the issue is with my particular Importer, rather than the imputil design. The PathImporter traverses a set of paths and establishes a package hierarchy based on a filesystem layout. It should be able to load dynamic modules from within that filesystem area. A couple alternatives, and why I don't believe they work as well: * A separate importer to just load dynamic libraries: this would need to replicate PathImporter's mapping of Python module/package hierarchy onto the filesystem. There would also be a sequencing issue because one Importer's paths would be searched before the other's paths. Current Python import rules establishes that a module earlier in sys.path (whether a dyn-lib or not) is loaded before one later in the path. This behavior could be broken if two Importers were used. * A design whereby other types of modules can be placed into the filesystem and multiple Importers are used to load parts of the path (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This design doesn't work well because the mapping of Python module/package to the filesystem is established by PathImporter -- try to mix a "private" mapping design among Importers creates too much coupling. There is also an argument that the design is fundamentally incorrect :-). I would argue against that, however. I'm not sure what form an argument *against* imputil would be, so I'm not sure how to preempty it :-). But we can get an idea of various arguments by hypothesizing different scenarios and requireing that the imputil design satisifies them. In the above two alternatives, they were examing the use of a secondary Importer to load things out of the filesystem (and it explained why two Importers in whatever configuration is not a good thing). Let's state for argument's sake that files of some type T must be placable within the filesystem (i.e. according to the layout defined by PathImporter). We'll also say that PathImporter doesn't understand T, since the latter was designed later or is private to some app. The way to solve this is to allow PathImporter to recognize it through some configuration of the instance (e.g. self.recognized_types). A set of hooks in the PathImporter would then understand how to map files of type T to a code or module object. (alternatively, a generalized set of hooks at the Importer class level) Note that you could easily have a utility function that scans sys.importers for a PathImporter instance and adds the data to recognize a new type -- this would allow for simple installation of new types. Note that PathImporter inherently defines a 1:1 mapping from a module to a file. Archives (zip or jar files) cannot be recognized and handled by PathImporter. An archive defines an entirely different style of mapping between a module/package and a file in the filesystem. Of course, an Importer that uses archives can certainly look for them in sys.path. The imputil design is derived directly from the "import" statement. "Here is a module/package name, give me a module." (this is embodied in the get_code() method in Importer) The find/load design established by ihooks is very filesystem-based. 
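A rough sketch of the "recognized types" hook described above, where a PathImporter instance maps file suffixes to loader hooks so that a new file type T can be registered on the existing instance instead of adding a second Importer. Every name and signature below is an assumption made for illustration; this is not imputil's real interface.

    class PathImporter:
        def __init__(self):
            # suffix -> hook(pathname) returning a code or module object
            self.recognized_types = {
                '.py': self._load_source,
            }

        def add_recognized_type(self, suffix, hook):
            self.recognized_types[suffix] = hook

        def _load_source(self, pathname):
            f = open(pathname)
            text = f.read()
            f.close()
            return compile(text, pathname, 'exec')

    # A small utility could then scan sys.importers for the PathImporter
    # instance and call add_recognized_type() on it -- the "simple
    # installation of new types" mentioned above.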
In many situations, a find/load is very intertwined. If you want to take the URL case, then just examine the actual network activity -- preferably, you want a single transaction (e.g. one HTTP GET). Find/load implies two transactions. With nifty context handling between the two steps, you can get away with a single transaction. But the point is that the design requires you to get work around its inherent two-step mechanism and establish a single step. This is weird, of course, because importing is never *just* a find or a load, but always both. Well... since I've satisfied to myself that PathImporter needs to load dynamic lib modules, I'm off to code it... Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed Nov 24 01:45:29 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST) Subject: [Python-Dev] breaking out code for dynamic loading Message-ID: <Pine.LNX.4.10.9911231731000.10639-100000@nebula.lyra.org> Guido, I can't find the message, but it seems that at some point you mentioned wanting to break out importdl.c into separate files. The configure process could then select the appropriate one to use for the platform. Sounded great until I looked at importdl.c. There are a 13 variants of dynamic loading. That would imply 13 separate files/modules. I'd be happy to break these out, but are you actually interested in that many resulting modules? If so, then any suggestions for naming? (e.g. aix_dynload, win32_dynload, mac_dynload) Here are the variants: * NeXT, using FVM shlibs (USE_RLD) * NeXT, using frameworks (USE_DYLD) * dl / GNU dld (USE_DL) * SunOS, IRIX 5 shared libs (USE_SHLIB) * AIX dynamic linking (_AIX) * Win32 platform (MS_WIN32) * Win16 platform (MS_WIN16) * OS/2 dynamic linking (PYOS_OS2) * Mac CFM (USE_MAC_DYNAMIC_LOADING) * HP/UX dyn linking (hpux) * NetBSD shared libs (__NetBSD__) * FreeBSD shared libs (__FreeBSD__) * BeOS shared libs (__BEOS__) Could I suggest a new top-level directory in the Python distribution named "Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new directories for each of the above platforms and move the appropriate portion of importdl.c into there as a Python C Extension Module. (the module would still be statically linked into the interpreter!) ./configure could select the module and write a Setup.dynload, much like it does with Setup.thread. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed Nov 24 02:43:50 1999 From: gstein@lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST) Subject: [Python-Dev] another round of imputil work completed In-Reply-To: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231837480.10639-100000@nebula.lyra.org> On Tue, 23 Nov 1999, Greg Stein wrote: >... > Well... since I've satisfied to myself that PathImporter needs to load > dynamic lib modules, I'm off to code it... All right. imputil.py now comes with code to emulate the builtin Python import mechanism. It loads all the same types of files, uses sys.path, and (pointed out by JimA) loads builtins before looking on the path. The only "feature" it doesn't support is using package.__path__ to look for submodules. I never liked that thing, so it isn't in there. (imputil *does* set the __path__ attribute, tho) Code is available at: http://www.lyra.org/greg/python/imputil.py Next step is to add a "standard" library/archive format. JimA and I have been tossing some stuff back and forth on this. 
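For anyone wanting to try the emulation, the earlier message in this thread gave the entry point as imputil._test_revamp(); whether that is still the hook in the revision announced here is an assumption, but the usage would be along these lines:

    import imputil
    imputil._test_revamp()     # install the Importer-based mechanism
    import string              # later imports now go through imputil
    print string.uppercase

The one caveat noted above is that package __path__ searching is not supported.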
Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Wed Nov 24 08:34:52 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 09:34:52 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <002301bf3604$68fd8f00$0501a8c0@bobcat> Message-ID: <383BA32C.2E6F4780@lemburg.com> Mark Hammond wrote: > > > Pretty quiet around here lately... > > My guess is that most positions and opinions have been covered. It is > now probably time for less talk, and more code! Or that everybody is on holidays... like Guido. > It is time to start an implementation plan? Do we start with /F's > Unicode implementation (which /G *smirk* seemed to approve of)? Who > does what? When can we start to play with it? This depends on whether HP agrees on the current specs. If they do, there should be code by mid December, I guess. > And a key point that seems to have been thrust in our faces at the > start and hardly mentioned recently - does the proposal as it stands > meet our sponsor's (HP) requirements? Haven't heard anything from them yet (this is probably mainly due to Guido being offline). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Nov 24 09:32:46 1999 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 10:32:46 +0100 Subject: [Python-Dev] Import Design Message-ID: <383BB0BE.BF116A28@lemburg.com> Before hooking on to some more PathBuiltinImporters ;-), I'd like to spawn a thread leading in a different direction... There has been some discussion on what we really expect of the import mechanism to be able to do. Here's a summary of what I think we need: * compatibility with the existing import mechanism * imports from library archives (e.g. .pyl or .par-files) * a modified intra package import lookup scheme (the thingy which I call "walk-me-up-Scotty" patch -- see previous posts) And for some fancy stuff: * imports from URLs (e.g. these could be put on the path for automatic inclusion in the import scan or be passed explicitly to __import__) * a (file based) static lookup cache to enhance lookup performance which is enabled via a command line switch (rather than being enabled per default), so that the user can decide whether to apply this optimization or not The point I want to make is: there aren't all that many features we are really looking for, so why not incorporate these into the builtin importer and only *then* start thinking about schemes for hooks, managers, etc. ?! -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Wed Nov 24 11:40:16 1999 From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST) Subject: [Python-Dev] Unicode Proposal: Version 0.8 Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com> --- Mark Hammond <mhammond@skippinet.com.au> wrote: > > Pretty quiet around here lately... > > My guess is that most positions and opinions have > been covered. It is > now probably time for less talk, and more code! > > It is time to start an implementation plan? Do we > start with /F's > Unicode implementation (which /G *smirk* seemed to > approve of)? Who > does what? When can we start to play with it? 
> > And a key point that seems to have been thrust in > our faces at the > start and hardly mentioned recently - does the > proposal as it stands > meet our sponsor's (HP) requirements? > > Mark. I had a long chat with them on Friday :-) They want it done, but nobody is actively working on it now as far as I can tell, and they are very busy. The per-thread thing was a red herring - they just want to be able to do (for example) web servers handling different encodings from a central unicode database, so per-output-stream works just fine. They will be at IPC8; I'd suggest that a round of prototyping, we insist they read it and then discuss it at IPC8, and be prepared to rework things thereafter are important. Hopefully then we'll have a plan on how to tackle the much larger (but less interesting to python-dev) job of writing and verifying all the codecs and utilities. Andy Robinson ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Thousands of Stores. Millions of Products. All in one place. Yahoo! Shopping: http://shopping.yahoo.com From jim@interet.com Wed Nov 24 14:43:57 1999 From: jim@interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 09:43:57 -0500 Subject: [Python-Dev] Re: updated imputil References: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <383BF9AD.E183FB98@interet.com> Greg Stein wrote: > * A separate importer to just load dynamic libraries: this would need to > replicate PathImporter's mapping of Python module/package hierarchy onto > the filesystem. There would also be a sequencing issue because one > Importer's paths would be searched before the other's paths. Current > Python import rules establishes that a module earlier in sys.path > (whether a dyn-lib or not) is loaded before one later in the path. This > behavior could be broken if two Importers were used. I would like to argue that on Windows, import of dynamic libraries is broken. If a file something.pyd is imported, then sys.path is searched to find the module. If a file something.dll is imported, the same thing happens. But Windows defines its own search order for *.dll files which Python ignores. I would suggest that this is wrong for files named *.dll, but OK for files named *.pyd. A SysAdmin should be able to install and maintain *.dll as she has been trained to do. This makes maintaining Python installations simpler and more un-surprising. I have no solution to the backward compatibilty problem. But the code is only a couple lines. A LoadLibrary() call does its own path searching. Jim Ahlstrom From jim@interet.com Wed Nov 24 15:06:17 1999 From: jim@interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 10:06:17 -0500 Subject: [Python-Dev] Import Design References: <383BB0BE.BF116A28@lemburg.com> Message-ID: <383BFEE9.B4FE1F19@interet.com> "M.-A. Lemburg" wrote: > The point I want to make is: there aren't all that many features > we are really looking for, so why not incorporate these into > the builtin importer and only *then* start thinking about > schemes for hooks, managers, etc. ?! Marc has made this point before, and I think it should be considered carefully. It is a lot of work to re-create the current import logic in Python and it is almost guaranteed to be slower. So why do it? I like imputil.py because it leads to very simple Python installations. I view this as a Python promotion issue. 
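To make the archive idea concrete, here is a toy sketch of the archive-based importer this thread keeps circling around: one library file mapping module names to source text, consulted like any other importer on the list. The file format, class name and method below are invented for illustration; this is not the .pyl/.pyz design actually being worked on.

    import imp, marshal, sys

    class ToyArchiveImporter:
        # Assumed archive format: a marshalled dict {module name: source text}.
        def __init__(self, archive_path):
            self.archive_path = archive_path
            f = open(archive_path, 'rb')
            self.contents = marshal.load(f)
            f.close()

        def import_module(self, name):
            source = self.contents.get(name)
            if source is None:
                return None                      # not in this archive
            module = imp.new_module(name)
            module.__file__ = self.archive_path + '#' + name
            exec compile(source, module.__file__, 'exec') in module.__dict__
            sys.modules[name] = module
            return module

A single such file dropped into the Python directory is exactly the "just works" installation story being argued for; the open question is where that directory lives and who gets to configure it.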
If we have a boot mechanism plus archive files, we can have few-file Python installations with package addition being just adding another file. But at least some of this code must be in C. I volunteer to write the rest of it in C if that is what people want. But it would add two hundred more lines of code to import.c. So maybe now is the time to switch to imputil, instead of waiting for later. But I am indifferent as long as I can tell a Python user to just put an archive file libpy.pyl in his Python directory and everything will Just Work. Jim Ahlstrom From bwarsaw@python.org (Barry Warsaw) Tue Nov 30 20:23:40 1999 From: bwarsaw@python.org (Barry Warsaw) (Barry Warsaw) Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST) Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us> Hello Python Developers! Thursday January 27 2000, the final day of the 8th International Python Conference is Developers' Day, where Python hackers get together to discuss and reach agreements on the outstanding issues facing Python. This is also your once-a-year chance for face-to-face interactions with Python's creator Guido van Rossum and other experienced Python developers. To make Developers' Day a success, we need you! We're looking for a few good champions to lead topic sessions. As a champion, you will choose a topic that fires you up and write a short position paper for publication on the web prior to the conference. You'll also prepare introductory material for the topic overview session, and lead a 90 minute topic breakout group. We've had great champions and topics in previous years, and many features of today's Python had their start at past Developers' Days. This is your chance to help shape the future of Python for 1.6, 2.0 and beyond. If you are interested in becoming a topic champion, you must email me by Wednesday December 15, 1999. For more information, please visit the IPC8 Developers' Day web page at <http://www.python.org/workshops/2000-01/devday.html> This page has more detail on schedule, suggested topics, important dates, etc. To volunteer as a champion, or to ask other questions, you can email me at bwarsaw@python.org. -Barry From mal at lemburg.com Mon Nov 1 00:00:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 00:00:55 +0100 Subject: [Python-Dev] Misleading syntax error text References: <1270838575-13870925@hypernet.com> Message-ID: <381CCA27.59506CF6@lemburg.com> [Extracted from the psa-members list...] Gordon McMillan wrote: > > Chris Fama wrote, > > And now the rub: the exact same function definition has passed > > through byte-compilation perfectly OK many times before with no > > problems... of course, this points rather clearly to the > > preceding code, but it illustrates a failing in Python's syntax > > error messages, and IMHO a fairly serious one at that, if this is > > indeed so. > > My simple experiments refuse to compile a "del getattr(..)" at > all. Hmm, it seems to be a failry generic error: >>> del f(x,y) SyntaxError: can't assign to function call How about chainging the com_assign_trailer function in Python/compile.c to: static void com_assign_trailer(c, n, assigning) struct compiling *c; node *n; int assigning; { REQ(n, trailer); switch (TYPE(CHILD(n, 0))) { case LPAR: /* '(' [exprlist] ')' */ com_error(c, PyExc_SyntaxError, assigning ? "can't assign to function call": "can't delete expression"); break; case DOT: /* '.' 
NAME */ com_assign_attr(c, CHILD(n, 1), assigning); break; case LSQB: /* '[' subscriptlist ']' */ com_subscriptlist(c, CHILD(n, 1), assigning); break; default: com_error(c, PyExc_SystemError, "unknown trailer type"); } } or something along those lines... BTW, has anybody tried my import patch recently ? I haven't heard any citicism since posting it and wonder what made the list fall asleep over the topic :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 61 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Mon Nov 1 02:51:56 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon, 1 Nov 1999 12:51:56 +1100 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? Message-ID: <002301bf240b$ae61fa00$0501a8c0@bobcat> I have for some time been wondering about the usefulness of this mailing list. It seems to have produced staggeringly few results since inception. This is not a critisism of any individual, but of the process. It is proof in my mind of how effective the benevolent dictator model is, and how ineffective a language run by committee would be. This "committee" never seems to be capable of reaching a consensus on anything. A number of issues dont seem to provoke any responses. As a result, many things seem to die a slow and lingering death. Often there is lots of interesting discussion, but still precious few results. In the pre python-dev days, the process seemed easier - we mailed Guido directly, and he either stated "yea" or "nay" - maybe we didnt get the response we hoped for, but at least we got a response. Now, we have the result that even if Guido does enter into a thread, the noise seems to drown out any hope of getting anything done. Guido seems to be faced with the dilemma of asserting his dictatorship in the face of many dissenting opinions from many people he respects, or putting it in the too hard basket. I fear the latter is the easiest option. At the end of this mail I list some of the major threads over the last few months, and can't see a single thread that has resulted in a CVS checkin, and only one that has resulted in agreement. This, to my mind at least, is proof that things are really not working. I long for the "good old days" - take the replacement of "ni" with built-in functionality, for example. I posit that if this was discussed on python-dev, it would have caused a huge flood of mail, and nothing remotely resembling a consensus. Instead, Guido simply wrote an essay and implemented some code that he personally liked. No debate, no discussion. Still an excellent result. Maybe not a perfect result, but a result nonetheless. However, Guido's time is becoming increasingly limited. So should we consider moving to a "benevolent lieutenent" model, in conjunction with re-ramping up the SIGS? This would provide 2 ways to get things done: * A new SIG. Take relative imports, for example. If we really do need a change in this fairly fundamental area, a SIG would be justified ("import-sig"). The responsibility of the SIG is to form a consensus (and code that reflects it), and report back to Guido (and the main newsgroup) with the result of this. It worked well for RE, and allowed those of us not particularly interested to keep out of the debate. If the SIG can not form consensus, then tough - it dies - and should not be mourned. 
Presumably Guido would keep a watchful eye over the SIG, providing direction where necessary, but in general stay out of the day to day traffic. New SIGs seem to have stopped since this list creation, and it seems that issues that should be discussed in new SIGS are now discussed here. * Guido could delegate some of his authority to a single individual responsible for a certain limited area - a benevolent lieutenent. We may have a lieutentant responsible for different areas, and could only exercise their authority with small, trivial changes. Eg, the "getopt helper" thread - if a lieutenant was given authority for the "standard library", they could simply make a yea or nay decision, and present it to Guido. Presumably Guido trusts this person he delegated to enough that the majority of the lieutenant's recommendations would be accepted. Presumably there would be a small number of lieutentants, and they would then become the new "python-dev" - say up to 5 people. This list then discusses high level strategies and seek direction from each other when things get murky. This select group of people may not (indeed, probably would not) include me, but I would have no problem with that - I would prefer to see results achieved than have my own ego stroked by being included in a select, but ineffective group. In parting, I repeat this is not a direct critisism, simply an observation of the last few months. I am on this list, so I am definately as guilty as any one else - which is "not at all" - ie, no one is guilty, I simply see it as endemic to a committee with people of diverse backgrounds, skills and opinions. Any thoughts? Long live the dictator! :-) Mark. Recent threads, and my take on the results: * getopt helper? Too much noise regarding semantic changes. * Alternative Approach to Relative Imports * Relative package imports * Path hacking * Towards a Python based import scheme Too much noise - no one could really agree on the semantics. Implementation thrown in the ring, and promptly forgotten. * Corporate installations Very young, but no result at all. * Embedding Python when using different calling conventions Quite young, but no result as yet, and I have no reason to believe there will be. * Catching "return" and "return expr" at compile time Seemed to be blessed - yay! Dont believe I have seen a check-in yet. * More Python command-line features Seemed general agreement, but nothing happened? * Tackling circular dependencies in 2.0? Lots of noise, but no results other than "GC may be there in 2.0" * Buffer interface in abstract.c Determined it could break - no solution proposed. Lots of noise regarding if is is a good idea at all! * mmapfile module No result. * Quick-and-dirty weak references No result. * Portable "spawn" module for core? No result. * Fake threads Seemed to spawn stackless Python, but in the face of Guido being "at best, lukewarm" about this issue, I would again have to conclude "no result". An authorative "no" in this area may have saved lots of effort and heartache. * add Expat to 1.6 No result. * I'd like list.pop to accept an optional second argument giving a default value No result * etc No result. From jack at oratrix.nl Mon Nov 1 10:56:48 1999 From: jack at oratrix.nl (Jack Jansen) Date: Mon, 01 Nov 1999 10:56:48 +0100 Subject: [Python-Dev] Embedding Python when using different calling conventions. In-Reply-To: Message by "M.-A. 
Lemburg" <mal@lemburg.com> , Sat, 30 Oct 1999 10:46:30 +0200 , <381AB066.B54A47E0@lemburg.com> Message-ID: <19991101095648.DC2E535BB1E@snelboot.oratrix.nl> > OTOH, we could take chance to reorganize these macros from bottom > up: when I started coding extensions I found them not very useful > mostly because I didn't have control over them meaning "export > this symbol" or "import the symbol". Especially the DL_IMPORT > macro is strange because it seems to handle both import *and* > export depending on whether Python is compiled or not. This would be very nice. The DL_IMPORT/DL_EXPORT stuff is really weird unless you're working with it all the time. We were trying to build a plugin DLL for PythonWin and first you spend hours finding out that you have to set DL_IMPORT (and how to set it), and the you spend another few hours before you realize that you can't simply copy the DL_IMPORT and DL_EXPORT from, say, timemodule.c because timemodule.c is going to be in the Python core (and hence can use DL_IMPORT for its init() routine declaration) while your module is going to be a plugin so it can't. I would opt for a scheme where the define shows where the symbols is expected to live (DL_CORE and DL_THISMODULE would be needed at least, but probably one or two more for .h files). -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From jack at oratrix.nl Mon Nov 1 11:12:37 1999 From: jack at oratrix.nl (Jack Jansen) Date: Mon, 01 Nov 1999 11:12:37 +0100 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: Message by "Mark Hammond" <mhammond@skippinet.com.au> , Mon, 1 Nov 1999 12:51:56 +1100 , <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl> I think I agree with Mark's post, although I do see a little more light (the relative imports dicussion resulted in working code, for instance). The benevolent lieutenant idea may work, _if_ the lieutenants can be found. I myself will quickly join Mark in wishing the new python-dev well and abandoning ship (half a :-). If that doesn't work maybe we should try at the very least to create a "memory". If you bring up a subject for discussion and you don't have working code that's fine the first time. But if anyone brings it up a second time they're supposed to have code. That way at least we won't be rehashing old discussions (as happend on the python-list every time, with subjects like GC or optimizations). And maybe we should limit ourselves in our replies: don't speak up too much in discussions if you're not going to write code. I know that I'm pretty good at answering with my brilliant insights to everything myself:-). It could well be that refining and refining the design (as in the getopt discussion) results in such a mess of opinions that no-one has the guts to write the code anymore. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Mon Nov 1 12:09:21 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Mon, 01 Nov 1999 12:09:21 +0100 Subject: [Python-Dev] dircache.py References: <1270737688-19939033@hypernet.com> Message-ID: <381D74E0.1AE3DA6A@lemburg.com> Gordon McMillan wrote: > > Pursuant to my volunteering to implement Guido's plan to > combine cmp.py, cmpcache.py, dircmp.py and dircache.py > into filecmp.py, I did some investigating of dircache.py. > > I find it completely unreliable. On my NT box, the mtime of the > directory is updated (on average) 2 secs after a file is added, > but within 10 tries, there's always one in which it takes more > than 100 secs (and my test script quits). My Linux box hardly > ever detects a change within 100 secs. > > I've tried a number of ways of testing this ("this" being > checking for a change in the mtime of the directory), the latest > of which is below. Even if dircache can be made to work > reliably and surprise-free on some platforms, I doubt it can be > done cross-platform. So I'd recommend that it just get dropped. > > Comments? Note that you'll have to flush and close the tmp file to actually have it written to the file system. That's why you are not seeing any new mtimes on Linux. Still, I'd suggest declaring it obsolete. Filesystem access is usually cached by the underlying OS anyway, so adding another layer of caching on top of it seems not worthwhile (plus, the OS knows better when and what to cache). Another argument against using stat() time entries for caching purposes is the resolution of 1 second. It makes the dircache.py unreliable per se for fast changing directories. The problem is most probably even worse for NFS and on Samba mounted WinXX filesystems the mtime trick doesn't work at all (stat() returns the creation time for atime, mtime and ctime). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward at cnri.reston.va.us Mon Nov 1 14:28:51 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Mon, 1 Nov 1999 08:28:51 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100 References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <19991101082851.A16952@cnri.reston.va.us> On 01 November 1999, Mark Hammond said: > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. Perhaps this is an indication of stability rather than stagnation. Of course we can't have *total* stability or Python 1.6 will never appear, but... > * Portable "spawn" module for core? > No result. ...I started this little thread to see if there was any interest, and to find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list of strings as command-line arguments" makes any sense at all on the Mac without actually having to go learn about the Mac. The result: if 'spawn()' is added to the core, it should probably be 'os.spawn()', but it's not really clear if this is necessary or useful to many people; and, no, it doesn't make sense on the Mac. That answered my questions, so I don't really see the thread as a failure. I might still turn the distutils.spawn module into an appendage of the os module, but there doesn't seem to be a compelling reason to do so. Not every thread has to result in working code. 
In other words, negative results are results too. Greg From skip at mojam.com Mon Nov 1 17:58:41 1999 From: skip at mojam.com (Skip Montanaro) Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST) Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat> References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <14365.50881.778143.590205@dolphin.mojam.com> Mark> * Catching "return" and "return expr" at compile time Mark> Seemed to be blessed - yay! Dont believe I have seen a check-in Mark> yet. I did post a patch to compile.c here and to the announce list. I think the temporal distance between the furor in the main list and when it appeared "in print" may have been a problem. Also, as the author of that code I surmised that compile.c was the wrong place for it. I would have preferred to see it in some Python code somewhere, but there's no obvious place to put it. Finally, there is as yet no convention about how to handle warnings. (Maybe some sort of PyLint needs to be "blessed" and made part of the distribution.) Perhaps python-dev would be good to generate SIGs, sort of like a hurricane spinning off tornadoes. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From guido at CNRI.Reston.VA.US Mon Nov 1 19:41:32 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 01 Nov 1999 13:41:32 -0500 Subject: [Python-Dev] Misleading syntax error text In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100." <381CCA27.59506CF6@lemburg.com> References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us> > How about chainging the com_assign_trailer function in Python/compile.c > to: Please don't use the python-dev list for issues like this. The place to go is the python-bugs database (http://www.python.org/search/search_bugs.html) or you could just send me a patch (please use a context diff and include the standard disclaimer language). --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Mon Nov 1 20:06:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 20:06:39 +0100 Subject: [Python-Dev] Misleading syntax error text References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us> Message-ID: <381DE4BF.951B03F0@lemburg.com> Guido van Rossum wrote: > > > How about chainging the com_assign_trailer function in Python/compile.c > > to: > > Please don't use the python-dev list for issues like this. The place > to go is the python-bugs database > (http://www.python.org/search/search_bugs.html) or you could just send > me a patch (please use a context diff and include the standard disclaimer > language). This wasn't really a bug report... I was actually looking for some feedback prior to sending a real (context) patch. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jim at interet.com Tue Nov 2 16:43:56 1999 From: jim at interet.com (James C. Ahlstrom) Date: Tue, 02 Nov 1999 10:43:56 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? 
References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <381F06BC.CC2CBFBD@interet.com> Mark Hammond wrote: > > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. I appreciate the points you made, but I think this list is still a valuable place to air design issues. I don't want to see too many Python core changes anyway. Just my 2.E-2 worth. Jim Ahlstrom From Vladimir.Marangozov at inrialpes.fr Wed Nov 3 23:34:44 1999 From: Vladimir.Marangozov at inrialpes.fr (Vladimir Marangozov) Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT) Subject: [Python-Dev] paper available Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip Since there may be legal problems with LNCS, I will disable the link shortly (so those of you who have not received a copy and are interested in reading it, please grab it quickly) If prof. Saltzer agrees (and if he can, legally) put it on his web page, I guess that the paper will show up at http://mit.edu/saltzer/ Jeremy, could you please check this with prof. Saltzer? (This version might need some corrections due to the OCR process, despite that I've made a significant effort to clean it up) -- Vladimir MARANGOZOV | Vladimir.Marangozov at inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252 From guido at CNRI.Reston.VA.US Thu Nov 4 21:58:53 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 04 Nov 1999 15:58:53 -0500 Subject: [Python-Dev] wish list Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us> I got the wish list below. Anyone care to comment on how close we are on fulfilling some or all of this? --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Thu, 04 Nov 1999 20:26:54 +0700 From: "Claudio Ram?n" <rmn70 at hotmail.com> To: guido at python.org Hello, I'm a python user (excuse my english, I'm spanish and...). I think it is a very complete language and I use it in solve statistics, phisics, mathematics, chemistry and biology problemns. I'm not an experienced programmer, only a scientific with problems to solve. The motive of this letter is explain to you a needs that I have in the python use and I think in the next versions... * GNU CC for Win32 compatibility (compilation of python interpreter and "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative eviting the cygwin dll user. * Add low level programming capabilities for system access and speed of code fragments eviting the C-C++ or Java code use. Python, I think, must be a complete programming language in the "programming for every body" philosofy. * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI in the standard distribution. For example, Wxpython permit an html browser. It is very importan for document presentations. And Wxwindows and Gtk+ are faster than tk. * Incorporate a database system in the standard library distribution. To be possible with relational and documental capabilites and with import facility of DBASE, Paradox, MSAccess files. * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to be possible with XML how internal file format). And to be possible with Microsoft Word import export facility. For example, AbiWord project can be an alternative but if lacks programming language. 
If we can make python the programming language for AbiWord project... Thanks. Ram?n Molina. rmn70 at hotmail.com ______________________________________________________ Get Your Private, Free Email at http://www.hotmail.com ------- End of Forwarded Message From skip at mojam.com Thu Nov 4 22:06:53 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62829.389307.377095@dolphin.mojam.com> * Incorporate a database system in the standard library distribution. To be possible with relational and documental capabilites and with import facility of DBASE, Paradox, MSAccess files. I know Digital Creations has a dbase module knocking around there somewhere. I hacked on it for them a couple years ago. You might see if JimF can scrounge it up and donate it to the cause. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake at acm.org Thu Nov 4 22:08:26 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us> Guido van Rossum writes: > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? Claudio Ram?n <rmn70 at hotmail.com> wrote: > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. And GTK+ looks better, too. ;-) None the less, I don't think GTK+ is as solid or mature as Tk. There are still a lot of oddities, and several warnings/errors get messages printed on stderr/stdout (don't know which) rather than raising exceptions. (This is a failing of GTK+, not PyGTK.) There isn't an equivalent of the Tk text widget, which is a real shame. There are people working on something better, but it's not a trivial project and I don't have any idea how its going. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Doesn't sound like part of a core library really, though I could see combining the Win32 extensions with the core package to produce a single installable. That should at least provide access to MSAccess, and possible the others, via ODBC. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I think this would be great to have. But I wouldn't put the editor/browser in the core. I would stick something like the XML-SIG's package in, though, once that's better polished. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From jim at interet.com Fri Nov 5 01:09:40 1999 From: jim at interet.com (James C. 
Ahlstrom) Date: Thu, 04 Nov 1999 19:09:40 -0500 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <38222044.46CB297E@interet.com> Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I don't know what this means. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. Python, I think, must be a > complete programming language in the "programming for every body" philosofy. I don't know what this means in practical terms either. I use the C interface for this. > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. As a Windows user, I don't feel comfortable publishing GUI code based on these tools. Maybe they have progressed and I should look at them again. But I doubt the Python world is going to standardize on a single GUI anyway. Does anyone out there publish Windows Python code with a Windows Python GUI? If so, what GUI toolkit do you use? Jim Ahlstrom From rushing at nightmare.com Fri Nov 5 08:22:22 1999 From: rushing at nightmare.com (Sam Rushing) Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <668469884@toto.iv> Message-ID: <14370.34222.884193.260990@seattle.nightmare.com> James C. Ahlstrom writes: > Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > > on fulfilling some or all of this? > > > * GNU CC for Win32 compatibility (compilation of python interpreter and > > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > > eviting the cygwin dll user. > > I don't know what this means. mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying to be unix. It links against crtdll, so for example it can generate small executables that run on any win32 platform. Also, an alternative to plunking down money ever year to keep up with MSVC++ I used to use mingw32 a lot, and it's even possible to set up egcs to cross-compile to it. At one point using egcs on linux I was able to build a stripped-down python.exe for win32... http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/ -Sam From jim at interet.com Fri Nov 5 15:04:59 1999 From: jim at interet.com (James C. Ahlstrom) Date: Fri, 05 Nov 1999 09:04:59 -0500 Subject: [Python-Dev] wish list References: <14370.34222.884193.260990@seattle.nightmare.com> Message-ID: <3822E40B.99BA7CA0@interet.com> Sam Rushing wrote: > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > to be unix. It links against crtdll, so for example it can generate OK, thanks. But I don't believe this is something that Python should pursue. Binaries are available for Windows and Visual C++ is widely available and has a professional debugger (etc.). 
Jim Ahlstrom From skip at mojam.com Fri Nov 5 18:17:58 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST) Subject: [Python-Dev] paper available In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr> References: <199911032234.XAA26442@pukapuka.inrialpes.fr> Message-ID: <14371.4422.96832.498067@dolphin.mojam.com> Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip I downloaded it and took a very quick peek at it, but it's applicability to Python wasn't immediately obvious to me. Did you download it in response to some other thread I missed somewhere? Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From gstein at lyra.org Fri Nov 5 23:19:49 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <3822E40B.99BA7CA0@interet.com> Message-ID: <Pine.LNX.4.10.9911051418330.32496-100000@nebula.lyra.org> On Fri, 5 Nov 1999, James C. Ahlstrom wrote: > Sam Rushing wrote: > > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > > to be unix. It links against crtdll, so for example it can generate > > OK, thanks. But I don't believe this is something that > Python should pursue. Binaries are available for Windows > and Visual C++ is widely available and has a professional > debugger (etc.). If somebody is willing to submit patches, then I don't see a problem with it. There are quite a few people who are unable/unwilling to purchase VC++. People may also need to build their own Python rather than using the prebuilt binaries. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sun Nov 7 14:24:24 1999 From: gstein at lyra.org (Greg Stein) Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST) Subject: [Python-Dev] updated modules Message-ID: <Pine.LNX.4.10.9911070518020.32496-100000@nebula.lyra.org> Hi all... I've updated some of the modules at http://www.lyra.org/greg/python/. Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and a new imputil.py. The latter will be updated again RSN with some patches from Jim Ahlstrom. Besides some tweaks/fixes/etc, I've also clarified the ownership and licensing of the things. httplib and davlib are (C) Guido, licensed under the Python license (well... anything he chooses :-). qp_xml and imputil are still Public Domain. I also added some comments into the headers to note where they come from (I've had a few people remark that they ran across the module but had no idea who wrote it or where to get updated versions :-), and I inserted a CVS Id to track the versions (yes, I put them into CVS just now). Note: as soon as I figure out the paperwork or whatever, I'll also be skipping the whole "wetsign.txt" thingy and just transfer everything to Guido. He remarked a while ago that he will finally own some code in the Python distribution(!) despite not writing it :-) I might encourage others to consider the same... Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 8 10:33:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 08 Nov 1999 10:33:30 +0100 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <382698EA.4DBA5E4B@lemburg.com> Guido van Rossum wrote: > > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). 
I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I think this would be a good alternative for all those not having MS VC for one reason or another. Since Mingw32 is free this might be an appropriate solution for e.g. schools which don't want to spend lots of money for VC licenses. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. Python, I think, must be a > complete programming language in the "programming for every body" philosofy. Don't know what he meant here... > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. GUIs tend to be fast moving targets, better leave them out of the main distribution. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Database interfaces are usually way to complicated and largish for the standard dist. IMHO, they should always be packaged separately. Note that simple interfaces such as a standard CSV file import/export module would be neat extensions to the dist. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I'm getting the feeling that Ramon is looking for a complete visual programming environment here. XML support in the standard dist (faster than xmllib.py) would be nice. Before that we'd need solid builtin Unicode support though... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 53 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Tue Nov 9 14:57:46 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST) Subject: [Python-Dev] Internationalisation Case Study Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com> Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard. I thought the most helpful thing would be to describe the real-world business problems I have been tackling so people can understand what one might want from an encoding toolkit. In this (long) post I have included: 1. who I am and what I want to do 2. useful sources of info 3. a real world i18n project 4. what I'd like to see in an encoding toolkit Grab a coffee - this is a long one. 1. Who I am -------------- Firstly, credentials. I'm a Python programmer by night, and when I can involve it in my work which happens perhaps 20% of the time. More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference tio a dictionary. Since then my Japanese has atrophied badly, but it is good enough for IT purposes. 
For the last year and a half I have been internationalizing a lot of systems - more on this below. My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation. An encoding library is sorely needed for this. 2. Sources of Knowledge ------------------------------ We should really go for world class advice on this. Some people who could really contribute to this discussion are: - Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe. - Jeffrey Friedl, author of "Mastering Regular Expressions", and a long time Japan resident and expert on things Japanese - Maybe some of the Ruby community? I'll list up books URLs etc. for anyone who needs them on request. 3. A Real World Project ---------------------------- 18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo) , who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms. Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential. I've been there for a while now, rotating through many different IT projects. HugeCo runs its non-US business out of the UK. The core deal-processing business runs on IBM AS400s. These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC. Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions. This means 'mirroring' data between the two systems on a regular basis. IBM has always included encoding information on the AS400 and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.) To make things work for Japan, everyone realised that a double-byte representation would be needed. Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it. Here's a potted language lesson. (Apologies to people who really know this field -- I am not going to be fully pedantic or this would take forever). Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence. The first attempt to display something was to make a single -byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page. So you could spell out the sounds of Japanese words using 'half width katakana'. The basic 'character set' is Japan Industrial Standard 0208 ("JIS"). This was defined in 1978, the first official Asian character set to be defined by a government. This can be thought of as a printed chart showing the characters - it does not define their storage on a computer. It defined a logical 94 x 94 grid, and each character has an index in this grid. The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails. Each Japanese character had a double-byte value. It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite. In 1978 Microsoft quickly came up with Shift-JIS, a smarter encoding. This basically said "Look at the next byte. 
If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half". Extended Unix Code (EUC) does similar tricks. Both have the property that there are no control characters, and ASCII is still ASCII. There are a few other encodings too. Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set). This involves shift-in and shift-out sequences (0x16 and 0x17, cannot remember which way round), so you can mix single and double bytes in a field. And we used AS400s for our core processing. So, back to the problem. We had a FoxPro system using ShiftJIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with. The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs. The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked. The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project. Not only did we need a conversion filter, we had the problem that the character sets were of different sizes. So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip. (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?) We figured out how to get the AS400 do to the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding. It was straightforward in theory, "fun" in practice. I then wrote a python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them. It could also detect corruption and warn us when it occurred. (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library). It was slow, but it got us our gateway in both directions, and it warned us of bad input. 360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends. The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't. A big issue was that field lengths varied. An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half. This resulted in a string that was illegal DBCS, and errors in the database. To guard against this, you need really picky input validation. You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too. The next stage was to bring in our Sybase databases. 
Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case sensitive - more (unrelated) fun when you have 2000 tables. Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems. Basically, a UTF8 character is between one and three bytes, there are no nulls or control characters, and the ASCII characters are still the same ASCII characters. UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic. We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese. The company bought a library called Unilib to do conversions, and started rewriting the data mirror software. This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings. We wanted to test it, and used the Python routines to put together a regression test. As expected, it was mostly right but had some differences, which we were at least able to document. We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months). We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin , so we were able to do the low-level string conversion in compiled code and the high-level control in Python. A FoxPro batch job wrote out delimited text in shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, ftp'ed them to an in directory on the Unix box ready for daily import. At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields). Another issue was half-width katakana. These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard. They look ugly and are discouraged; but when you ar enterinh a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per -character to one (just hit F7 in windows) to save space. Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents. The final technical step was our reports package. This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out Postscript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that. It's a monster to develop in, but when done it really works - million page jobs no problem. We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later. The first stage was that the AS400 would send down fixed width data files in EBCDIC and DBCS. We ran these through a C++ conversion utility, again using Unilib. We had to filter out and warn about corrupt fields, which the conversion utility would reject. Surviving records then went into the reports program. It then turned out that the reports program only supported some of the Japanese alphabets. Specifically, it had a built in font switching system whereby when it encountered ASCII text, it would flip to the most recent single byte text, and when it found a byte above 127, it would flip to a double byte font. 
This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones. This was wrong for Japanese, and made the half-width katakana unprintable. I found out that I could control fonts if I printed one character at a time with a special escape sequence, so I wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted. So a special subroutine is used for every name or address field. This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string.

After all of this, we had a working system and knew quite a bit about encodings. Then the curve ball arrived: User Defined Characters! It is not true to say that there are exactly 6879 characters in Japanese, any more than you can count the number of languages on the Indian sub-continent or the types of cheese in France. There are historical variations and they evolve. Some people's names got missed out, and others like to write a kanji in an unusual way. Others arrived from China where they have more complex variants of the same characters. Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them. My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it but not otherwise (the usual workaround is to spell their name phonetically in katakana). But our marketing people put their foot down.

A key factor is that Microsoft has 'extended the standard' a few times. First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard. This is well understood and most encoding toolkits know that 'Code Page 932' is Shift-JIS plus a few extra characters. Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters. They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts; so users are getting used to these being available. These are not in a standard. Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region.

The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server. This was also fun as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine. Effectively, each company has its own custom encoding a bit bigger than the standard. Clearly, none of these extensions would convert automatically to the other platforms. Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them. We decided that space for 128 User-Defined Characters would be allowed. We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked! Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back. So the conversion problem needed less code than we thought. This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway).
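A rough sketch, not from the original mail, of the kind of check needed to spot user-defined characters that are not on the agreed list. The function name is invented, the input is a sequence of byte values (e.g. the result of map(ord, s)), and the byte ranges are the commonly quoted Shift-JIS ones (lead bytes 0x81-0x9F and 0xE0-0xFC, with 0xF0-0xF9 for the user-defined region); treat those ranges as an assumption to be checked against a real reference.

def find_unknown_udcs(data, known_udcs):
    # Return (offset, (lead, trail)) pairs for user-defined characters
    # that are not in known_udcs, a list of (lead, trail) tuples.
    # Ranges below are the commonly quoted Shift-JIS ones, not verified here.
    unknown = []
    i = 0
    while i < len(data):
        c = data[i]
        if c < 0x80 or (0xA1 <= c <= 0xDF):   # ASCII or half-width katakana
            i = i + 1
            continue
        if not ((0x81 <= c <= 0x9F) or (0xE0 <= c <= 0xFC)):
            raise ValueError('illegal lead byte %d at offset %d' % (c, i))
        if i + 1 >= len(data):
            raise ValueError('truncated double-byte character at offset %d' % i)
        pair = (c, data[i + 1])
        if 0xF0 <= c <= 0xF9 and pair not in known_udcs:
            unknown.append((i, pair))
        i = i + 2
    return unknown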
We did, however, need some really impressive validation. When you input a name or address on any of the platforms, the system should say (a) is it valid for my encoding? (b) will it fit in the available field space in the other platforms? (c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.?

Finally, we got back to the display problems. Our chosen range had a particular first byte. We built a miniature font with the characters we needed starting in the lower half of the code page. I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'. This worked beautifully - not only could we print everything, we were using type 1 embedded fonts for the user defined characters, so we could distill it and also capture it for our internal document imaging systems.

So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms. I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web. The first stage was to prove it all worked. With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML. There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft Kanji and our user defined characters won't work on the web. You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative. Certain customers will need to be warned, or asked to suggest a standard version of a character if they want to see their name on the web. I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune.

Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'. The language strings are all kept in UTF8 in XML files, so the same file can hold many languages. This and the database are the real-world reasons why you want to store stuff in UTF8. There are very few tools to let you view UTF8, but luckily there is a free Word Processor that lets you type Japanese and save it in any encoding; so we can cut and paste between Shift-JIS and UTF8 as needed.

And that's it. No climactic endings and a lot of real world mess, just like life in IT. But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with. See my other mail for actual suggestions - Andy Robinson ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com
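As an aside on the field-width problem described in the case study above (an address that fits on the PC overflowing on the AS400 once shift bytes are added), here is an illustrative estimate of the shifted DBCS length of a Shift-JIS field. It is a simplification: it only counts one shift byte to open and one to close each double-byte run, ignores half-width katakana expansion, and uses the same assumed byte ranges and byte-value input as the sketch earlier; all names are invented.

def dbcs_length(data):
    # Estimate the length of a Shift-JIS field after conversion to shifted
    # DBCS. 'data' is a sequence of byte values. Simplified model only.
    total = 0
    in_double = 0                       # inside a shifted (double-byte) run?
    i = 0
    while i < len(data):
        c = data[i]
        if (0x81 <= c <= 0x9F) or (0xE0 <= c <= 0xFC):
            if not in_double:
                total = total + 1       # shift-out byte opens the run
                in_double = 1
            total = total + 2
            i = i + 2
        else:
            if in_double:
                total = total + 1       # shift-in byte closes the run
                in_double = 0
            total = total + 1
            i = i + 1
    if in_double:
        total = total + 1               # close a run left open at the end
    return total

def fits_on_dbcs_field(data, field_width):
    return dbcs_length(data) <= field_width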
From captainrobbo at yahoo.com Tue Nov 9 14:58:39 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>

Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to java's Byte Array - in other words an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already. To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back.

Easy Conversions
----------------------
This is modelled on Java which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered if conversion happens in a global function, a constructor method or whatever.

MyUniString = ToUnicode('hello') # assumes ASCII
MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS') #specified

The converse applies when converting back. The encoding designators should agree with Java. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly:
1. replace offending characters with a question mark
2. try to recover intelligently (possible in some cases)
3. raise an exception
A 'Unicode' designator is needed which performs a dummy conversion.

File Opening:
---------------
It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally encoded addresses into a Unicode string. e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file. Not sure of a good API to do this. We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open'.

Doing the Conversions
----------------------------
All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb. But there are compact forms which can reduce these to a few kb, and also make the data intelligible. It is obviously desirable to store stuff compactly if we can unpack it fast.
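A minimal sketch of the conversion interface described above. Every name in it (register, to_unicode, the 'strict' and 'replace' strategies) is invented for illustration, and a plain list of code points stands in for the not-yet-existing Unicode string type.

_codecs = {}

def register(name, decode, encode):
    # Invented registry: map an encoding name to a (decode, encode) pair.
    _codecs[name] = (decode, encode)

def to_unicode(data, encoding, errors='strict'):
    # Decode a byte string into a list of code points.
    decode, encode = _codecs[encoding]
    return decode(data, errors)

def from_unicode(chars, encoding, errors='strict'):
    # Encode a list of code points back into a byte string.
    decode, encode = _codecs[encoding]
    return encode(chars, errors)

# Latin-1 maps bytes one-to-one onto the first 256 code points, which makes
# it a conveniently tiny codec for showing the registry in use.
def _latin1_decode(data, errors):
    return [ord(c) for c in data]

def _latin1_encode(chars, errors):
    out = ''
    for cp in chars:
        if cp > 255:
            if errors == 'replace':
                out = out + '?'         # strategy 1: substitute and carry on
            else:
                raise ValueError('code point %d not in Latin-1' % cp)
        else:
            out = out + chr(cp)
    return out

register('latin-1', _latin1_decode, _latin1_encode)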
Typed Strings
----------------
When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like...

>>>ts1 = TypedString('hello', 'cp932ms') # specify encoding, it remembers it
>>>ts2 = TypedString('goodbye','cp5035')
>>>ts1 + ts2 #or any of a host of other encoding options
EncodingError
>>>ts3 = TypedString(ts1, 'cp5035') #converts it implicitly going via Unicode
>>>ts4 = ts1.cast('ShiftJIS') #the developer knows that in this case the string is compatible.

Going Deeper
----------------
The project I describe involved many more issues than just a straight conversion. I envisage an encodings package or module which power users could get at directly. We have to be able to answer the questions:

'is string X a valid instance of encoding Y?'

'is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can.

'can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed?' This is a really useful utility.

More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data driven. So we could have an Encoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding. So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and 'cp932ms'.

GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor windows are fine but console output is shown as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility.

Verifiability and printing
-----------------------------
We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste.
If it could save a page as HTML (just an encoding tag and data between <PRE> tags, then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings. All the encoding stuff is ultimately a bit meaningless without a way to display a character. I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding independent, and come with a whole bunch of mapping tables. So I can ask for the same Japanese font in any of about ten encodings - font name is a combination of face name and encoding. The font itself does the remapping. They make available downloadable font packs for Acrobat 4.0 for most languages now; these are good places to raid for building encoding databases. It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output to regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing and thus help demonstrate its own correctness (or make errors stick out like a sore thumb). Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project. But I believe we could define the standard for years to come. Furthermore, it would go a long way to making Python the corporate choice for data cleaning and transformation - territory I think we should own. Regards, Andy Robinson Robinson Analytics Ltd. ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From guido at CNRI.Reston.VA.US Tue Nov 9 17:46:41 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 11:46:41 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST." <19991109135839.25864.rocketmail@web607.mail.yahoo.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us> Andy, Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms. Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.) One specific question: in you discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. 
I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate? --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 9 18:21:03 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us> Guido van Rossum writes: >I think his proposal will go a long way towards your toolkit. I hope >to hear soon from anybody who disagrees with Marc-Andre's proposal, >because without opposition this is going to be Python 1.6's offering >for i18n... The proposal seems reasonable to me. >(Together with a new Unicode regex engine by /F.) This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11" From guido at CNRI.Reston.VA.US Tue Nov 9 18:26:38 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 12:26:38 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST." <14376.22527.323888.677816@amarok.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us> [AMK] > The proposal seems reasonable to me. Thanks. I really hope that this time we can move forward united... > >(Together with a new Unicode regex engine by /F.) > > This is good news! Would it be a from-scratch regex implementation, > or would it be an adaptation of an existing engine? Would it involve > modifications to the existing re module, or a completely new unicodere > module? (If, unlike re.py, it has POSIX longest-match semantics, that > would pretty much settle the question.) It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 9 18:40:07 1999 From: akuchlin at mems-exchange.org (Andrew M. 
Kuchling) Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Guido van Rossum writes: >It's from scratch, and I believe it's got Perl style, not POSIX style >semantics -- per Tim Peters' recommendations. Do we need to open the >discussion again? No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet. >It involves a redone re module (supporting Unicode as well as 8-bit), >but its API could be unchanged. /F does the parsing and compilation >in Python, only the matching engine is in C -- not sure how that >impacts performance, but I imagine with aggressive caching it would be >okay. Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91 From ping at lfw.org Tue Nov 9 19:08:05 1999 From: ping at lfw.org (Ka-Ping Yee) Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911091004240.7102-100000@localhost> On Tue, 9 Nov 1999, Andrew M. Kuchling wrote: > Guido van Rossum writes: > >It's from scratch, and I believe it's got Perl style, not POSIX style > >semantics -- per Tim Peters' recommendations. Do we need to open the > >discussion again? > > No, no; I'm actually happier with Perl-style, because it's far better > documented and familiar to people. Worse *is* better, after all. I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood. > Doing the compilation in Python is a good idea, and will make it > possible to implement alternative syntaxes. Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). 
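Purely as an invented illustration of such an 80-20 convenience layer (none of these names are a proposal from this thread): a handful of readable tokens plus literal text, compiled down to an ordinary regular expression so simple patterns never expose regex syntax.

import re

# Invented example of a "readable pattern" helper; the token set is arbitrary.
_TOKENS = {'<digits>': '[0-9]+',
           '<letters>': '[A-Za-z]+',
           '<space>': '[ \t]+'}

def friendly_pattern(parts):
    # parts is a list of literal strings and tokens such as '<digits>';
    # literals are escaped, tokens are expanded, and the result is compiled.
    pieces = []
    for part in parts:
        if part in _TOKENS:
            pieces.append(_TOKENS[part])
        else:
            pieces.append(re.escape(part))
    return re.compile(''.join(pieces))

# friendly_pattern(['order #', '<digits>']) matches 'order #123'
# without the caller ever writing regex syntax by hand.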
-- ?!ng From bwarsaw at cnri.reston.va.us Tue Nov 9 19:15:04 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry From akuchlin at mems-exchange.org Tue Nov 9 20:36:28 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST) Subject: [Python-Dev] I18N Toolkit In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us> Barry A. Warsaw writes: (in relation to support for Unicode regexes) >We need to be careful not to worsen performance for 8bit strings. I >think we're already on the edge of acceptability w.r.t. P*** and >hopefully we can /improve/ performance here. I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_ From mhammond at skippinet.com.au Tue Nov 9 23:27:45 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 10 Nov 1999 09:27:45 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> > I think his proposal will go a long way towards your toolkit. 
I hope > to hear soon from anybody who disagrees with Marc-Andre's proposal, No disagreement as such, but a small hole: From tim_one at email.msn.com Wed Nov 10 06:57:14 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 00:57:14 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <000001bf2b40$70183840$d82d153f@tim> [Guido, on "a new Unicode regex engine by /F"] > It's from scratch, and I believe it's got Perl style, not POSIX style > semantics -- per Tim Peters' recommendations. Do we need to open the > discussion again? No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim From tim_one at email.msn.com Wed Nov 10 07:25:07 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 01:25:07 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > (under pressure from HP who want Python i18n badly and are willing to > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I can't make time for a close review now. Just one thing that hit my eye early: Python should provide a built-in constructor for Unicode strings which is available through __builtins__: u = unicode(<encoded Python string>[,<encoding name>= <default encoding>]) u = u'<utf-8 encoded Python string>' Two points on the Unicode literals (u'abc'): UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence: octet: hexdigit hexdigit unicodecode: octet octet unicode_escape: "\\u" unicodecode Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals. The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open. BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential. 
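A small illustration of the \uXXXX expansion sketched above, with invented function names; it turns each escape into the UTF-8 byte values of that code point, and returns a list of byte values purely to stay neutral about string types (a C version of the same expansion appears later in the thread as the attached str2utf.c).

import re

def utf8_bytes(cp):
    # UTF-8 encoding of a code point below 0x10000, as a list of byte values.
    if cp < 0x80:
        return [cp]
    elif cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    else:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]

def expand_u_escapes(text):
    # Expand every \uXXXX escape in text; other characters pass through
    # unchanged as their byte values. Illustrative only, not a real parser.
    out = []
    pos = 0
    for match in re.finditer(r'\\u([0-9a-fA-F]{4})', text):
        for ch in text[pos:match.start()]:
            out.append(ord(ch))
        out.extend(utf8_bytes(int(match.group(1), 16)))
        pos = match.end()
    for ch in text[pos:]:
        out.append(ord(ch))
    return out

# expand_u_escapes(r'Link\u00f6ping')[4:6] == [0xC3, 0xB6]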
international-in-spite-of-himself-ly y'rs - tim From fredrik at pythonware.com Wed Nov 10 09:08:06 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:08:06 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > http://starship.skyport.net/~lemburg/unicode-proposal.txt Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F> From captainrobbo at yahoo.com Wed Nov 10 09:09:26 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. First, a little nit > u = u'<utf-8 encoded Python string>' I don't like using funny prime characters - why not an explicit function like "utf8()" On to the important stuff:> > unicodec.register(<encname>,<encoder>,<decoder> > [,<stream_encoder>, <stream_decoder>]) > This registers the codecs under the given encoding > name in the module global dictionary > unicodec.codecs. Stream codecs are optional: > the unicodec module will provide appropriate > wrappers around <encoder> and > <decoder> if not given. I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this. (1) there are quite a lot of things you might want to do with an encoding object, and we could extend the interface in future easily. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another. (2) especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object (3) for some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, convert double-byte english to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge. (4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. 
One bit of code might be able to share the work of several encodings, by setting a few flags. Certainly true of Japanese. (5) encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object. (6) See my comments on a state machine for doing the encodings. If this is done well, we might two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven. (6) Easy to grow - encodings can be prototyped and proven in Python, ported to C if needed or when ready. In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way I'll have a go at an interface for that, and try show how it would have simplified specific problems I have faced. We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right. Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Wed Nov 10 09:14:21 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:14:21 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Tim Peters wrote: > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...) > This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs > directly. So, as discussed earlier, we should follow Java's lead > and also introduce a \u escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. good idea. and by some reason, patches for this is included in the unicode distribution (see the attached str2utf.c). 
> The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. I vote for 'outlaw'. </F> /* A small code snippet that translates \uxxxx syntax to UTF-8 text. To be cut and pasted into Python/compile.c */ /* Written by Fredrik Lundh, January 1999. */ /* Documentation (for the language reference): \uxxxx -- Unicode character with hexadecimal value xxxx. The character is stored using UTF-8 encoding, which means that this sequence can result in up to three encoded characters. Note that the 'u' must be followed by four hexadecimal digits. If fewer digits are given, the sequence is left in the resulting string exactly as given. If more digits are given, only the first four are translated to Unicode, and the remaining digits are left in the resulting string. */ #define Py_CHARMASK(ch) ch void convert(const char *s, char *p) { while (*s) { if (*s != '\\') { *p++ = *s++; continue; } s++; switch (*s++) { /* -------------------------------------------------------------------- */ /* copy this section to the appropriate place in compile.c... */ case 'u': /* \uxxxx => UTF-8 encoded unicode character */ if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) && isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) { /* fetch hexadecimal character value */ unsigned int n, ch = 0; for (n = 0; n < 4; n++) { int c = Py_CHARMASK(*s); s++; ch = (ch << 4) & ~0xF; if (isdigit(c)) ch += c - '0'; else if (islower(c)) ch += 10 + c - 'a'; else ch += 10 + c - 'A'; } /* store as UTF-8 */ if (ch < 0x80) *p++ = (char) ch; else { if (ch < 0x800) { *p++ = 0xc0 | (ch >> 6); *p++ = 0x80 | (ch & 0x3f); } else { *p++ = 0xe0 | (ch >> 12); *p++ = 0x80 | ((ch >> 6) & 0x3f); *p++ = 0x80 | (ch & 0x3f); } } break; } else goto bogus; /* -------------------------------------------------------------------- */ default: bogus: *p++ = '\\'; *p++ = s[-1]; break; } } *p++ = '\0'; } main() { int i; unsigned char buffer[100]; convert("Link\\u00f6ping", buffer); for (i = 0; buffer[i]; i++) if (buffer[i] < 0x20 || buffer[i] >= 0x80) printf("\\%03o", buffer[i]); else printf("%c", buffer[i]); } From gstein at lyra.org Thu Nov 11 10:18:52 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST) Subject: [Python-Dev] Re: Internal Format In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911110116050.638-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Fredrik Lundh wrote: > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. 
The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 10 09:24:16 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:24:16 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > One specific question: in you discussion of typed strings, I'm not > sure why you couldn't convert everything to Unicode and be done with > it. I have a feeling that the answer is somewhere in your case study > -- maybe you can elaborate? Marc-Andre writes: Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object. keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environ- ment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F> From mal at lemburg.com Wed Nov 10 10:04:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:04:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> Message-ID: <38293527.3CF5C7B0@lemburg.com> Mark Hammond wrote: > > > I think his proposal will go a long way towards your toolkit. I > hope > > to hear soon from anybody who disagrees with Marc-Andre's proposal, > > No disagreement as such, but a small hole: > > >From the proposal: > > Internal Argument Parsing: > -------------------------- > ... > 's': For Unicode objects: auto convert them to the <default encoding> > and return a pointer to the object's <defencbuf> buffer. > > -- > Excellent - if someone passes a Unicode object, it can be > auto-converted to a string. This will allow "open()" to accept > Unicode strings. Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though. > However, there doesnt appear to be a reverse. Eg, if my extension > module interfaces to a library that uses Unicode natively, how can I > get a Unicode object when the user passes a string? If I had to > explicitely check for a string, then check for a Unicode on failure it > would get messy pretty quickly... Is it not possible to have "U" also > do a conversion? "U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. 
Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 10:49:01 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:49:01 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <38293F8D.F60AE605@lemburg.com> Tim Peters wrote: > > > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > > (under pressure from HP who want Python i18n badly and are willing to > > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt > > I can't make time for a close review now. Just one thing that hit my eye > early: > > Python should provide a built-in constructor for Unicode strings > which is available through __builtins__: > > u = unicode(<encoded Python string>[,<encoding name>= > <default encoding>]) > > u = u'<utf-8 encoded Python string>' > > Two points on the Unicode literals (u'abc'): > > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs directly. So, as > discussed earlier, we should follow Java's lead and also introduce a \u > escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. 
It would be more conform to use the Unicode ordinal (instead of interpreting the number as UTF8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file or the Unicode book for that matter. > The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW. > BTW, is ord(unicode_char) defined? And as what? And does ord have an > inverse in the Unicode world? Both seem essential. Good points. How about uniord(u[:1]) --> Unicode ordinal number (32-bit) unichr(i) --> Unicode object for character i (provided it is 32-bit); ValueError otherwise They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study). I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Wed Nov 10 11:50:05 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:50:05 +0100 Subject: regexp performance (Re: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Andrew M. Kuchling <akuchlin at mems-exchange.org> wrote: > (Speeding up PCRE -- that's another question. I'm often tempted to > rewrite pcre_compile to generate an easier-to-analyse parse tree, > instead of its current complicated-but-memory-parsimonious compiler, > but I'm very reluctant to introduce a fork like that.) any special pattern constructs that are in need of per- formance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferrably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F> From gstein at lyra.org Thu Nov 11 11:46:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293527.3CF5C7B0@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: >... > Well almost... it depends on the current value of <default encoding>. Default encodings are kind of nasty when they can be altered. 
The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed. Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either... I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly. >... > Another problem is that Unicode types differ between platforms > (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit > wchar_t). Depending on the internal format of Unicode objects > this could mean calling different conversion APIs. Exactly the reason to avoid wchar_t. > BTW, I'm still not too sure about the underlying internal format. > The problem here is that Unicode started out as 2-byte fixed length > representation (UCS2) but then shifted towards a 4-byte fixed length > reprensetation known as UCS4. Since having 4 bytes per character > is hard sell to customers, UTF16 was created to stuff the UCS4 > code points (this is how character entities are called in Unicode) > into 2 bytes... with a variable length encoding. History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now? >... > The downside of using UTF16: it is a variable length format, > so iterations over it will be slower than for UCS4. Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable length format, when people seem to be doing fine with UCS-2? Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own. One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people can work with the values entirely in Python, PyUnicodeType-be-damned. Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; that the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice-versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there. > Simply sticking to UCS2 is probably out of the question, > since Unicode 3.0 requires UCS4 and we are targetting > Unicode 3.0. Oh? Who says? 
Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 10 11:52:28 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:52:28 +0100 Subject: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> (a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin at mems-exchange.org> wrote: > I don't think that will be a problem, given that the Unicode engine > would be a separate C implementation. A bit of 'if type(strg) == > UnicodeType' in re.py isn't going to cost very much speed. a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious: ordinary pattern, ordinary string unicode pattern, unicode string but what about these? ordinary pattern, unicode string unicode pattern, ordinary string "coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F> From gstein at lyra.org Thu Nov 11 11:50:56 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293F8D.F60AE605@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > inverse in the Unicode world? Both seem essential. > > Good points. > > How about > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > ValueError otherwise Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/ From jcw at equi4.com Wed Nov 10 12:14:17 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Wed, 10 Nov 1999 12:14:17 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38295389.397DDE5E@equi4.com> Greg Stein wrote: [MAL:] > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > is doing (as I recall). Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whetever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG, was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. 
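The replies below give the rationale; the indexing half of it can be sketched in a few lines of plain Python (the helper is invented for illustration): with a variable-length encoding you have to scan to find the i-th character, while a fixed-width format gets there with simple arithmetic.

    def utf8_char_offset(data, index):
        # Byte offset of character `index` in a UTF-8 encoded byte string.
        # Continuation bytes look like 10xxxxxx (0x80..0xBF), so only lead
        # bytes are counted -- an O(n) scan, unlike fixed-width UCS-2/UCS-4
        # where the offset is simply index * item_size.
        seen = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:         # not a continuation byte
                if seen == index:
                    return pos
                seen = seen + 1
        raise IndexError("character index out of range")

    data = "abc".encode("utf-8") + b"\xe3\x81\x82"   # 'abc' plus HIRAGANA LETTER A
    assert utf8_char_offset(data, 3) == 3            # the 4th character starts at byte 3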
-- Jean-Claude From gstein at lyra.org Thu Nov 11 12:17:56 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295389.397DDE5E@equi4.com> Message-ID: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Jean-Claude Wippler wrote: > Greg Stein wrote: > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? > > I couldn't find a discussion in the last months of the string SIG, was > this decided upon and frozen long ago? Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Wed Nov 10 10:55:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:55:42 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <3829411E.FD32F8CC@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > One specific question: in you discussion of typed strings, I'm not > > sure why you couldn't convert everything to Unicode and be done with > > it. I have a feeling that the answer is somewhere in your case study > > -- maybe you can elaborate? > > Marc-Andre writes: > > Unicode objects should have a pointer to a cached (read-only) char > buffer <defencbuf> holding the object's value using the current > <default encoding>. This is needed for performance and internal > parsing (see below) reasons. The buffer is filled when the first > conversion request to the <default encoding> is issued on the object. > > keeping track of an external encoding is better left > for the application programmers -- I'm pretty sure that > different application builders will want to handle this > in radically different ways, depending on their environ- > ment, underlying user interface toolkit, etc. It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways. > besides, this is how Tcl would have done it. Python's > not Tcl, and I think you need *very* good arguments > for moving in that direction. > > </F> > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 12:42:00 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Wed, 10 Nov 1999 12:42:00 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> Message-ID: <38295A08.D3928401@lemburg.com> Andy Robinson wrote: > > In general, I like this proposal a lot, but I think it > only covers half the story. How we actually build the > encoder/decoder for each encoding is a very big issue. > Thoughts on this below. > > First, a little nit > > u = u'<utf-8 encoded Python string>' > I don't like using funny prime characters - why not an > explicit function like "utf8()" u = unicode('...I am UTF8...','utf-8') will do just that. I've moved to Tim's proposal with the \uXXXX encoding for u'', BTW. > On to the important stuff:> > > unicodec.register(<encname>,<encoder>,<decoder> > > [,<stream_encoder>, <stream_decoder>]) > > > This registers the codecs under the given encoding > > name in the module global dictionary > > unicodec.codecs. Stream codecs are optional: > > the unicodec module will provide appropriate > > wrappers around <encoder> and > > <decoder> if not given. > > I would MUCH prefer a single 'Encoding' class or type > to wrap up these things, rather than up to four > disconnected objects/functions. Essentially it would > be an interface standard and would offer methods to do > the four things above. > > There are several reasons for this. > > ... > > In summary, firm up the concept of an Encoding object > and give it room to grow - that's the key to > real-world usefulness. If people feel the same way > I'll have a go at an interface for that, and try show > how it would have simplified specific problems I have > faced. Ok, you have a point there. Here's a proposal (note that this only defines an interface, not a class structure): Codec Interface Definition: --------------------------- The following base class should be defined in the module unicodec. class Codec: def encode(self,u): """ Return the Unicode object u encoded as Python string. """ ... def decode(self,s): """ Return an equivalent Unicode object for the encoded Python string s. """ ... def dump(self,u,stream,slice=None): """ Writes the Unicode object's contents encoded to the stream. stream must be a file-like object open for writing binary data. If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def load(self,stream,length=None): """ Reads an encoded string (up to <length> bytes) from the stream and returns an equivalent Unicode object. stream must be a file-like object open for reading binary data. If length is given, only length bytes are read. Note that this can cause the decoding algorithm to fail due to truncations in the encoding. """ ... the base class should provide a default implementation of this method using self.encode ... Codecs should raise an UnicodeError in case the conversion is not possible. It is not required by the unicodec.register() API to provide a subclass of this base class, only the 4 given methods must be present. This allows writing Codecs as extensions types. XXX Still to be discussed: ? support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ ) ? support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings. ? 
support for numbers, digits, whitespace, etc. ? support (or no support) for private code point areas > We also need to think about where encoding info will > live. You cannot avoid mapping tables, although you > can hide them inside code modules or pickled objects > if you want. Should there be a standard > "..\Python\Enc" directory? Mapping tables should be incorporated into the codec modules preferably as static C data. That way multiple processes can share the same data. > And we're going to need some kind of testing and > certification procedure when adding new encodings. > This stuff has to be right. I will have to rely on your cooperation for the test data. Roundtrip testing is easy to implement, but I will also have to verify the output against prechecked data which is probably only creatable using visual tools to which I don't have access (e.g. a Japanese Windows installation). > Guido asked about TypedString. This can probably be > done on top of the built-in stuff - it is just a > convenience which would clarify intent, reduce lines > of code and prevent people shooting themselves in the > foot when juggling a lot of strings in different > (non-Unicode) encodings. I can do a Python module to > implement that on top of whatever is built. Ok. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 11:03:36 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 11:03:36 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <382942F8.1921158E@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > http://starship.skyport.net/~lemburg/unicode-proposal.txt > > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... > > (besides, this is exactly how it's already done in > unicode.c and what 'sre' prefers...) Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. 
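To illustrate the byte-order point behind the sys.bom (or sys.endianness) idea, here is a small sketch in plain Python -- the helper is invented for illustration -- that uses the byte order mark U+FEFF to decide whether incoming 2-byte units need swapping before they can be used as the internal format:

    import struct

    def native_units(buf):
        # Return the 16-bit units of `buf` in native order, dropping a leading BOM.
        # U+FEFF read with the wrong byte order appears as 0xFFFE, which is
        # the tell-tale that every unit has to be byte-swapped.
        count = len(buf) // 2
        units = struct.unpack("=%dH" % count, buf[:count * 2])
        if units and units[0] == 0xFFFE:
            units = tuple(((u >> 8) | ((u & 0xFF) << 8)) for u in units)
        if units and units[0] == 0xFEFF:
            units = units[1:]
        return units

    # 'A' then GREEK SMALL LETTER PI, written big-endian with a BOM;
    # the result is the same on either kind of machine:
    assert native_units(b"\xfe\xff\x00\x41\x03\xc0") == (0x0041, 0x03C0)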
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Wed Nov 10 13:32:16 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:32:16 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> > What I don't like is using wchar_t if available (and then addressing > it as if it were defined as unsigned integer). IMO, it's better > to define a Python Unicode representation which then gets converted > to whatever wchar_t represents on the target machine. you should read the unicode.h file a bit more carefully: ... /* Unicode declarations. Tweak these to match your platform */ /* set this flag if the platform has "wchar.h", "wctype.h" and the wchar_t type is a 16-bit unsigned type */ #define HAVE_USABLE_WCHAR_H #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) (this uses wchar_t, and also iswspace and friends) ... #else /* Use if you have a standard ANSI compiler, without wchar_t support. If a short is not 16 bits on your platform, you have to fix the typedef below, or the module initialization code will complain. */ (this maps iswspace to isspace, for 8-bit characters). #endif ... the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own uni- code database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F> From fredrik at pythonware.com Wed Nov 10 13:39:56 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:39:56 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > Have you ever noticed how Python modules, packages, tools, etc, never > define an import hook? hey, didn't MAL use one in one of his mx kits? ;-) > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > something else, then do that explicitly. exactly. modes are evil. python is not perl. etc. > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F> From mal at lemburg.com Wed Nov 10 13:44:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:44:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> Message-ID: <382968B7.ABFFD4C0@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > > Tim Peters wrote: > > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > > inverse in the Unicode world? Both seem essential. > > > > Good points. 
> > > > How about > > > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > > ValueError otherwise > > Why new functions? Why not extend the definition of ord() and chr()? > > In terms of backwards compatibility, the only issue could possibly be that > people relied on chr(x) to throw an error when x>=256. They certainly > couldn't pass a Unicode object to ord(), so that function can safely be > extended to accept a Unicode object and return a larger integer. Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 14:08:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:08:30 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38296E4E.914C0ED7@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > >... > > Well almost... it depends on the current value of <default encoding>. > > Default encodings are kind of nasty when they can be altered. The same > problem occurred with import hooks. Only one can be present at a time. > This implies that modules, packages, subsystems, whatever, cannot set a > default encoding because something else might depend on it having a > different value. In the end, nobody uses the default encoding because it > is unreliable, so you end up with extra implementation/semantics that > aren't used/needed. I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time. > [...] > > > BTW, I'm still not too sure about the underlying internal format. > > The problem here is that Unicode started out as 2-byte fixed length > > representation (UCS2) but then shifted towards a 4-byte fixed length > > reprensetation known as UCS4. Since having 4 bytes per character > > is hard sell to customers, UTF16 was created to stuff the UCS4 > > code points (this is how character entities are called in Unicode) > > into 2 bytes... with a variable length encoding. > > History is basically irrelevant. What is the situation today? What is in > use, and what are people planning for right now? > > >... > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl is > doing (as I recall). > > Why go with a variable length format, when people seem to be doing fine > with UCS-2? The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasable and effective. 
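A rough sketch (plain Python, invented name) of the check such a UCS2 flag could be based on: surrogate units occupy a range ordinary characters never use, so a string that contains none of them can be treated as fixed-width UCS-2 and get the fast paths.

    def is_pure_ucs2(units):
        # True if a sequence of 16-bit UTF-16 units contains no surrogates.
        # Surrogates live in 0xD800..0xDFFF; if none occur, each unit is
        # exactly one character and fixed-width slicing/indexing/searching
        # can be used directly.
        for u in units:
            if 0xD800 <= u <= 0xDFFF:
                return False
        return True

    assert is_pure_ucs2([0x0041, 0x20AC, 0x3042])    # 'A', euro sign, hiragana 'a'
    assert not is_pure_ucs2([0xD800, 0xDC00])        # a surrogate pair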
UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ): """ Q: How about using UCS-4 interfaces in my APIs? Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments. Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels. Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc. Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string. That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """ > Like I said in the other mail note: two large platforms out there are > UCS-2 based. They seem to be doing quite well with that approach. > > If people truly need UCS-4, then they can work with that on their own. One > of the major reasons for putting Unicode into Python is to > increase/simplify its ability to speak to the underlying platform. Hey! > Guess what? That generally means UCS2. All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4). > If we didn't need to speak to the OS with these Unicode values, then > people can work with the values entirely in Python, > PyUnicodeType-be-damned. > > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. > > Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and > vice-versa. But: it only does it from String to String -- you can't use > Unicode objects anywhere in there. See above. > > Simply sticking to UCS2 is probably out of the question, > > since Unicode 3.0 requires UCS4 and we are targetting > > Unicode 3.0. > > Oh? Who says? >From the FAQ: """ Q: What is UTF-16? Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) 
Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16. """ Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligable, e.g. we could define the internal format to be UTF-16 and raise exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as political compromise ;-). But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 13:36:44 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:36:44 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> Message-ID: <382966DC.F33E340E@lemburg.com> Fredrik Lundh wrote: > > > What I don't like is using wchar_t if available (and then addressing > > it as if it were defined as unsigned integer). IMO, it's better > > to define a Python Unicode representation which then gets converted > > to whatever wchar_t represents on the target machine. > > you should read the unicode.h file a bit more carefully: > > ... > > /* Unicode declarations. Tweak these to match your platform */ > > /* set this flag if the platform has "wchar.h", "wctype.h" and the > wchar_t type is a 16-bit unsigned type */ > #define HAVE_USABLE_WCHAR_H > > #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) > > (this uses wchar_t, and also iswspace and friends) > > ... > > #else > > /* Use if you have a standard ANSI compiler, without wchar_t support. > If a short is not 16 bits on your platform, you have to fix the > typedef below, or the module initialization code will complain. */ > > (this maps iswspace to isspace, for 8-bit characters). > > #endif > > ... > > the plan was to use the second solution (using "configure" > to figure out what integer type to use), and its own uni- > code database table for the is/to primitives Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16. > (iirc, the unicode.txt file discussed this, but that one > seems to be missing from the zip archive). It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 14:13:10 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Wed, 10 Nov 1999 14:13:10 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <38295389.397DDE5E@equi4.com> Message-ID: <38296F66.5DF9263E@lemburg.com> Jean-Claude Wippler wrote: > > Greg Stein wrote: > [MAL:] > > > The downside of using UTF16: it is a variable length format, > > > so iterations over it will be slower than for UCS4. > > > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Wed Nov 10 15:56:16 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST) Subject: [Python-Dev] Re: regexp performance In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us> [Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes: >any special pattern constructs that are in need of per- >formance improvements? (compared to Perl, that is). In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string contained 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE. >or maybe anyone has an extensive performance test >suite for perlish regular expressions? (preferrably based >on how real people use regular expressions, not only on >things that are known to be slow if not optimized) Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. 
-- Rose Walker, in SANDMAN #60: "The Kindly Ones:4" From jack at oratrix.nl Wed Nov 10 16:04:58 1999 From: jack at oratrix.nl (Jack Jansen) Date: Wed, 10 Nov 1999 16:04:58 +0100 Subject: [Python-Dev] I18N Toolkit In-Reply-To: Message by "Fredrik Lundh" <fredrik@pythonware.com> , Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl> > a slightly hairer design issue is what combinations > of pattern and string the new 're' will handle. > > the first two are obvious: > > ordinary pattern, ordinary string > unicode pattern, unicode string > > but what about these? > > ordinary pattern, unicode string > unicode pattern, ordinary string I think the logical thing to do would be to "promote" the ordinary pattern or string to unicode, in a similar way to what happens if you combine ints and floats in a single expression. The result may be a bit surprising if your pattern is in ascii and you've never been aware of unicode and are given such a string from somewhere else, but then if you're only aware of integer arithmetic and are suddenly presented with a couple of floats you'll also be pretty surprised at the result. At least it's easily explained. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From fdrake at acm.org Wed Nov 10 16:22:17 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST) Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us> Fredrik Lundh writes: > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From skip at mojam.com Wed Nov 10 16:54:30 1999 From: skip at mojam.com (Skip Montanaro) Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <14377.38198.793496.870273@dolphin.mojam.com> Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... 
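Jack's int/float analogy from a couple of messages up could be sketched as a coercion step in front of the matcher; the helper below is purely illustrative (written in terms of today's bytes/str split just to make it runnable), not the actual 're' design:

    def coerce_pair(pattern, string, encoding="ascii"):
        # Promote (pattern, string) to a common type, Unicode winning,
        # much as mixing an int and a float promotes the int.
        if isinstance(pattern, bytes) and isinstance(string, str):
            pattern = pattern.decode(encoding)
        elif isinstance(pattern, str) and isinstance(string, bytes):
            string = string.decode(encoding)
        return pattern, string

    # an ordinary pattern applied to a Unicode target gets promoted:
    p, s = coerce_pair(b"ab+c", "abbbc")
    assert isinstance(p, str) and isinstance(s, str)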
Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake at acm.org Wed Nov 10 16:58:30 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295A08.D3928401@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > def encode(self,u): > > """ Return the Unicode object u encoded as Python string. This should accept an optional slice parameter, and use it in the same way as .dump(). > def dump(self,u,stream,slice=None): ... > def load(self,stream,length=None): Why not have something like .wrapFile(f) that returns a file-like object with all the file methods implemented, and doing to "right thing" regarding encoding/decoding? That way, the new file-like object can be used directly with code that works with files and doesn't care whether it uses 8-bit or unicode strings. > Codecs should raise an UnicodeError in case the conversion is > not possible. I think that should be ValueError, or UnicodeError should be a subclass of ValueError. (Can the -X interpreter option be removed yet?) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From bwarsaw at cnri.reston.va.us Wed Nov 10 17:41:29 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> <14377.38198.793496.870273@dolphin.mojam.com> Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us> >>>>> "SM" == Skip Montanaro <skip at mojam.com> writes: SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry From mal at lemburg.com Wed Nov 10 14:31:47 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:31:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Message-ID: <382973C3.DCA77051@lemburg.com> Fredrik Lundh wrote: > > Greg Stein <gstein at lyra.org> wrote: > > Have you ever noticed how Python modules, packages, tools, etc, never > > define an import hook? > > hey, didn't MAL use one in one of his mx kits? ;-) Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter. > > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > > something else, then do that explicitly. > > exactly. > > modes are evil. python is not perl. etc. But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals). > > Are we digging a hole for ourselves? Maybe. But there are two other big > > platforms that have the same hole to dig out of *IF* it ever comes to > > that. 
I posit that it won't be necessary; that the people needing UCS-4 > > can do so entirely in Python. > > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Wed Nov 10 22:36:04 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 11 Nov 1999 08:36:04 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Marc writes: > > modes are evil. python is not perl. etc. > > But a requirement by the customer... they want to be able to > set the locale > on a per thread basis. Not exactly my preference (I think all locale > settings should be passed as parameters, not via globals). Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants. For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized. I agree that having a default encoding that can be changed is a bad idea. It may make 3 line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do? Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable. Mark. > > > > Are we digging a hole for ourselves? Maybe. But there are > two other big > > > platforms that have the same hole to dig out of *IF* it > ever comes to > > > that. I posit that it won't be necessary; that the people > needing UCS-4 > > > can do so entirely in Python. > > > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? > > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). 
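The helper Mark describes could be as small as the following sketch (plain Python, invented names) -- an application-level utility that keeps the chosen encoding per thread and always converts explicitly, instead of the interpreter carrying a mutable global default:

    import threading

    _thread_encoding = {}          # thread id -> encoding name
    DEFAULT = "utf-8"              # the fixed, documented fallback

    def set_encoding(name):
        # Choose an encoding for the current thread only.
        _thread_encoding[threading.get_ident()] = name

    def to_unicode(raw):
        # Decode a byte string using the calling thread's chosen encoding.
        encoding = _thread_encoding.get(threading.get_ident(), DEFAULT)
        return raw.decode(encoding)

    set_encoding("latin-1")
    assert to_unicode(b"caf\xe9") == "caf\xe9"       # 0xE9 is e-acute in Latin-1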
> > -- > Marc-Andre Lemburg > ______________________________________________________________________ > Y2000: 51 days left > Business: http://www.lemburg.com/ > Python Pages: http://www.lemburg.com/python/ > > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev > From gstein at lyra.org Fri Nov 12 00:14:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> On Thu, 11 Nov 1999, Mark Hammond wrote: > Marc writes: > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. Ha! I was getting ready to say exactly the same thing. Are building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-) >... > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-) > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. *bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. 
Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/ From da at ski.org Thu Nov 11 00:21:54 1999 From: da at ski.org (David Ascher) Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> Message-ID: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> On Thu, 11 Nov 1999, Greg Stein wrote: > Ha! I was getting ready to say exactly the same thing. Are building Python > for a particular customer, or are we building it to Do The Right Thing? > > I've been getting increasingly annoyed at "well, HP says this" or "HP > wants that." I'm ecstatic that they are a Consortium member and are > helping to fund the development of Python. However, if that means we are > selling Python's soul to corporate wishes rather than programming and > design ideals... well, it reduces my enthusiasm :-) What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david From gstein at lyra.org Fri Nov 12 00:31:43 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911111531200.18059-100000@nebula.lyra.org> Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote: > On Thu, 11 Nov 1999, Greg Stein wrote: > > > Ha! I was getting ready to say exactly the same thing. Are building Python > > for a particular customer, or are we building it to Do The Right Thing? > > > > I've been getting increasingly annoyed at "well, HP says this" or "HP > > wants that." I'm ecstatic that they are a Consortium member and are > > helping to fund the development of Python. However, if that means we are > > selling Python's soul to corporate wishes rather than programming and > > design ideals... well, it reduces my enthusiasm :-) > > What about just explaining the rationale for the default-less point of > view to whoever is in charge of this at HP and see why they came up with > their rationale in the first place? They might have a good reason, or > they might be willing to change said requirement. > > --david > -- Greg Stein, http://www.lyra.org/ From tim_one at email.msn.com Thu Nov 11 07:25:27 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:25:27 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Message-ID: <000201bf2c0d$8b866160$262d153f@tim> [/F, dripping with code] > ... > Note that the 'u' must be followed by four hexadecimal digits. If > fewer digits are given, the sequence is left in the resulting string > exactly as given. Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards] > \uxxxx -- Unicode character with hexadecimal value xxxx. 
The > character is stored using UTF-8 encoding, which means that this > sequence can result in up to three encoded characters. The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). From tim_one at email.msn.com Thu Nov 11 07:49:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:49:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> Message-ID: <000501bf2c10$df4679e0$262d153f@tim> [ Greg Stein] > ... > Things will be a lot faster if we have a fixed-size character. Variable > length formats like UTF-8 are a lot harder to slice, search, etc. The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>. > (IMO) a big reason for this new type is for interaction with the > underlying OS/platform. I don't know of any platforms right now that > really use UTF-8 as their Unicode string representation (meaning we'd > have to convert back/forth from our UTF-8 representation to talk to the > OS). No argument here. 
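Tim's search point is easy to demonstrate with a couple of lines of plain Python (illustrative, not part of any proposal): UTF-8 lead bytes and continuation bytes come from disjoint ranges, so a naive byte-oriented search can only match at a real character boundary, which is not true for raw 2-byte units.

    # 'ab' followed by the euro sign, encoded two ways:
    utf8 = "ab\u20ac".encode("utf-8")         # 61 62 E2 82 AC
    ucs2 = "ab\u20ac".encode("utf-16-be")     # 00 61 00 62 20 AC

    # NOT SIGN (U+00AC, encoded C2 AC) is not falsely found inside the
    # euro sign's trailing ...82 AC bytes:
    assert utf8.find("\u00ac".encode("utf-8")) == -1

    # but a raw byte search over fixed 2-byte units can land between
    # characters: the pair 62 20 (how U+6220 would be stored) straddles
    # 'b' and the euro sign, at an odd (unaligned) offset:
    assert ucs2.find("\u6220".encode("utf-16-be")) == 3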
From tim_one at email.msn.com Thu Nov 11 07:56:35 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:56:35 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382968B7.ABFFD4C0@lemburg.com> Message-ID: <000601bf2c11$e4b07920$262d153f@tim> [MAL, on Unicode chr() and ord() > ... > Because unichr() will always have to return Unicode objects. You don't > want chr(i) to return Unicode for i>255 and strings for i<256. Indeed I do not! > OTOH, ord() could probably be extended to also work on Unicode objects. I think should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string. From tim_one at email.msn.com Thu Nov 11 08:03:34 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:03:34 -0500 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us> Message-ID: <000701bf2c12$de8bca80$262d153f@tim> [Andrew M. Kuchling] > ... > Friedl's book describes several optimizations which aren't implemented > in PCRE. The problem is that PCRE never builds a parse tree, and > parse trees are easy to analyse recursively. Instead, PCRE's > functions actually look at the compiled byte codes (for example, look > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > functions hard to write, and rearranging the code near-impossible. This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim From tim_one at email.msn.com Thu Nov 11 08:27:52 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:27:52 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim> [/F] > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? [MAL] > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. 
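For reference, the arithmetic UTF-16 defines for those extra code points is small; a plain-Python sketch (invented helper, illustration only):

    def to_surrogate_pair(code):
        # Split a code point above 0xFFFF into a UTF-16 surrogate pair:
        # high surrogate 0xD800..0xDBFF, then low surrogate 0xDC00..0xDFFF,
        # ranges that ordinary 16-bit characters never use.
        assert code > 0xFFFF
        code = code - 0x10000
        high = 0xD800 | (code >> 10)          # top 10 bits
        low = 0xDC00 | (code & 0x3FF)         # bottom 10 bits
        return high, low

    # first code point of the >16-bit private range quoted above:
    assert to_surrogate_pair(0x100000) == (0xDBC0, 0xDC00)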
predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim From captainrobbo at yahoo.com Thu Nov 11 08:29:05 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com> > 2. Are there plans for an internationalization > session at IPC8? Perhaps a > few key players could be locked into a room for a > couple days, to emerge > bloodied, but with an implementation in-hand... Excellent idea. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From tim_one at email.msn.com Thu Nov 11 08:29:50 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:29:50 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <000901bf2c16$8a107420$262d153f@tim> [Mark Hammond] > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > ... > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. Well said, Mark! Me too. It's like HP is suffering from Windows envy <wink>. From captainrobbo at yahoo.com Thu Nov 11 08:30:53 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com> --- "Barry A. Warsaw" <bwarsaw at cnri.reston.va.us> wrote: > > I'm starting to think about devday topics. Sounds > like an I18n > session would be very useful. Champions? > I'm willing to explain what the fuss is about to bemused onlookers and give some examples of problems it should be able to solve - plenty of good slides and screen shots. I'll stay well away from the C implementation issues. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Thu Nov 11 08:33:25 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> > > What about just explaining the rationale for the > default-less point of > view to whoever is in charge of this at HP and see > why they came up with > their rationale in the first place? They might have > a good reason, or > they might be willing to change said requirement. > > --david For that matter (I came into this a bit late), is there a statement somewhere of what HP actually want to do? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Thu Nov 11 08:44:50 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> > I say axe it and say "UTF-8" is the fixed, default > encoding. If you want > something else, then do that explicitly. > Let me tell you why you would want to have an encoding which can be set: (1) sday I am on a Japanese Windows box, I have a string called 'address' and I do 'print address'. If I see utf8, I see garbage. If I see Shift-JIS, I see the correct Japanese address. At this point in time, utf8 is an interchange format but 99% of the world's data is in various native encodings. Analogous problems occur on input. (2) I'm using htmlgen, which 'prints' objects to standard output. My web site is supposed to be encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, etc.) Yes, browsers CAN detect and display UTF8 but you just don't find UTF8 sites in the real world - and most users just don't know about the encoding menu, and will get pissed off if they have to reach for it. Ditto for streaming output in some protocol. Java solves this (and we could too by hacking stdout) using Writer classes which are created as wrappers around an output stream and can take an encoding, but you lose the flexibility to 'just print'. I think being able to change encoding would be useful. What I do not want is to auto-detect it from the operating system when Python boots - that would be a portability nightmare. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Thu Nov 11 09:06:04 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Thu, 11 Nov 1999 09:06:04 +0100 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance References: <000701bf2c12$de8bca80$262d153f@tim> Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com> Tim Peters <tim_one at email.msn.com> wrote: > > The problem is that PCRE never builds a parse tree, and > > parse trees are easy to analyse recursively. Instead, PCRE's > > functions actually look at the compiled byte codes (for example, look > > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > > functions hard to write, and rearranging the code near-impossible. > > This is wonderfully & ironically Pythonic. That is, the Python compiler > itself goes straight to byte code, and the optimization that's done works at > the latter low level. yeah, but by some reason, people (including GvR) expect a regular expression machinery to be more optimized than the language interpreter ;-) </F> From tim_one at email.msn.com Thu Nov 11 09:01:58 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 03:01:58 -0500 Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> Message-ID: <000c01bf2c1b$0734c060$262d153f@tim> [Andy Robinson] > For that matter (I came into this a bit late), is > there a statement somewhere of what HP actually want > to do? 
On this list, the best explanation we got was from Guido: they want "internationalization", and "Perl-compatible Unicode regexps". I'm not sure they even know the two aren't identical <0.9 wink>. code-without-requirements-is-like-sex-without-consequences-ly y'rs - tim From guido at CNRI.Reston.VA.US Thu Nov 11 13:03:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 11 Nov 1999 07:03:51 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST." <19991111074450.20451.rocketmail@web606.mail.yahoo.com> References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us> > Let me tell you why you would want to have an encoding > which can be set: > > (1) sday I am on a Japanese Windows box, I have a > string called 'address' and I do 'print address'. If > I see utf8, I see garbage. If I see Shift-JIS, I see > the correct Japanese address. At this point in time, > utf8 is an interchange format but 99% of the world's > data is in various native encodings. > > Analogous problems occur on input. > > (2) I'm using htmlgen, which 'prints' objects to > standard output. My web site is supposed to be > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > etc.) Yes, browsers CAN detect and display UTF8 but > you just don't find UTF8 sites in the real world - and > most users just don't know about the encoding menu, > and will get pissed off if they have to reach for it. > > Ditto for streaming output in some protocol. > > Java solves this (and we could too by hacking stdout) > using Writer classes which are created as wrappers > around an output stream and can take an encoding, but > you lose the flexibility to 'just print'. > > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. You almost convinced me there, but I think this can still be done without changing the default encoding: simply reopen stdout with a different encoding. This is how Java does it. I/O streams with an encoding specified at open() are a very powerful feature. You can hide this in your $PYTHONSTARTUP. François Pinard might not like it though... BTW, someone asked what HP asked for: I can't reveal what exactly they asked for, basically because they don't seem to agree amongst themselves. The only firm statements I have are that they want i18n and that they want it fast (before the end of the year). The desire for Perl-compatible regexps comes from me, and the only reason is compatibility with re.py. (HP did ask for regexps, but they don't know the difference between POSIX and Perl if it poked them in the eye.) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Thu Nov 11 13:20:39 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit (fwd) Message-ID: <Pine.LNX.4.10.9911110419400.27203-100000@nebula.lyra.org> Andy originally sent this just to me... I replied in kind, but saw that he sent another copy to python-dev. Sending my reply there...
---------- Forwarded message ---------- Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST) From: Greg Stein <gstein at lyra.org> To: andy at robanal.demon.co.uk Subject: Re: [Python-Dev] Internationalization Toolkit [ note: you sent direct to me; replying in kind in case that was your intent ] On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote: >... > Let me tell you why you would want to have an encoding > which can be set: >...snip: two examples of how "print" fails... Neither of those examples are solid reasons for having a default encoding that can be changed. Both can easily be altered at the Python level by using an encoding function before printing. You're asking for convenience, *not* providing a reason. > Java solves this (and we could too) using Writer > classes which are created as wrappers around an output > stream and can take an encoding, but you lose the > flexibility to just print. Not flexibility: convenience. You can certainly do: print encode(u,'Shift-JIS') > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. Useful, but not a requirement. Keep the interpreter simple, understandable, and predictable. A module that changes the default over to 'utf-8' because it is interacting with a network object is going to screw up your app if you're relying on an encoding of 'shift-jis' to be present. Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Thu Nov 11 13:49:10 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com> > You almost convinced me there, but I think this can > still be done > without changing the default encoding: simply reopen > stdout with a > different encoding. This is how Java does it. I/O > streams with an > encoding specified at open() are a very powerful > feature. You can > hide this in your $PYTHONSTARTUP. Good point, I'm happy with this. Make sure we specify it in the docs as the right way to do it. In an IDE, we'd have an Options screen somewhere for the output encoding. What the Java code I have seen does is to open a raw file and construct wrappers (InputStreamReader, OutputStreamWriter) around it to do an encoding conversion. This kind of obfuscates what is going on - Python just needs the extra argument. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Thu Nov 11 13:42:51 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 13:42:51 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> Message-ID: <382AB9CB.634A9782@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > def encode(self,u): > > > > """ Return the Unicode object u encoded as Python string. > > This should accept an optional slice parameter, and use it in the > same way as .dump(). Ok. > > def dump(self,u,stream,slice=None): > ... 
> > def load(self,stream,length=None): > > Why not have something like .wrapFile(f) that returns a file-like > object with all the file methods implemented, and doing to "right > thing" regarding encoding/decoding? That way, the new file-like > object can be used directly with code that works with files and > doesn't care whether it uses 8-bit or unicode strings. See File Output of the latest version: File/Stream Output: ------------------- Since file.write(object) and most other stream writers use the 's#' argument parsing marker, the buffer interface implementation determines the encoding to use (see Buffer Interface). For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. > > Codecs should raise an UnicodeError in case the conversion is > > not possible. > > I think that should be ValueError, or UnicodeError should be a > subclass of ValueError. Ok. > (Can the -X interpreter option be removed yet?) Doesn't Python convert class exceptions to strings when -X is used ? I would guess that many scripts already rely on the class based mechanism (much of my stuff does for sure), so by the time 1.6 is out, I think -X should be considered an option to run pre 1.5 code rather than using it for performance reasons. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 14:01:40 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 14:01:40 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <382ABE34.5D27C701@lemburg.com> Mark Hammond wrote: > > Marc writes: > > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > > For example, if we went with UTF-8 as the only default encoding, then > HP may be forced to use a helper function to perform the conversion, > rather than the built-in functions. This helper function can use TLS > (in Python) to store the encoding. At least it is localized. > > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. 
You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? > > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it and if you do you are on your own when something breaks. Think of it as a HP specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Thu Nov 11 16:02:32 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > For explicit handling of Unicode using files, the unicodec module > could provide stream wrappers which provide transparent > encoding/decoding for any open stream (file-like object): Sounds good to me! I guess I just missed, there's been so much going on lately. > XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. Actually, I'd call it unicodec.open(). I asked: > (Can the -X interpreter option be removed yet?) You commented: > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. Gosh, I never thought of it as a performance issue! What I'd like to do is avoid code like this: try: class UnicodeError(ValueError): # well, something would probably go here... pass except TypeError: class UnicodeError: # something slightly different for this one... pass Trying to use class exceptions can be really tedious, and often I'd like to pick up the stuff from Exception. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Thu Nov 11 15:21:50 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:21:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2c0d$8b866160$262d153f@tim> Message-ID: <382AD0FE.B604876A@lemburg.com> Tim Peters wrote: > > [/F, dripping with code] > > ... > > Note that the 'u' must be followed by four hexadecimal digits. If > > fewer digits are given, the sequence is left in the resulting string > > exactly as given. > > Yuck -- don't let probable error pass without comment. "must be" == "must > be"! I second that. > [moving backwards] > > \uxxxx -- Unicode character with hexadecimal value xxxx. 
The > > character is stored using UTF-8 encoding, which means that this > > sequence can result in up to three encoded characters. > > The code is fine, but I've gotten confused about what the intent is now. > Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 > literals, but now he's got Unicode-escaped literals instead -- and you favor > an internal 2-byte-per-char Unicode storage format. In that combination of > worlds, is there any use in the *language* (as opposed to in a runtime > module) for \uxxxx -> UTF-8 conversion? No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates). > And MAL, if you're listening, I'm not clear on what a Unicode-escaped > literal means. When you had UTF-8 literals, the meaning of something like > > u"a\340\341" > > was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals > were just a way of specifying a byte stream. As a Unicode-escaped string, I > assume the "a" maps to the Unicode "a", but what of the rest? Are the octal > escapes to be taken as two separate Latin-1 characters (in their role as a > Unicode subset), or as an especially clumsy way to specify a single 16-bit > Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x > escapes. Good points. The conversion goes as follows: ? for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as Unicode ordinal ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead > One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? > There probably should be; and while Guido will hate this, a ur string should > probably *not* leave \uxxxx escapes untouched. Nasties like this are why > Java defines \uxxxx expansion as occurring in a preprocessing step. Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-) > BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or > isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard: u = unicode('...string with UTF-8 encoded characters...','utf-8') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:23:45 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:23:45 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000601bf2c11$e4b07920$262d153f@tim> Message-ID: <382AD171.D22A1D6E@lemburg.com> Tim Peters wrote: > > [MAL, on Unicode chr() and ord() > > ... > > Because unichr() will always have to return Unicode objects. You don't > > want chr(i) to return Unicode for i>255 and strings for i<256. > > Indeed I do not! > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > I think should be -- it's a good & natural use of polymorphism; introducing > a new function *here* would be as odd as introducing a unilen() function to > get the length of a Unicode string. 
Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:36:41 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:36:41 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000901bf2c16$8a107420$262d153f@tim> Message-ID: <382AD479.5261B43B@lemburg.com> Tim Peters wrote: > > [Mark Hammond] > > Sure - that is what this customer wants, but we need to be clear about > > the "best thing" for Python generally versus what this particular > > client wants. > > ... > > Having a fixed, default encoding may make life slightly more difficult > > when you want to work primarily in a different encoding, but at least > > your system is predictable and reliable. > > Well said, Mark! Me too. It's like HP is suffering from Windows envy > <wink>. See my other post on the subject... Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings repesented in Unicode. I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write file.write(u.encode('utf-8')) instead of file.write(u) # let's hope this goes out as UTF-8... Using the <default encoding> as site dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:26:59 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:26:59 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000801bf2c16$43f9a4c0$262d153f@tim> Message-ID: <382AD233.BE6DE888@lemburg.com> Tim Peters wrote: > > [/F] > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? > > [MAL] > > No, but people are already thinking about it and there is > > a defined range in the >16-bit area for private encodings > > (F0000..FFFFD and 100000..10FFFD). > > Over the decades I've developed a rule of thumb that has never wound up > stuck in my ass <wink>: If I engineer code that I expect to be in use for N > years, I make damn sure that every internal limit is at least 10x larger > than the largest I can conceive of a user making reasonable use of at the > end of those N years. The invariable result is that the N years pass, and > fewer than half of the users have bumped into the limit <0.5 wink>. > > At the risk of offending everyone, I'll suggest that, qualitatively > speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just > replaced "256 characters?! We'll *never* run out of those!" with 64K. But > when Asian languages consume them 7K at a pop, 64K isn't even in my 10x > comfort range for some individual languages. In just a few months, Unicode > 3 will already have used up > 56K of the 64K slots. > > As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x > zone, for about a decade. 
If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M character at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:47:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:47:49 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us> Message-ID: <382AD715.66DBA125@lemburg.com> Guido van Rossum wrote: > > > Let me tell you why you would want to have an encoding > > which can be set: > > > > (1) sday I am on a Japanese Windows box, I have a > > string called 'address' and I do 'print address'. If > > I see utf8, I see garbage. If I see Shift-JIS, I see > > the correct Japanese address. At this point in time, > > utf8 is an interchange format but 99% of the world's > > data is in various native encodings. > > > > Analogous problems occur on input. > > > > (2) I'm using htmlgen, which 'prints' objects to > > standard output. My web site is supposed to be > > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > > etc.) Yes, browsers CAN detect and display UTF8 but > > you just don't find UTF8 sites in the real world - and > > most users just don't know about the encoding menu, > > and will get pissed off if they have to reach for it. > > > > Ditto for streaming output in some protocol. > > > > Java solves this (and we could too by hacking stdout) > > using Writer classes which are created as wrappers > > around an output stream and can take an encoding, but > > you lose the flexibility to 'just print'. > > > > I think being able to change encoding would be useful. > > What I do not want is to auto-detect it from the > > operating system when Python boots - that would be a > > portability nightmare. > > You almost convinced me there, but I think this can still be done > without changing the default encoding: simply reopen stdout with a > different encoding. This is how Java does it. I/O streams with an > encoding specified at open() are a very powerful feature. You can > hide this in your $PYTHONSTARTUP. True and it probably covers all cases where setting the default encoding to something other than UTF-8 makes sense. I guess you've convinced me there ;-) The current proposal has wrappers around stream for this purpose: For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. 
The above can be done using: import sys,unicodec sys.stdin = unicodec.stream(sys.stdin,'jis') sys.stdout = unicodec.stream(sys.stdout,'jis') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack at oratrix.nl Thu Nov 11 16:58:39 1999 From: jack at oratrix.nl (Jack Jansen) Date: Thu, 11 Nov 1999 16:58:39 +0100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com> Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> > > [MAL, on Unicode chr() and ord() > > > ... > > > Because unichr() will always have to return Unicode objects. You don't > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > OTOH, ord() could probably be extended to also work on Unicode objects. > Fine. So I'll drop the uniord() API and extend ord() instead. Hmm, then wouldn't it be more logical to drop unichr() too, but add an optional parameter to chr() to specify what sort of a string you want? The type-object of a unicode string comes to mind... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From bwarsaw at cnri.reston.va.us Thu Nov 11 17:04:29 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us> >>>>> "M" == M <mal at lemburg.com> writes: M> Doesn't Python convert class exceptions to strings when -X is M> used ? I would guess that many scripts already rely on the M> class based mechanism (much of my stuff does for sure), so by M> the time 1.6 is out, I think -X should be considered an option M> to run pre 1.5 code rather than using it for performance M> reasons. This is a little off-topic so I'll be brief. When using -X Python never even creates the class exceptions, so it isn't really a conversion. It just uses string exceptions and tries to craft tuples for what would be the superclasses in the class-based exception hierarchy. Yes, class-based exceptions are a bit of a performance hit when you are catching exceptions in Python (because they need to be instantiated), but they're just so darn *useful*. I wouldn't mind seeing the -X option go away for 1.6. -Barry From captainrobbo at yahoo.com Thu Nov 11 17:08:15 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> > See my other post on the subject... > > Note that if we make UTF-8 the standard encoding, > nearly all > special Latin-1 characters will produce UTF-8 errors > on input > and unreadable garbage on output. That will probably > be unacceptable > in Europe. To remedy this, one would *always* have > to use > u.encode('latin-1') to get readable output for > Latin-1 strings > repesented in Unicode. You beat me to it - a colleague and I were just discussing this verbally. 
Specifically we Brits will get annoyed as soon as we read in a text file with pound (sterling) signs. We concluded that the only reasonable default (if you have one at all) is pure ASCII. At least that way I will get a clear and intelligible warning when I load in such a file, and will remember to specify ISO-Latin-1. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Thu Nov 11 16:59:21 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 16:59:21 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <382AE7D9.147D58CB@lemburg.com> I wonder how we could add %-formatting to Unicode strings without duplicating the PyString_Format() logic. First, do we need Unicode object %-formatting at all ? Second, here is an emulation using strings and <default encoding> that should give an idea of one could work with the different encodings: s = '%s %i abc???' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string via Unicode s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) Note that .encode() defaults to the current setting of <default encoding>. Provided u maps to Latin-1, an alternative would be: u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 18:04:37 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:04:37 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> Message-ID: <382AF725.FC66C9B6@lemburg.com> Jack Jansen wrote: > > > > [MAL, on Unicode chr() and ord() > > > > ... > > > > Because unichr() will always have to return Unicode objects. You don't > > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > > Fine. So I'll drop the uniord() API and extend ord() instead. > > Hmm, then wouldn't it be more logical to drop unichr() too, but add an > optional parameter to chr() to specify what sort of a string you want? The > type-object of a unicode string comes to mind... Like: import types uc = chr(12,types.UnicodeType) ... looks overly complicated, IMHO. uc = unichr(12) and u = unicode('abc') look pretty intuitive to me. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
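As a tiny sketch of the spelling being defended here, assuming the proposed unichr() and unicode() constructors and the extended ord() behave as described in this thread (the specific code point is just an example):

    uc = unichr(0x20AC)        # one-character Unicode string, U+20AC (EURO SIGN)
    assert ord(uc) == 0x20AC   # ord() extended to accept Unicode characters
    assert ord('a') == 97      # ...and still working on plain 8-bit strings
    u = unicode('abc')         # a plain ASCII string promoted to a Unicode object
    assert len(u) == 3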
From mal at lemburg.com Thu Nov 11 18:31:34 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:31:34 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> Message-ID: <382AFD76.A0D3FEC4@lemburg.com> Andy Robinson wrote: > > > See my other post on the subject... > > > > Note that if we make UTF-8 the standard encoding, > > nearly all > > special Latin-1 characters will produce UTF-8 errors > > on input > > and unreadable garbage on output. That will probably > > be unacceptable > > in Europe. To remedy this, one would *always* have > > to use > > u.encode('latin-1') to get readable output for > > Latin-1 strings > > repesented in Unicode. > > You beat me to it - a colleague and I were just > discussing this verbally. Specifically we Brits will > get annoyed as soon as we read in a text file with > pound (sterling) signs. > > We concluded that the only reasonable default (if you > have one at all) is pure ASCII. At least that way I > will get a clear and intelligible warning when I load > in such a file, and will remember to specify > ISO-Latin-1. Well, Guido's post made me rethink the approach... 1. Setting <default encoding> to any non UTF encoding will result in data lossage due to the encoding limits imposed by the other formats -- this is dangerous and will result in errors (some of which may not even be noticed due to the interpreter ignoring them) in case your strings use non encodable characters. 2. You basically only want to set <default encoding> to anything other than UTF-8 for stream input and output. This can be done using the unicodec stream wrapper without too much inconvenience. (We'll have to extend the wrapper a little, though, because it currently only accept Unicode objects for writing and always return Unicode object when reading.) 3. We should leave the issue open until some code is there to be tested... I have a feeling that there will be quite a few strange effects when APIs expecting strings are fed with Unicode objects returning UTF-8.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Fri Nov 12 02:10:09 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 12:10:09 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382ABE34.5D27C701@lemburg.com> Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat> > Mark Hammond wrote: > > Having a fixed, default encoding may make life slightly > more difficult > > when you want to work primarily in a different encoding, > but at least > > your system is predictable and reliable. > > I think the discussion on this is getting a little too hot. Really - I see it as moving to a rational consensus that doesnt support the proposal in this regard. I see no heat in it at all. Im sorry if you saw my post or any of the followups as "emotional", but I certainly not getting passionate about this. I dont see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because a we are trying to get it right doesnt mean we are getting heated. > The point > is simply that the option of changing the per-thread default encoding > is there. You are not required to use it and if you do you are on > your own when something breaks. Hrm - Im having serious trouble following your logic here. If make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I wont necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP: > Think of it as a HP specific feature... perhaps I should wrap the code > in #ifdefs and leave it undocumented. That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely dissapointed if the fact that HP are throwing money for this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this. Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference. Maybe Im missing something. Can you explain why it is good enough everyone else to be required to assume there is no default encoding, but HP get their thread specific global? Are their requirements greater than anyone elses? Is everyone else not as important? What would you, as a consultant, recommend to people who arent HP, but have a similar requirement? It would seem obvious to me that HPs requirement can be met in "pure Python", thereby keeping this out of the core all together... Mark. From gmcm at hypernet.com Fri Nov 12 03:01:23 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 11 Nov 1999 21:01:23 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> References: <382ABE34.5D27C701@lemburg.com> Message-ID: <1269750417-7621469@hypernet.com> [per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. 
OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon From tim_one at email.msn.com Fri Nov 12 06:04:44 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:04:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim> [MAL] >>> Codecs should raise an UnicodeError in case the conversion is >>> not possible. [Fred L. Drake, Jr.] >> I think that should be ValueError, or UnicodeError should be a >> subclass of ValueError. >> (Can the -X interpreter option be removed yet?) [MAL] > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. -X is a red herring. That is, do what seems best without regard for -X. I already added one subclass exception to the CVS tree (UnboundLocalError as a subclass of NameError), and in doing that had to figure out how to make it do the right thing under -X too. It's a bit clumsy to arrange, but not a problem. From tim_one at email.msn.com Fri Nov 12 06:18:09 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:18:09 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382AD0FE.B604876A@lemburg.com> Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> [MAL] > ... > The conversion goes as follows: > ? for single characters (and this includes all \XXX sequences > except \uXXXX), take the ordinal and interpret it as Unicode > ordinal for \uXXXX sequences, insert the Unicode character > with ordinal 0xXXXX instead Perfect! [about "raw" Unicode strings] > ... > Not sure whether we really need to make this even more complicated... > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > filenames won't hurt much in the context of those \uXXXX monsters :-) Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventurally become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever. > ... > BTW, if you want to type in UTF-8 strings and have them converted > to Unicode, you can use the standard: > > u = unicode('...string with UTF-8 encoded characters...','utf-8') That's what I figured, and thanks for the confirmation. 
From tim_one at email.msn.com Fri Nov 12 06:42:32 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:42:32 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD233.BE6DE888@lemburg.com> Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> [MAL] > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > signal failure of this assertion at Unicode object construction time > via an exception. That way we are within the standard, can use > reasonably fast code for Unicode manipulation and add those extra 1M > character at a later stage. I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better. From tim_one at email.msn.com Fri Nov 12 07:02:01 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:02:01 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD479.5261B43B@lemburg.com> Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim> [MAL] > Note that if we make UTF-8 the standard encoding, nearly all > special Latin-1 characters will produce UTF-8 errors on input > and unreadable garbage on output. That will probably be unacceptable > in Europe. To remedy this, one would *always* have to use > u.encode('latin-1') to get readable output for Latin-1 strings > repesented in Unicode. I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. > I'd rather see this happen the other way around: *always* explicitly > state the encoding you want in case you rely on it, e.g. write > > file.write(u.encode('utf-8')) > > instead of > > file.write(u) # let's hope this goes out as UTF-8... By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1 > Using the <default encoding> as site dependent setting is useful > for convenience in those cases where the output format should be > readable rather than parseable. Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. 
Since there doesn't appear to be anything (either or good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core. From tim_one at email.msn.com Fri Nov 12 07:17:08 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:17:08 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim> [Mark Hammond] > ... > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. I can resolve this easily, but only with input from Guido. Guido, did HP's check clear yet? If so, we can ignore them <wink>. From captainrobbo at yahoo.com Fri Nov 12 09:15:19 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> --- Gordon McMillan <gmcm at hypernet.com> wrote: > [per-thread defaults] > > C'mon guys, hasn't anyone ever played consultant > before? The > idea is obviously brain-dead. OTOH, they asked for > it > specifically, meaning they have some assumptions > about how > they think they're going to use it. If you give them > what they > ask for, you'll only have to fix it when they > realize there are > other ways of doing things that don't work with > per-thread > defaults. So, you find out why they think it's a > good thing; you > make it easy for them to code this way (without > actually using > per-thread defaults) and you don't make a fuss about > it. More > than likely, they won't either. > I wrote directly to ask them exactly this last night. Let's forget the per-thread thing until we get an answer. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Fri Nov 12 10:27:29 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:27:29 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> Message-ID: <382BDD81.458D3125@lemburg.com> Tim Peters wrote: > > [MAL] > > ... > > The conversion goes as follows: > > ? for single characters (and this includes all \XXX sequences > > except \uXXXX), take the ordinal and interpret it as Unicode > > ordinal for \uXXXX sequences, insert the Unicode character > > with ordinal 0xXXXX instead > > Perfect! Thanks :-) > [about "raw" Unicode strings] > > ... > > Not sure whether we really need to make this even more complicated... > > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > > filenames won't hurt much in the context of those \uXXXX monsters :-) > > Alas, this won't stand over the long term. Eventually people will write > Python using nothing but Unicode strings -- "regular strings" will > eventurally become a backward compatibility headache <0.7 wink>. IOW, > Unicode regexps and Unicode docstrings and Unicode formatting ops ... > nothing will escape. Nor should it. 
> > I don't think it all needs to be done at once, though -- existing languages > usually take years to graft in gimmicks to cover all the fine points. So, > happy to let raw Unicode strings pass for now, as a relatively minor point, > but without agreeing it can be ignored forever. Agreed... note that you could also write your own codec for just this reason and then use: u = unicode('....\u1234...\...\...','raw-unicode-escaped') Put that into a function called 'ur' and you have: u = ur('...\u4545...\...\...') which is not that far away from ur'...' w/r to cosmetics. > > ... > > BTW, if you want to type in UTF-8 strings and have them converted > > to Unicode, you can use the standard: > > > > u = unicode('...string with UTF-8 encoded characters...','utf-8') > > That's what I figured, and thanks for the confirmation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:00:47 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:00:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> Message-ID: <382BD73E.E6729C79@lemburg.com> Andy Robinson wrote: > > --- Gordon McMillan <gmcm at hypernet.com> wrote: > > [per-thread defaults] > > > > C'mon guys, hasn't anyone ever played consultant > > before? The > > idea is obviously brain-dead. OTOH, they asked for > > it > > specifically, meaning they have some assumptions > > about how > > they think they're going to use it. If you give them > > what they > > ask for, you'll only have to fix it when they > > realize there are > > other ways of doing things that don't work with > > per-thread > > defaults. So, you find out why they think it's a > > good thing; you > > make it easy for them to code this way (without > > actually using > > per-thread defaults) and you don't make a fuss about > > it. More > > than likely, they won't either. > > > > I wrote directly to ask them exactly this last night. > Let's forget the per-thread thing until we get an > answer. That's the way to go, Andy. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:44:14 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:44:14 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <382BE16E.D17C80E1@lemburg.com> Mark Hammond wrote: > > > Mark Hammond wrote: > > > Having a fixed, default encoding may make life slightly > > more difficult > > > when you want to work primarily in a different encoding, > > but at least > > > your system is predictable and reliable. > > > > I think the discussion on this is getting a little too hot. > > Really - I see it as moving to a rational consensus that doesnt > support the proposal in this regard. I see no heat in it at all. Im > sorry if you saw my post or any of the followups as "emotional", but I > certainly not getting passionate about this. I dont see any of this > as affecting me personally. I believe that I can replace my Unicode > implementation with this either way we go. Just because a we are > trying to get it right doesnt mean we are getting heated. Naa... 
with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one. > > The point > > is simply that the option of changing the per-thread default > encoding > > is there. You are not required to use it and if you do you are on > > your own when something breaks. > > Hrm - Im having serious trouble following your logic here. If make > _any_ assumptions about a default encoding, I am in danger of > breaking. I may not choose to change the default, but as soon as > _anyone_ does, unrelated code may break. > > I agree that I will be "on my own", but I wont necessarily have been > the one that changed it :-( Sure there are some very subtile dangers in setting the default to anything other than the default ;-) For some this risk may be worthwhile taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them. > The only answer I can see is, as you suggest, to ignore the fact that > there is _any_ default. Always specify the encoding. But obviously > this is not good enough for HP: > > > Think of it as a HP specific feature... perhaps I should wrap the > code > > in #ifdefs and leave it undocumented. > > That would work - just ensure that no standard Python has those > #ifdefs turned on :-) I would be sorely dissapointed if the fact that > HP are throwing money for this means they get every whim implemented > in the core language. Imagine the outcry if it were instead MS' > money, and you were attempting to put an MS spin on all this. > > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. > > Maybe Im missing something. Can you explain why it is good enough > everyone else to be required to assume there is no default encoding, > but HP get their thread specific global? Are their requirements > greater than anyone elses? Is everyone else not as important? What > would you, as a consultant, recommend to people who arent HP, but have > a similar requirement? It would seem obvious to me that HPs > requirement can be met in "pure Python", thereby keeping this out of > the core all together... Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:04:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:04:31 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001001bf2cd3$6fa57820$fd2d153f@tim> Message-ID: <382BD81F.B2BC896A@lemburg.com> Tim Peters wrote: > > [MAL] > > Note that if we make UTF-8 the standard encoding, nearly all > > special Latin-1 characters will produce UTF-8 errors on input > > and unreadable garbage on output. That will probably be unacceptable > > in Europe. 
To remedy this, one would *always* have to use > > u.encode('latin-1') to get readable output for Latin-1 strings > > repesented in Unicode. > > I think it's time for the Europeans to pronounce on what's acceptable in > Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy > with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. Agreed. > > I'd rather see this happen the other way around: *always* explicitly > > state the encoding you want in case you rely on it, e.g. write > > > > file.write(u.encode('utf-8')) > > > > instead of > > > > file.write(u) # let's hope this goes out as UTF-8... > > By the same argument, those pesky Europeans who are relying on Latin-1 > should write > > file.write(u.encode('latin-1')) > > instead of > > file.write(u) # let's hope this goes out as Latin-1 Right. > > Using the <default encoding> as site dependent setting is useful > > for convenience in those cases where the output format should be > > readable rather than parseable. > > Well, "convenience" is always the argument advanced in favor of modes. > Conflicts and nasty intermittent bugs are always the result. The latter > will happen under Guido's idea too, as various careless modules rebind stdin > & stdout to their own ideas of what "the proper" encoding should be. But at > least the blame doesn't fall on the core language then <0.3 wink>. > > Since there doesn't appear to be anything (either or good or bad) you can do > (or avoid) by using Guido's scheme instead of magical core thread state, > there's no *need* for the latter. That is, it can be done with a user-level > API without involving the core. Dito :-) I have nothing against telling people to take care about the problem in user space (meaning: not done by the core interpreter) and I'm pretty sure that HP will agree on this too, provided we give them the proper user space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:16:57 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:16:57 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <382BDB09.55583F28@lemburg.com> Tim Peters wrote: > > [MAL] > > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > > signal failure of this assertion at Unicode object construction time > > via an exception. That way we are within the standard, can use > > reasonably fast code for Unicode manipulation and add those extra 1M > > character at a later stage. > > I think this is reasonable. > > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness, that deserves a closer look (it's an ingenious > encoding scheme that works correctly with a surprising number of existing > 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by > adding a simple finger (i.e., store along with the string an index+offset > pair identifying the most recent position indexed to -- since string > indexing is overwhelmingly sequential, this makes most indexing > constant-time; and UTF-8 can be scanned either forward or backward from a > random internal point because "the first byte" of each encoding is > recognizable as such). Here are some arguments for using the proposed UTF-16 strategy instead: ? 
all characters have the same length; indexing is fast ? conversion APIs to platform dependent wchar_t implementation are fast because they either can simply copy the content or extend the 2-bytes to 4 byte ? UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages ? from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16." Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer. > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 11:20:16 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > <IMO> > Since this is the first time a Python Consortium member is > pushing development, I think we can learn a lot here. For one, > it should be clear that money doesn't buy everything, OTOH, > we cannot put the whole thing at risk just because > of some minor disagreement that cannot be solved between the > parties. The standard solution for the latter should be a > customized Python interpreter. > </IMO> hehe... funny you mention this. Go read the Consortium docs. Last time that I read them, there are no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 11:24:56 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120221010.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Sure there are some very subtile dangers in setting the default > to anything other than the default ;-) For some this risk may > be worthwhile taking, for others not. In fact, in large projects > I would never take such a risk... I'm sure we can get this > message across to them. It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... 
discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 11:30:04 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Fri Nov 12 11:30:28 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 11:30:28 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> Message-ID: <382BEC44.A2541C7E@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > <IMO> > > Since this is the first time a Python Consortium member is > > pushing development, I think we can learn a lot here. For one, > > it should be clear that money doesn't buy everything, OTOH, > > we cannot put the whole thing at risk just because > > of some minor disagreement that cannot be solved between the > > parties. The standard solution for the latter should be a > > customized Python interpreter. > > </IMO> > > hehe... funny you mention this. Go read the Consortium docs. Last time > that I read them, there are no "parties" to reach consensus. *Every* > technical decision regarding the Python language falls to the Technical > Director (Guido, of course). I looked. I found nothing that can override > the T.D.'s decisions and no way to force a particular decision. > > Guido is still the Benevolent Dictator :-) Sure, but have you considered the option of a member simply bailing out ? HP could always stop funding Unicode integration. That wouldn't help us either... > Cheers, > -g > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > title does have the word Benevolent in it, so things are cool... 
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 11:39:45 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BEC44.A2541C7E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: >... > Sure, but have you considered the option of a member simply bailing > out ? HP could always stop funding Unicode integration. That wouldn't > help us either... I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen. >... > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > > title does have the word Benevolent in it, so things are cool... Cheers, -g -- Greg Stein, http://www.lyra.org/ From Mike.Da.Silva at uk.fid-intl.com Fri Nov 12 12:00:49 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:00:49 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Most of the ASCII string functions do indeed work for UTF-8. I have made extensive use of this feature when writing translation logic to harmonize ASCII text (an SQL statement) with substitution parameters that must be converted from IBM EBCDIC code pages (5035, 1027) into UTF-8. Since UTF-8 is a superset of ASCII, this all works fine. Some of the character classification functions etc. can be flaky when used with UTF-8 characters outside the ASCII range, but simple string operations work fine. As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an internal string representation are as follows (see the sketch after this list for a short illustration): 1. UTF-8 allows all characters to be displayed (in some form or other) on the user's machine, with or without native fonts installed. Naturally anything outside the ASCII range will be garbage, but it is an immense debugging aid when working with character encodings to be able to touch and feel something recognizable. Trying to decode a block of raw UTF-16 is a pain. 2. UTF-8 works with most existing string manipulation libraries quite happily. It is also portable (a char is always 8 bits, regardless of platform; wchar_t varies between 16 and 32 bits depending on the underlying operating system, although unsigned short does seem to work across platforms, in my experience). 3. UTF-16 has some advantages in providing fixed-width characters and (ignoring surrogate pairs etc.) a modeless encoding space. This is an advantage for fast string operations, especially on CPUs that have efficient operations for handling 16-bit data. 4. UTF-16 would directly support a tightly coupled character properties engine, which would enable Unicode-compliant case folding and character decomposition to be performed without an intermediate UTF-8 <----> UTF-16 translation step. 5. UTF-16 requires string operations that do not make assumptions about nulls - this means re-implementing most of the C runtime functions to work with unsigned shorts.
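[A brief illustrative sketch of points 2, 3 and 5 above. It assumes the unicode() constructor and .encode() method from the proposal under discussion, plus 'latin-1', 'utf-8' and 'utf-16-le' codec names; none of this existed in the distribution at the time of this thread, so treat it as pseudocode against the proposed API.]

    # Sketch only -- relies on the proposed unicode() constructor, the
    # proposed .encode() method, and assumed codec names.
    u = unicode("Gr\374\337e", "latin-1")   # 5 characters, 2 of them non-ASCII

    len(u)                        # -> 5, counted in characters
    len(u.encode("utf-8"))        # -> 7, each non-ASCII character takes 2 bytes
    len(u.encode("utf-16-le"))    # -> 10, every character takes exactly 2 bytes

    # Point 2: UTF-8 is an ASCII superset, so pure-ASCII data passes through unchanged.
    unicode("abc", "latin-1").encode("utf-8")    # -> 'abc'

    # Point 5: the UTF-16 form is full of NUL bytes, so strlen()-style C code
    # (anything that treats NUL as a terminator) cannot be reused on it.
    u.encode("utf-16-le")[:2]     # -> 'G\000'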
Regards, Mike da Silva -----Original Message----- From: Greg Stein [SMTP:gstein at lyra.org] Sent: 12 November 1999 10:30 To: Tim Peters Cc: python-dev at python.org Subject: RE: [Python-Dev] Internationalization Toolkit On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ _______________________________________________ Python-Dev maillist - Python-Dev at python.org http://www.python.org/mailman/listinfo/python-dev From fredrik at pythonware.com Fri Nov 12 12:23:24 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:24 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> > Besides, the Unicode object will have a buffer containing the > <default encoding> representation of the object, which, if all goes > well, will always hold the UTF-8 value. <rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...) > RE engines etc. can then directly work with this buffer. sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu). > > I expect either would work well. It's at least curious that Perl and Tcl > > both went with UTF-8 -- does anyone think they know *why*? I don't. The > > people here saying UCS-2 is the obviously better choice are all from the > > Microsoft camp <wink>. (hey, I'm not a microsofter. but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. 
</F> From fredrik at pythonware.com Fri Nov 12 12:23:38 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:38 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com> > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ (and you can replace "unsigned short" with "whatever's suitable on this platform") </F> From fredrik at pythonware.com Fri Nov 12 12:36:03 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:36:03 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com> > Guido is a fair and reasonable Dictator... he wouldn't let that > happen. ...but where is he when we need him? ;-) </F> From Mike.Da.Silva at uk.fid-intl.com Fri Nov 12 12:43:21 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:43:21 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Fredrik Lundh wrote: > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ <http://www.pythonware.com/madscientist/> (and you can replace "unsigned short" with "whatever's suitable on this platform") Surely using a different type on different platforms means that we throw away the concept of a platform independent Unicode string? I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. Does this mean that to transfer a file between a Windows box and Solaris, an implicit conversion has to be done to go from 16 bits to 32 bits (and vice versa)? What about byte ordering issues? Or do you mean whatever 16 bit data type is available on the platform, with a standard (platform independent) byte ordering maintained? Mike da S From fredrik at pythonware.com Fri Nov 12 13:16:24 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 13:16:24 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Mike wrote: > Surely using a different type on different platforms means that we throw > away the concept of a platform independent Unicode string? > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. so? the interchange format doesn't have to be the same as the internal format, does it? > Does this mean that to transfer a file between a Windows box and Solaris, an > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > versa)? What about byte ordering issues? no problem at all: unicode has special byte order marks for this purpose (and utf-8 doesn't care, of course). > Or do you mean whatever 16 bit data type is available on the platform, with > a standard (platform independent) byte ordering maintained? 
well, my preference is a 16-bit data type in the plat- form's native byte order (exactly how it's done in the unicode module -- for the moment, it can use the platform's wchar_t, but only if it happens to be a 16-bit unsigned type). gives you good performance, compact storage, and cleanest possible code. ... anyway, I think it would help the discussion a little bit if people looked at (and played with) the existing code base. at least that'll change arguments like "but then we have to implement that" to "but then we have to maintain that code" ;-) </F> From captainrobbo at yahoo.com Fri Nov 12 13:13:03 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112121303.27452.rocketmail@ web605.yahoomail.com> --- "Da Silva, Mike" <Mike.Da.Silva at uk.fid-intl.com> wrote: > As I see it, the relative pros and cons of UTF-8 > versus UTF-16 for use as an > internal string representation are: > [snip] > Regards, > Mike da Silva > Note that by going with UTF16, we get both. We will certainly have a codec for utf8, just as we will for ISO-Latin-1, Shift-JIS or whatever. And a perfectly ordinary Python string is a great place to hold UTF8; you can look at it and use most of the ordinary string algorithms on it. I presume no one is actually advocating dropping ordinary Python strings, or the ability to do rawdata = open('myfile.txt', 'rb').read() without any transformations? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mhammond at skippinet.com.au Fri Nov 12 13:27:19 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 23:27:19 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat> /F writes > anyway, I think it would help the discussion a little bit > if people looked at (and played with) the existing code > base. at least that'll change arguments like "but then > we have to implement that" to "but then we have to > maintain that code" ;-) I second that. It is good enough for me (although my requirements arent stringent) - its been used on CE, so would slot directly into the win32 stuff. It is pretty much the consensus of the string-sig of last year, but as code! The only "problem" with it is the code that hasnt been written yet, specifically: * Encoders as streams, and a concrete proposal for them. * Decent PyArg_ParseTuple support and Py_BuildValue support. * The ord(), chr() stuff, and other stuff around the edges no doubt. Couldnt we start with Fredriks implementation, and see how the rest turns out? Even if we do choose to change the underlying Unicode implementation to use a different native encoding, the interface to the PyUnicode_Type would remain pretty similar. The advantage is that we have something now to start working with for the rest of the support we need. Mark. From mal at lemburg.com Fri Nov 12 13:38:44 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 13:38:44 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.4 Message-ID: <382C0A54.E6E8328D@lemburg.com> I've uploaded a new version of the proposal which incorporates a lot of what has been discussed on the list. Thanks to everybody who helped so far. Note that I have extended the list of references for those who want to join in, but are in need of more background information. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ ) ? support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings. ? support for numbers, digits, whitespace, etc. ? support (or no support) for private code point areas ? should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>: s = '%s %i abc???' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) ? specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 14:11:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 14:11:26 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> Message-ID: <382C11FE.D7D9F916@lemburg.com> Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > <default encoding> representation of the object, which, if all goes > > well, will always hold the UTF-8 value. > > <rant> > > over my dead body, that one... Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 14:01:28 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 14:01:28 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <382C0FA8.ACB6CCD6@lemburg.com> Fredrik Lundh wrote: > > Mike wrote: > > Surely using a different type on different platforms means that we throw > > away the concept of a platform independent Unicode string? > > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. > > so? the interchange format doesn't have to be > the same as the internal format, does it? The interchange format (marshal + pickle) is defined as UTF-8, so there's no problem with endianness or missing bits w/r to shipping Unicode data from one platform to another. > > Does this mean that to transfer a file between a Windows box and Solaris, an > > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > > versa)? What about byte ordering issues? > > no problem at all: unicode has special byte order > marks for this purpose (and utf-8 doesn't care, of > course). Access to this mark will go into sys: sys.bom. > > Or do you mean whatever 16 bit data type is available on the platform, with > > a standard (platform independent) byte ordering maintained? > > well, my preference is a 16-bit data type in the plat- > form's native byte order (exactly how it's done in the > unicode module -- for the moment, it can use the > platform's wchar_t, but only if it happens to be a > 16-bit unsigned type). gives you good performance, > compact storage, and cleanest possible code. The 0.4 proposal fixes this to 16-bit unsigned short using UTF-16 encoding with checks for surrogates. This covers all defined standard Unicode character points, is fast, etc. pp... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 12:15:15 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 12:15:15 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <382BF6C3.D79840EC@lemburg.com> "Da Silva, Mike" wrote: > > Most of the ASCII string functions do indeed work for UTF-8. I have made > extensive use of this feature when writing translation logic to harmonize > ASCII text (an SQL statement) with substitution parameters that must be > converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is > a superset of ASCII, this all works fine. > > Some of the character classification functions etc can be flaky when used > with UTF8 characters outside the ASCII range, but simple string operations > work fine. That's why there's the <defencbuf> buffer which holds the UTF-8 encoded value... > As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an > internal string representation are: > > 1. UTF-8 allows all characters to be displayed (in some form or other) > on the users machine, with or without native fonts installed. Naturally > anything outside the ASCII range will be garbage, but it is an immense > debugging aid when working with character encodings to be able to touch and > feel something recognizable. Trying to decode a block of raw UTF-16 is a > pain. True. > 2. UTF-8 works with most existing string manipulation libraries quite > happily. 
It is also portable (a char is always 8 bits, regardless of > platform; wchar_t varies between 16 and 32 bits depending on the underlying > operating system (although unsigned short does seems to work across > platforms, in my experience). You mean with the compiler applying the needed 16->32 bit extension ? > 3. UTF-16 has some advantages in providing fixed width characters and, > (ignoring surrogate pairs etc) a modeless encoding space. This is an > advantage for fast string operations, especially on CPU's that have > efficient operations for handling 16bit data. Right and this is major argument for using 16 bit encodings without state internally. > 4. UTF-16 would directly support a tightly coupled character properties > engine, which would enable Unicode compliant case folding and character > decomposition to be performed without an intermediate UTF-8 <----> UTF-16 > translation step. Could you elaborate on this one ? It is one of the open issues in the proposal. > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. AFAIK, the RE engines in Python are 8-bit clean... BTW, wouldn't it be possible to take pcre and have it use Py_Unicode instead of char ? [Of course, there would have to be some extensions for character classes etc.] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Fri Nov 12 14:43:12 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 14:43:12 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... why? I don't understand why "s" and "s#" has to deal with encoding issues at all... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. hmm. maybe that's exactly what I want... </F> From fdrake at acm.org Fri Nov 12 15:34:56 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. 
Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there it's a "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Fri Nov 12 15:57:15 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Access to this mark will go into sys: sys.bom. Can the name in sys be a little more descriptive? sys.byte_order_mark would be reasonable. I think that a support module (possibly unicodec) should provide constants for all four byte order marks as strings (2- & 4-byte, little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, etc. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fredrik at pythonware.com Fri Nov 12 16:00:45 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 16:00:45 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com> Fred L. Drake, Jr. <fdrake at acm.org> wrote: > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings. from unicode import * def getname(): # hidden in some database engine, or so... return unicode("Link?ping", "iso-8859-1") ... name = getname() # emulate automatic conversion to utf-8 name = str(name) # print it in uppercase, in the usual way import string print string.upper(name) ## LINK??PING I don't know, but I think that I think that it perhaps should raise an exception instead... </F> From mal at lemburg.com Fri Nov 12 16:17:43 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:17:43 +0100 Subject: [Python-Dev] just say no... 
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> Message-ID: <382C2F97.8E7D7A4D@lemburg.com> Fredrik Lundh wrote: > > > > > Besides, the Unicode object will have a buffer containing the > > > > <default encoding> representation of the object, which, if all goes > > > > well, will always hold the UTF-8 value. > > > > > > <rant> > > > > > > over my dead body, that one... > > > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > why? I don't understand why "s" and "s#" has > to deal with encoding issues at all... > > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > hmm. maybe that's exactly what I want... If we don't add that support, lot's of existing APIs won't accept Unicode object instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 16:50:33 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:50:33 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> Message-ID: <382C3749.198EEBC6@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Access to this mark will go into sys: sys.bom. > > Can the name in sys be a little more descriptive? > sys.byte_order_mark would be reasonable. The abbreviation BOM is quite common w/r to Unicode. > I think that a support module (possibly unicodec) should provide > constants for all four byte order marks as strings (2- & 4-byte, > little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, > etc. Good idea... sys.bom should return the byte order mark (BOM) for the format used internally. The unicodec module should provide symbols for all possible values of this variable: BOM_BE: '\376\377' (corresponds to Unicode 0x0000FEFF in UTF-16 == ZERO WIDTH NO-BREAK SPACE) BOM_LE: '\377\376' (corresponds to Unicode 0x0000FFFE in UTF-16 == illegal Unicode character) BOM4_BE: '\000\000\377\376' (corresponds to Unicode 0x0000FEFF in UCS-4) BOM4_LE: '\376\377\000\000' (corresponds to Unicode 0x0000FFFE in UCS-4) Note that Unicode sees big endian byte order as being "correct". The swapped order is taken to be an indicator for a "wrong" format, hence the illegal character definition. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 16:24:33 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:24:33 +0100 Subject: [Python-Dev] just say no... 
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <382C3131.A8965CA5@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings. Was this agreed upon, or has it simply > not been argued over in favor of other topics? It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing script Unicode aware. > If this has indeed been agreed upon... at least it can be computed > on demand rather than at initialization! This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s"). > Perhaps there should be two > pointers: one to the UTF-8 buffer and one to a PyObject; if the > PyObject is there it's a "old-style" string that's actually providing > the buffer. This may or may not be a good idea; there's a lot of > memory expense for long Unicode strings converted from UTF-8 that > aren't ever converted back to UTF-8 or accessed using "s" or "s#". > Ok, I've talked myself out of that. ;-) Note that Unicode object are completely different beast ;-) String object are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Fri Nov 12 17:22:24 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The abbreviation BOM is quite common w/r to Unicode. Yes: "w/r to Unicode". In sys, it's out of context and should receive a more descriptive name. I think using BOM in unicodec is good. > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) I'd also add BOM to be the same as sys.byte_order_mark. Perhaps even instead of sys.byte_order_mark (just to localize the areas of code that are affected). > Note that Unicode sees big endian byte order as being "correct". The A lot of us do. ;-) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Fri Nov 12 17:28:37 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST) Subject: [Python-Dev] just say no... 
In-Reply-To: <382C3131.A8965CA5@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > It's been in the proposal since version 0.1. The idea is to > provide a decent way of making existing script Unicode aware. Ok, so I haven't read closely enough. > This is what I intended to implement. The <defencbuf> buffer > will be filled upon the first request to the UTF-8 encoding. > "s" and "s#" are examples of such requests. The buffer will > remain intact until the object is destroyed (since other code > could store the pointer received via e.g. "s"). Right. > Note that Unicode object are completely different beast ;-) > String object are not touched in any way by the proposal. I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider: s = fp.read() u = unicode(s, 'utf-8') u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From jack at oratrix.nl Fri Nov 12 17:33:46 1999 From: jack at oratrix.nl (Jack Jansen) Date: Fri, 12 Nov 1999 17:33:46 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com> Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl> The problem with "s" and "s#" is that they're already semantically overloaded, and will become more so with support for multiple charsets. Some modules use "s#" when they mean "give me a pointer to an area of memory and its length". Writing to binary files is an example of this. Some modules use it to mean "give me a pointer to a string". Writing to a text file is (probably) an example of this. Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This is the case if we're going to actually look at the contents (think of string.upper() and such). I think that the only real solution is to define what "s" means, come up with new getarg-formats for the other two use cases and convert all modules to use the new standard. It'll still cause grief to extension modules that aren't part of the core, but at least the problem will go away after a while. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Fri Nov 12 19:36:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:36:55 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <382C5E47.21FB4DD@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > It's been in the proposal since version 0.1. 
The idea is to > > provide a decent way of making existing script Unicode aware. > > Ok, so I haven't read closely enough. > > > This is what I intended to implement. The <defencbuf> buffer > > will be filled upon the first request to the UTF-8 encoding. > > "s" and "s#" are examples of such requests. The buffer will > > remain intact until the object is destroyed (since other code > > could store the pointer received via e.g. "s"). > > Right. > > > Note that Unicode object are completely different beast ;-) > > String object are not touched in any way by the proposal. > > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 23:19:15 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121417530.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Mark Hammond wrote: > Couldnt we start with Fredriks implementation, and see how the rest > turns out? Even if we do choose to change the underlying Unicode > implementation to use a different native encoding, the interface to > the PyUnicode_Type would remain pretty similar. The advantage is that > we have something now to start working with for the rest of the > support we need. I agree with "start with" here, and will go one step further (which Mark may have implied) -- *check in* Fredrik's code. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 23:59:03 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Bull! You can easily support "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxilliary buffer. 
You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:05:11 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl> Message-ID: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> This was done last year!! We have "s#" meaning "give me some bytes." We have "t#" meaning "give me some 8-bit characters." The Python distribution has been completely updated to use the appropriate format in each call. The was done *specifically* to support the introduction of a Unicode type. The intent was that "s#" returns the *raw* bytes of the Unicode string -- NOT a UTF-8 encoding! As a separate argument, MAL can argue that "t#" should create an internal, associated buffer to hold a UTF-8 encoding and then return that. But the "s#" should return the raw bytes! [ and I'll argue against the response to "t#" anyhow... ] -g On Fri, 12 Nov 1999, Jack Jansen wrote: > The problem with "s" and "s#" is that they're already semantically > overloaded, and will become more so with support for multiple charsets. > > Some modules use "s#" when they mean "give me a pointer to an area of memory > and its length". Writing to binary files is an example of this. > > Some modules use it to mean "give me a pointer to a string". Writing to a text > file is (probably) an example of this. > > Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This > is the case if we're going to actually look at the contents (think of > string.upper() and such). > > I think that the only real solution is to define what "s" means, come up with > new getarg-formats for the other two use cases and convert all modules to use > the new standard. It'll still cause grief to extension modules that aren't > part of the core, but at least the problem will go away after a while. > -- > Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ > Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ > www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm > > > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev > -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:09:13 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: >... > > why? I don't understand why "s" and "s#" has > > to deal with encoding issues at all... > > > > > unless, of course, you want to give up Unicode object support > > > for all APIs using these parsers. > > > > hmm. maybe that's exactly what I want... > > If we don't add that support, lot's of existing APIs won't > accept Unicode object instead of strings. While it could be > argued that automatic conversion to UTF-8 is not transparent > enough for the user, the other solution of using str(u) > everywhere would probably make writing Unicode-aware code a > rather clumsy task and introduce other pitfalls, since str(obj) > calls PyObject_Str() which also works on integers, floats, > etc. No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. 
They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:26:08 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. True. > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. I agree and believe that we can avoid putting it into sys altogether. > > BOM_BE: '\376\377' > > (corresponds to Unicode 0x0000FEFF in UTF-16 > > == ZERO WIDTH NO-BREAK SPACE) Are you sure about that interpretation? I thought the BOM characters (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. > I'd also add BOM to be the same as sys.byte_order_mark. Perhaps > even instead of sys.byte_order_mark (just to localize the areas of > code that are affected). ### unicodec.py ### import struct BOM = struct.pack('h', 0x0000FEFF) BOM_BE = '\376\377' ... If somebody needs the BOM, then they should go to unicodec.py (or some other module). I do not believe we need to put that stuff into the sys module. It is just too easy to create the value in Python. Cheers, -g p.s. to be pedantic, the pack() format could be '@h' -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Sat Nov 13 00:41:16 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 10:41:16 +1100 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat> [Greg writes] > As a separate argument, MAL can argue that "t#" should create > an internal, > associated buffer to hold a UTF-8 encoding and then return > that. But the > "s#" should return the raw bytes! > [ and I'll argue against the response to "t#" anyhow... ] Hmm. Climbing over these dead bodies could get a bit smelly :-) Im inclined to agree that holding 2 internal buffers for the unicode object is not ideal. However, I _am_ concerned with getting decent PyArg_ParseTuple and Py_BuildValue support, and if the cost is an extra buffer I will survive. So lets look for solutions that dont require it, rather than holding it up as evil when no other solution is obvious. My requirements appear to me to be very simple (for an anglophile): Lets say I have a platform Unicode value - eg, I got a Unicode value from some external library (say COM :-) Lets assume for now that the Unicode string is fully representable as ASCII - say a file or directory name that COM gave me. I simply want to be able to pass this Unicode object to "open()", and have it work. This assumes that open() will not become "native unicode", simply as the underlying C support is not unicode aware - it needs to be converted to a "char *" (ie, will use the "t#" format) The second side of the equation is when I expose a Python function that talks Unicode - eg, I need to _pass_ a platform Unicode value to an external library. 
The Python programmer should be able to pass a Unicode object (no problem), or a PyString object. In code terms: Prob1: name = SomeComObject.GetFileName() # A Unicode object f = open(name) Prob2: SomeComObject.SetFileName("foo.txt") IMO it is important that we have a good strategy for dealing with this for extensions. MAL addresses one direction, but not the other. Maybe if we toss around general solutions for this the implementation will fall out. MALs idea of the additional buffer starts to address this, but isnt the whole story. Any ideas on this? From gstein at lyra.org Sat Nov 13 01:49:34 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST) Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Mark Hammond wrote: >... > Im inclined to agree that holding 2 internal buffers for the unicode > object is not ideal. However, I _am_ concerned with getting decent > PyArg_ParseTuple and Py_BuildValue support, and if the cost is an > extra buffer I will survive. So lets look for solutions that dont > require it, rather than holding it up as evil when no other solution > is obvious. I believe Py_BuildValue is pretty straight-forward. Simply state that it is allowed to perform conversions and place the resulting object into the resulting tuple. (with appropriate refcounting) In other words: tuple = Py_BuildValue("U", stringOb); The stringOb will be converted to a Unicode object. The new Unicode object will go into the tuple (with the tuple holding the only reference!). The stringOb will NOT acquire any additional references. [ "U" format may be wrong; it is here for example purposes ] Okay... now the PyArg_ParseTuple() is the *real* kicker. >... > Prob1: > name = SomeComObject.GetFileName() # A Unicode object > f = open(name) > Prob2: > SomeComObject.SetFileName("foo.txt") Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a string-like object which can be passed to the OS as an 8-bit string. In Prob2, you want a string-like object which can be passed to the OS as a Unicode string. I see three options for PyArg_ParseTuple: 1) allow it to return NEW objects which must be DECREF'd. [ current policy only loans out references ] This option could be difficult in the presence of errors during the parse. For example, the current idiom is: if (!PyArg_ParseTuple(args, "...")) return NULL; If an object was produced, but then a later argument cause a failure, then who is responsible for freeing the object? 2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new objects when an error occurred. This basically answers the last question in option (1) -- ParseTuple is responsible. 3) Return loaned-out-references to objects which have been tested for convertability. Helper functions perform the conversion and the caller will then free the reference. [ this is the model used in PyWin32 ] Code in PyWin32 typically looks like: if (!PyArg_ParseTuple(args, "O", &ob)) return NULL; if ((unicodeOb = GiveMeUnicode(ob)) == NULL) return NULL; ... Py_DECREF(unicodeOb); [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ] In a "real" situation, the ParseTuple format would be "U" and the object would be type-tested for PyStringType or PyUnicodeType. Note that GiveMeUnicode() would also do a type-test, but it can't produce a *specific* error like ParseTuple (e.g. 
"string/unicode object expected" vs "parameter 3 must be a string/unicode object") Are there more options? Anybody? All three of these avoid the secondary buffer. The last is cleanest w.r.t. to keeping the existing "loaned references" behavior, but can get a bit wordy when you need to convert a bunch of string arguments. Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it would need to keep a "free list" in case an error occurred. Option (1) adds DECREF logic to callers to ensure they clean up. The add'l logic isn't much more than the other two options (the only change is adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..." condition). Note that the caller would probably need to initialize each object to NULL before calling ParseTuple. Personally, I prefer (3) as it makes it very clear that a new object has been created and must be DECREF'd at some point. Also note that GiveMeUnicode() could also accept a second argument for the type of decoding to do (or NULL meaning "UTF-8"). Oh: note there are equivalents of all options for going from unicode-to-string; the above is all about string-to-unicode. However, the tricky part of unicode-to-string is determining whether backwards compatibility will be a requirement. i.e. does existing code that uses the "t" format suddenly achieve the capability to accept a Unicode object? This obviously causes problems in all three options: since a new reference must be created to handle the situation, then who DECREF's it? The old code certainly doesn't. [ <IMO> I'm with Fredrik in saying "no, old code *doesn't* suddenly get the ability to accept a Unicode object." The Python code must use str() to do the encoding manually (until the old code is upgraded to one of the above three options). </IMO> ] I think that's it for me. In the several years I've been thinking on this problem, I haven't come up with anything but the above three. There may be a whole new paradigm for argument parsing, but I haven't tried to think on that one (and just fit in around ParseTuple). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Fri Nov 12 19:49:52 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:49:52 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <382C6150.53BDC803@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. > > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. Guido proposed to add it to sys. I originally had it defined in unicodec. Perhaps a sys.endian would be more appropriate for sys with values 'little' and 'big' or '<' and '>' to be conform to the struct module. unicodec could then define unicodec.bom depending on the setting in sys. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Sat Nov 13 10:37:35 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 10:37:35 +0100 Subject: [Python-Dev] just say no... 
References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> Message-ID: <382D315F.A7ADEC42@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > Fredrik Lundh wrote: > >... > > > why? I don't understand why "s" and "s#" has > > > to deal with encoding issues at all... > > > > > > > unless, of course, you want to give up Unicode object support > > > > for all APIs using these parsers. > > > > > > hmm. maybe that's exactly what I want... > > > > If we don't add that support, lot's of existing APIs won't > > accept Unicode object instead of strings. While it could be > > argued that automatic conversion to UTF-8 is not transparent > > enough for the user, the other solution of using str(u) > > everywhere would probably make writing Unicode-aware code a > > rather clumsy task and introduce other pitfalls, since str(obj) > > calls PyObject_Str() which also works on integers, floats, > > etc. > > No no no... > > "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are > supposed to return the raw bytes. [I've waited quite some time for you to chime in on this one ;-)] Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer: First, we have a general design question here: should old code become Unicode compatible or not. As I recall the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come with a resonable approach for integrating classical string based APIs with the new type. Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose. However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using a 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text. So if we are to do the integration, we'll have to choose argument parser markers that allow for multi byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is argueable. Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would loose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes. Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is. Defining "s#" to return UTF-8 data does not only make "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16). If you are still with, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). 
The other options of adding new magic parser markers or switching to more generic one all have one downside: you need to change existing code which is in conflict with the idea we started out with. So, again, the question is: do we want this magical intergration or not ? Note that this is a design question, not one of memory consumption... -- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though: PyArg_ParseTuple(args,"s#",&utf8,&len); If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach: PyArg_ParseTuple(args,"O",&obj); unicode = PyUnicode_FromObject(obj); ... Py_DECREF(unicode); Here PyUnicode_FromObject() will return a new reference if obj is an Unicode object or create a new Unicode object by interpreting str(obj) as UTF-8 encoded string. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Sat Nov 13 13:12:41 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 07:12:41 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST." <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > to deal with :-) I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Sat Nov 13 13:23:54 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911130423400.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Guido van Rossum wrote: > > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > > to deal with :-) > > I haven't made up my mind yet (due to a very successful > Python-promoting visit to SD'99 east, I'm about 100 msgs behind in > this thread alone) but let me warn you that I can deal with the > carnage, if necessary. :-) Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Sat Nov 13 13:52:18 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 23:52:18 +1100 Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat> [Lamenting about PyArg_ParseTuple and managing memory buffers for String/Unicode conversions.] So what is really wrong with Marc's proposal about the extra pointer on the Unicode object? And to double the carnage, who not add the equivilent native Unicode buffer to the PyString object? 
These would only ever be filled when requested by the conversion routines. They have no other effect than their memory is managed by the object itself; simply a convenience to avoid having extension modules manage the conversion buffers. The only overheads appear to be: * The conversion buffers may be slightly (or much :-) longer-lived - ie, they are not freed until the object itself is freed. * String object slightly bigger, and slightly slower to destroy. It appears to solve the problems, and the cost doesnt seem too high... Mark. From guido at CNRI.Reston.VA.US Sat Nov 13 14:06:26 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 08:06:26 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100." <382D315F.A7ADEC42@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us> I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL] > Let me summarize a bit on the general ideas behind "s", "s#" > and the extra buffer: I think you left out t#. > First, we have a general design question here: should old code > become Unicode compatible or not. As I recall the original idea > about Unicode integration was to follow Perl's idea to have > scripts become Unicode aware by simply adding a 'use utf8;'. I've never heard of this idea before -- or am I taking it too literal? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide. > If this is still the case, then we'll have to come with a > resonable approach for integrating classical string based > APIs with the new type. > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > the Latin-1 folks) which has some very nice features (see > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > this encoding seems best fit for the purpose. Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week, hopefully when I see the details, it'll be clear that don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.) > However, one should not forget that UTF-8 is in fact a > variable length encoding of Unicode characters, that is up to > 3 bytes form a *single* character. This is obviously not compatible > with definitions that explicitly state data to be using a > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > work like it does in Latin-1 text. Sure, but where in current Python are there such requirements? > So if we are to do the integration, we'll have to choose > argument parser markers that allow for multi byte characters. > "t#" does not fall into this category, "s#" certainly does, > "s" is argueable. I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). 
Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#. Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes. Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation. Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best): - utf-8 - latin-1 - ascii - shift-jis - lower byte of unicode ordinal - some user- or os-specified multibyte encoding As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8. If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.) > Also note that we have to watch out for embedded NULL bytes. > UTF-16 has NULL bytes for every character from the Latin-1 > domain. If "s" were to give back a pointer to the internal > buffer which is encoded in UTF-16, you would loose data. > UTF-8 doesn't have this problem, since only NULL bytes > map to (single) NULL bytes. This is a red herring given my explanation above. > Now Greg would chime in with the buffer interface and > argue that it should make the underlying internal > format accessible. This is a bad idea, IMHO, since you > shouldn't really have to know what the internal data format > is. This is for C code. Quite likely it *does* know what the internal data format is! > Defining "s#" to return UTF-8 data does not only > make "s" and "s#" return the same data format (which should > always be the case, IMO), That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely. > but also hides the internal > format from the user and gives him a reliable cross-platform > data representation of Unicode data (note that UTF-8 doesn't > have the byte order problems of UTF-16). > > If you are still with, let's look at what "s" and "s#" (and t#, which is more relevant here) > do: they return pointers into data areas which have to > be kept alive until the corresponding object dies. 
> > The only way to support this feature is by allocating > a buffer for just this purpose (on the fly and only if > needed to prevent excessive memory load). The other > options of adding new magic parser markers or switching > to more generic one all have one downside: you need to > change existing code which is in conflict with the idea > we started out with. Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding. > So, again, the question is: do we want this magical > intergration or not ? Note that this is a design question, > not one of memory consumption... Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( ) > -- > > Ok, the above covered Unicode -> String conversion. Mark > mentioned that he wanted the other way around to also > work in the same fashion, ie. automatic String -> Unicode > conversion. > > This could also be done in the same way by > interpreting the string as UTF-8 encoded Unicode... but we > have the same problem: where to put the data without > generating new intermediate objects. Since only newly > written code will use this feature there is a way to do > this though: > > PyArg_ParseTuple(args,"s#",&utf8,&len); No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue. > If your C API understands UTF-8 there's nothing more to do, > if not, take Greg's option 3 approach: > > PyArg_ParseTuple(args,"O",&obj); > unicode = PyUnicode_FromObject(obj); > ... > Py_DECREF(unicode); > > Here PyUnicode_FromObject() will return a new > reference if obj is an Unicode object or create a new > Unicode object by interpreting str(obj) as UTF-8 encoded string. This might work. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Sat Nov 13 14:06:35 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 14:06:35 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.5 References: <382C0A54.E6E8328D@lemburg.com> Message-ID: <382D625B.DC14DBDE@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates proposals for line breaks, case mapping, character properties and private code points support. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>: s = '%s %i abc???' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) ? 
specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack at oratrix.nl Sat Nov 13 17:40:34 1999 From: jack at oratrix.nl (Jack Jansen) Date: Sat, 13 Nov 1999 17:40:34 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by Greg Stein <gstein@lyra.org> , Fri, 12 Nov 1999 15:05:11 -0800 (PST) , <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Recently, Greg Stein <gstein at lyra.org> said: > This was done last year!! We have "s#" meaning "give me some bytes." We > have "t#" meaning "give me some 8-bit characters." The Python distribution > has been completely updated to use the appropriate format in each call. Oops... I remember the discussion but I wasn't aware that somone had actually _implemented_ this:-). Part of my misunderstanding was also caused by the fact that I inspected what I thought would be the prime candidate for t#: file.write() to a non-binary file, and it doesn't use the new format. I also noted a few inconsistencies at first glance, by the way: most modules seem to use s# for things like filenames and other data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an exception and it uses t# for uuencoded strings... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From guido at CNRI.Reston.VA.US Sat Nov 13 20:20:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 14:20:51 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100." <19991113164039.9B697EA11A@oratrix.oratrix.nl> References: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us> > I remember the discussion but I wasn't aware that somone had actually > _implemented_ this:-). Part of my misunderstanding was also caused by > the fact that I inspected what I thought would be the prime candidate > for t#: file.write() to a non-binary file, and it doesn't use the new > format. I guess that's because file.write() doesn't distinguish between text and binary files. Maybe it should: the current implementation together with my proposed semantics for Unicode strings would mean that printing a unicode string (to stdout) would dump the internal encoding to the file. I guess it should do so only when the file is opened in binary mode; for files opened in text mode it should use an encoding (opening a file can specify an encoding; can we change the encoding of an existing file?). > I also noted a few inconsistencies at first glance, by the way: most > modules seem to use s# for things like filenames and other > data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an > exception and it uses t# for uuencoded strings... Actually, binascii seems to do it right: s# for binary data, t# for text (uuencoded, hqx, base64). That is, the b2a variants use s# while the a2b variants use t#. 
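A minimal, hypothetical sketch of the s#/t# convention described above (this is not the actual binascii source and the function names are invented): one entry point takes raw bytes via "s#", the other takes 8-bit text via "t#".

#include "Python.h"

/* takes raw bytes: any read-buffer object is acceptable, no text assumption */
static PyObject *
example_b2a(PyObject *self, PyObject *args)
{
    char *data;
    int len;
    if (!PyArg_ParseTuple(args, "s#", &data, &len))
        return NULL;
    return PyString_FromStringAndSize(data, len);   /* just echo the bytes */
}

/* takes 8-bit text: goes through the getcharbuffer slot introduced for "t#" */
static PyObject *
example_a2b(PyObject *self, PyObject *args)
{
    char *text;
    int len;
    if (!PyArg_ParseTuple(args, "t#", &text, &len))
        return NULL;
    return PyInt_FromLong((long)len);               /* just report the length */
}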
The only thing I'm not sure about in that module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I don't understand where these stand in the complexity of binhex en/decoding. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Sun Nov 14 23:11:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sun, 14 Nov 1999 23:11:54 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> Message-ID: <382F33AA.C3EE825A@lemburg.com> Guido van Rossum wrote: > > I think I have a reasonable grasp of the issues here, even though I > still haven't read about 100 msgs in this thread. Note that t# and > the charbuffer addition to the buffer API were added by Greg Stein > with my support; I'll attempt to reconstruct our thinking at the > time... > > [MAL] > > Let me summarize a bit on the general ideas behind "s", "s#" > > and the extra buffer: > > I think you left out t#. On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined. > > First, we have a general design question here: should old code > > become Unicode compatible or not. As I recall the original idea > > about Unicode integration was to follow Perl's idea to have > > scripts become Unicode aware by simply adding a 'use utf8;'. > > I've never heard of this idea before -- or am I taking it too literal? > It smells of a mode to me :-) I'd rather live in a world where > Unicode just works as long as you use u'...' literals or whatever > convention we decide. > > > If this is still the case, then we'll have to come with a > > resonable approach for integrating classical string based > > APIs with the new type. > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > the Latin-1 folks) which has some very nice features (see > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > this encoding seems best fit for the purpose. > > Yes, especially if we fix the default encoding as UTF-8. (I'm > expecting feedback from HP on this next week, hopefully when I see the > details, it'll be clear that don't need a per-thread default encoding > to solve their problems; that's quite a likely outcome. If not, we > have a real-world argument for allowing a variable default encoding, > without carnage.) Fair enough :-) > > However, one should not forget that UTF-8 is in fact a > > variable length encoding of Unicode characters, that is up to > > 3 bytes form a *single* character. This is obviously not compatible > > with definitions that explicitly state data to be using a > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > work like it does in Latin-1 text. > > Sure, but where in current Python are there such requirements? It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at... 
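A small, hypothetical helper (the name utf8_char_count is invented) makes the concern concrete: if "t#" were to hand C code a UTF-8 buffer, the byte count would no longer equal the number of characters, which is exactly the single-byte assumption being questioned here.

/* count characters in a UTF-8 buffer: continuation bytes look like
   10xxxxxx, every other byte starts a new character */
static int
utf8_char_count(const char *s, int nbytes)
{
    int i, chars = 0;
    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* e.g. the two-byte UTF-8 sequence "\303\244" encodes one character,
   so nbytes == 2 while utf8_char_count() returns 1 */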
> > So if we are to do the integration, we'll have to choose > > argument parser markers that allow for multi byte characters. > > "t#" does not fall into this category, "s#" certainly does, > > "s" is argueable. > > I disagree. I grepped through the source for s# and t#. Here's a bit > of background. Before t# was introduced, s# was being used for two > distinct purposes: (1) to get an 8-bit text string plus its length, in > situations where the length was needed; (2) to get binary data (e.g. > GIF data read from a file in "rb" mode). Greg pointed out that if we > ever introduced some form of Unicode support, these two had to be > disambiguated. We found that the majority of uses was for (2)! > Therefore we decided to change the definition of s# to mean only (2), > and introduced t# to mean (1). Also, we introduced getcharbuffer > corresponding to t#, while getreadbuffer was meant for s#. I know its too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementations point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something... > Note that the definition of the 's' format was left alone -- as > before, it means you need an 8-bit text string not containing null > bytes. This definition should then be changed to "text string without null bytes" dropping the 8-bit reference. > Our expectation was that a Unicode string passed to an s# situation > would give a pointer to the internal format plus a byte count (not a > character count!) while t# would get a pointer to some kind of 8-bit > translation/encoding plus a byte count, with the explicit requirement > that the 8-bit translation would have the same lifetime as the > original unicode object. We decided to leave it up to the next > generation (i.e., Marc-Andre :-) to decide what kind of translation to > use and what to do when there is no reasonable translation. Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal). > Any of the following choices is acceptable (from the point of view of > not breaking the intended t# semantics; we can now start deciding > which we like best): I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons. > - utf-8 > - latin-1 > - ascii > - shift-jis > - lower byte of unicode ordinal > - some user- or os-specified multibyte encoding > > As far as t# is concerned, for encodings that don't encode all of > Unicode, untranslatable characters could be dealt with in any number > of ways (raise an exception, ignore, replace with '?', make best > effort, etc.). The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.) > Given the current context, it should probably be the same as the > default encoding -- i.e., utf-8. 
If we end up making the default > user-settable, we'll have to decide what to do with untranslatable > characters -- but that will probably be decided by the user too (it > would be a property of a specific translation specification). > > In any case, I feel that t# could receive a multi-byte encoding, > s# should receive raw binary data, and they should correspond to > getcharbuffer and getreadbuffer, respectively. Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below). > > Now Greg would chime in with the buffer interface and > > argue that it should make the underlying internal > > format accessible. This is a bad idea, IMHO, since you > > shouldn't really have to know what the internal data format > > is. > > This is for C code. Quite likely it *does* know what the internal > data format is! C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length. > > Defining "s#" to return UTF-8 data does not only > > make "s" and "s#" return the same data format (which should > > always be the case, IMO), > > That was before t# was introduced. No more, alas. If you replace s# > with t#, I agree with you completely. Done :-) > > but also hides the internal > > format from the user and gives him a reliable cross-platform > > data representation of Unicode data (note that UTF-8 doesn't > > have the byte order problems of UTF-16). > > > > If you are still with, let's look at what "s" and "s#" > > (and t#, which is more relevant here) > > > do: they return pointers into data areas which have to > > be kept alive until the corresponding object dies. > > > > The only way to support this feature is by allocating > > a buffer for just this purpose (on the fly and only if > > needed to prevent excessive memory load). The other > > options of adding new magic parser markers or switching > > to more generic one all have one downside: you need to > > change existing code which is in conflict with the idea > > we started out with. > > Agreed. I think this was our thinking when Greg & I introduced t#. > My own preference would be to allocate a whole string object, not > just a buffer; this could then also be used for the .encode() method > using the default encoding. Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request. > > So, again, the question is: do we want this magical > > intergration or not ? Note that this is a design question, > > not one of memory consumption... > > Yes, I want it. > > Note that this doesn't guarantee that all old extensions will work > flawlessly when passed Unicode objects; but I think that it covers > most cases where you could have a reasonable expectation that it > works. > > (Hm, unfortunately many reasonable expectations seem to involve > the current user's preferred encoding. 
:-( ) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From amk1 at erols.com Mon Nov 15 02:49:08 1999 From: amk1 at erols.com (A.M. Kuchling) Date: Sun, 14 Nov 1999 20:49:08 -0500 Subject: [Python-Dev] PyErr_Format security note Message-ID: <199911150149.UAA00408@mira.erols.com> I noticed this in PyErr_Format(exception, format, va_alist): char buffer[500]; /* Caller is responsible for limiting the format */ ... vsprintf(buffer, format, vargs); Making the caller responsible for this is error-prone. The danger, of course, is a buffer overflow caused by generating an error string that's larger than the buffer, possibly letting people execute arbitrary code. We could add a test to the configure script for vsnprintf() and use it when possible, but that only fixes the problem on platforms which have it. Can we find an implementation of vsnprintf() someplace? -- A.M. Kuchling http://starship.python.net/crew/amk/ One form to rule them all, one form to find them, one form to bring them all and in the darkness rewrite the hell out of them. -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3 From gstein at lyra.org Mon Nov 15 03:11:39 1999 From: gstein at lyra.org (Greg Stein) Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911150149.UAA00408@mira.erols.com> Message-ID: <Pine.LNX.4.10.9911141807390.2535-100000@nebula.lyra.org> On Sun, 14 Nov 1999, A.M. Kuchling wrote: > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Apache has a safe implementation (they have reviewed the heck out of it for obvious reasons :-). In the Apache source distribution, it is located in src/ap/ap_snprintf.c. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 15 09:09:07 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 09:09:07 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <382FBFA3.B28B8E1E@lemburg.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? 
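A minimal sketch of the kind of bounds-checked formatting being asked for, assuming the platform provides vsnprintf(); the function name my_err_format is invented and this is not the actual CPython implementation:

#include <stdarg.h>
#include <stdio.h>
#include "Python.h"

PyObject *
my_err_format(PyObject *exception, const char *format, ...)
{
    char buffer[500];
    va_list vargs;

    va_start(vargs, format);
    /* never writes more than sizeof(buffer) bytes; the message is
       truncated instead of overrunning the stack */
    vsnprintf(buffer, sizeof(buffer), format, vargs);
    buffer[sizeof(buffer) - 1] = '\0';
    va_end(vargs);

    PyErr_SetString(exception, buffer);
    return NULL;   /* like PyErr_Format, returns NULL for the caller's convenience */
}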
In sysmodule.c, this check is done which should be safe enough since no "return" is issued (Py_FatalError() does an abort()): if (vsprintf(buffer, format, va) >= sizeof(buffer)) Py_FatalError("PySys_WriteStdout/err: buffer overrun"); -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Mon Nov 15 10:28:06 1999 From: gstein at lyra.org (Greg Stein) Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: >... > In sysmodule.c, this check is done which should be safe enough > since no "return" is issued (Py_FatalError() does an abort()): > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); I believe the return from vsprintf() itself would be the problem. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 15 10:49:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 10:49:26 +0100 Subject: [Python-Dev] PyErr_Format security note References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> Message-ID: <382FD726.6ACB912F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > >... > > In sysmodule.c, this check is done which should be safe enough > > since no "return" is issued (Py_FatalError() does an abort()): > > > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); > > I believe the return from vsprintf() itself would be the problem. Ouch, yes, you are right... but who could exploit this security hole ? Since PyErr_Format() is only reachable for C code, only bad programming style in extensions could make it exploitable via user input. Wouldn't it be possible to assign thread globals for these functions to use ? These would live on the heap instead of on the stack and eliminate the buffer overrun possibilities (I guess -- I don't have any experience with these...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Mon Nov 15 16:17:58 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FD726.6ACB912F@lemburg.com> References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> <382FD726.6ACB912F@lemburg.com> Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us> M.-A. Lemburg writes: >Ouch, yes, you are right... but who could exploit this security >hole ? Since PyErr_Format() is only reachable for C code, only >bad programming style in extensions could make it exploitable >via user input. 99% of security holes arise out of carelessness, and besides, this buffer size doesn't seem to be documented in either api.tex or ext.tex. I'll look into borrowing Apache's implementation and modifying it into a varargs form. -- A.M. Kuchling http://starship.python.net/crew/amk/ I can also withstand considerably more G-force than most people, even though I do say so myself. 
-- The Doctor, in "The Ambassadors of Death" From guido at CNRI.Reston.VA.US Mon Nov 15 16:23:57 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:23:57 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST." <199911150149.UAA00408@mira.erols.com> References: <199911150149.UAA00408@mira.erols.com> Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us> > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. Agreed. The limit of 500 chars, while technically undocumented, is part of the specs for PyErr_Format (which is currently wholly undocumented). The current callers all have explicit precautions, but of course I agree that this is a potential danger. > The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Assuming that Linux and Solaris have vsnprintf(), can't we just use the configure script to detect it, and issue a warning blaming the platform for those platforms that don't have it? That seems much simpler (from a maintenance perspective) than carrying our own implementation around (even if we can borrow the Apache version). --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Mon Nov 15 16:24:27 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C6150.53BDC803@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> <382C6150.53BDC803@lemburg.com> Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Guido proposed to add it to sys. I originally had it defined in > unicodec. Well, he clearly didn't ask me! ;-) > Perhaps a sys.endian would be more appropriate for sys > with values 'little' and 'big' or '<' and '>' to be conform > to the struct module. > > unicodec could then define unicodec.bom depending on the setting > in sys. This seems more reasonable, though I'd go with BOM instead of bom. But that's a style issue, so not so important. If your write bom, I'll write bom. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From captainrobbo at yahoo.com Mon Nov 15 16:30:45 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Some thoughts on the codecs... 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. 
This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution? Thanks, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From akuchlin at mems-exchange.org Mon Nov 15 16:36:47 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us> Guido van Rossum writes: >Assuming that Linux and Solaris have vsnprintf(), can't we just use >the configure script to detect it, and issue a warning blaming the >platform for those platforms that don't have it? 
That seems much But people using an already-installed Python binary won't see any such configure-time warning, and won't find out about the potential problem. Plus, how do people fix the problem on platforms that don't have vsnprintf() -- switch to Solaris or Linux? Not much of a solution. (vsnprintf() isn't ANSI C, though it's a common extension, so platforms that lack it aren't really deficient.) Hmm... could we maybe use Python's existing (string % vars) machinery? <think think> No, that seems to be hard, because it would want PyObjects, and we can't know what Python types to convert the varargs to, unless we parse the format string (at which point we may as well get a vsnprintf() implementation. -- A.M. Kuchling http://starship.python.net/crew/amk/ A successful tool is one that was used to do something undreamed of by its author. -- S.C. Johnson From guido at CNRI.Reston.VA.US Mon Nov 15 16:50:24 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:50:24 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100." <382F33AA.C3EE825A@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us> > On purpose -- according to my thinking. I see "t#" as an interface > to bf_getcharbuf which I understand as 8-bit character buffer... > UTF-8 is a multi byte encoding. It still is character data, but > not necessarily 8 bits in length (up to 24 bits are used). > > Anyway, I'm not really interested in having an argument about > this. If you say, "t#" fits the purpose, then that's fine with > me. Still, we should clearly define that "t#" returns > text data and "s#" binary data. Encoding, bit length, etc. should > explicitly remain left undefined. Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view. > > > First, we have a general design question here: should old code > > > become Unicode compatible or not. As I recall the original idea > > > about Unicode integration was to follow Perl's idea to have > > > scripts become Unicode aware by simply adding a 'use utf8;'. > > > > I've never heard of this idea before -- or am I taking it too literal? > > It smells of a mode to me :-) I'd rather live in a world where > > Unicode just works as long as you use u'...' literals or whatever > > convention we decide. > > > > > If this is still the case, then we'll have to come with a > > > resonable approach for integrating classical string based > > > APIs with the new type. > > > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > > the Latin-1 folks) which has some very nice features (see > > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > > this encoding seems best fit for the purpose. > > > > Yes, especially if we fix the default encoding as UTF-8. (I'm > > expecting feedback from HP on this next week, hopefully when I see the > > details, it'll be clear that don't need a per-thread default encoding > > to solve their problems; that's quite a likely outcome. If not, we > > have a real-world argument for allowing a variable default encoding, > > without carnage.) 
> > Fair enough :-) > > > > However, one should not forget that UTF-8 is in fact a > > > variable length encoding of Unicode characters, that is up to > > > 3 bytes form a *single* character. This is obviously not compatible > > > with definitions that explicitly state data to be using a > > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > > work like it does in Latin-1 text. > > > > Sure, but where in current Python are there such requirements? > > It was my understanding that "t#" refers to single byte character > data. That's where the above arguments were aiming at... t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing. > > > So if we are to do the integration, we'll have to choose > > > argument parser markers that allow for multi byte characters. > > > "t#" does not fall into this category, "s#" certainly does, > > > "s" is argueable. > > > > I disagree. I grepped through the source for s# and t#. Here's a bit > > of background. Before t# was introduced, s# was being used for two > > distinct purposes: (1) to get an 8-bit text string plus its length, in > > situations where the length was needed; (2) to get binary data (e.g. > > GIF data read from a file in "rb" mode). Greg pointed out that if we > > ever introduced some form of Unicode support, these two had to be > > disambiguated. We found that the majority of uses was for (2)! > > Therefore we decided to change the definition of s# to mean only (2), > > and introduced t# to mean (1). Also, we introduced getcharbuffer > > corresponding to t#, while getreadbuffer was meant for s#. > > I know its too late now, but I can't really follow the arguments > here: in what ways are (1) and (2) different from the implementations > point of view ? If "t#" is to return UTF-8 then <length of the > buffer> will not equal <text length>, so both parser markers return > essentially the same information. The only difference would be > on the semantic side: (1) means: give me text data, while (2) does > not specify the data type. > > Perhaps I'm missing something... The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.) > > Note that the definition of the 's' format was left alone -- as > > before, it means you need an 8-bit text string not containing null > > bytes. > > This definition should then be changed to "text string without > null bytes" dropping the 8-bit reference. Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? 
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly? > > Our expectation was that a Unicode string passed to an s# situation > > would give a pointer to the internal format plus a byte count (not a > > character count!) while t# would get a pointer to some kind of 8-bit > > translation/encoding plus a byte count, with the explicit requirement > > that the 8-bit translation would have the same lifetime as the > > original unicode object. We decided to leave it up to the next > > generation (i.e., Marc-Andre :-) to decide what kind of translation to > > use and what to do when there is no reasonable translation. > > Hmm, I would strongly object to making "s#" return the internal > format. file.write() would then default to writing UTF-16 data > instead of UTF-8 data. This could result in strange errors > due to the UTF-16 format being endian dependent. But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode. > It would also break the symmetry between file.write(u) and > unicode(file.read()), since the default encoding is not used as > internal format for other reasons (see proposal). If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes and check for a BOM and then decide to choose bewteen 'utf-16-be' and 'utf-16-le'.) > > Any of the following choices is acceptable (from the point of view of > > not breaking the intended t# semantics; we can now start deciding > > which we like best): > > I think we have already agreed on using UTF-8 for the default > encoding. It has quite a few advantages. See > > http://czyborra.com/utf/ > > for a good overview of the pros and cons. Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means). > > - utf-8 > > - latin-1 > > - ascii > > - shift-jis > > - lower byte of unicode ordinal > > - some user- or os-specified multibyte encoding > > > > As far as t# is concerned, for encodings that don't encode all of > > Unicode, untranslatable characters could be dealt with in any number > > of ways (raise an exception, ignore, replace with '?', make best > > effort, etc.). > > The usual Python way would be: raise an exception. This is what > the proposal defines for Codecs in case an encoding/decoding > mapping is not possible, BTW. (UTF-8 will always succeed on > output.) Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings. > > Given the current context, it should probably be the same as the > > default encoding -- i.e., utf-8. 
If we end up making the default > > user-settable, we'll have to decide what to do with untranslatable > > characters -- but that will probably be decided by the user too (it > > would be a property of a specific translation specification). > > > > In any case, I feel that t# could receive a multi-byte encoding, > > s# should receive raw binary data, and they should correspond to > > getcharbuffer and getreadbuffer, respectively. > > Why would you want to have "s#" return the raw binary data for > Unicode objects ? Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits. > Note that it is not mentioned anywhere that > "s#" and "t#" do have to necessarily return different things > (binary being a superset of text). I'd opt for "s#" and "t#" both > returning UTF-8 data. This can be implemented by delegating the > buffer slots to the <defencstr> object (see below). This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this. > > > Now Greg would chime in with the buffer interface and > > > argue that it should make the underlying internal > > > format accessible. This is a bad idea, IMHO, since you > > > shouldn't really have to know what the internal data format > > > is. > > > > This is for C code. Quite likely it *does* know what the internal > > data format is! > > C code can use the PyUnicode_* APIs to access the data. I > don't think that argument parsing is powerful enough to > provide the C code with enough information about the data > contents, e.g. it can only state the encoding length, not the > string length. Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway). > > > Defining "s#" to return UTF-8 data does not only > > > make "s" and "s#" return the same data format (which should > > > always be the case, IMO), > > > > That was before t# was introduced. No more, alas. If you replace s# > > with t#, I agree with you completely. > > Done :-) > > > > but also hides the internal > > > format from the user and gives him a reliable cross-platform > > > data representation of Unicode data (note that UTF-8 doesn't > > > have the byte order problems of UTF-16). > > > > > > If you are still with, let's look at what "s" and "s#" > > > > (and t#, which is more relevant here) > > > > > do: they return pointers into data areas which have to > > > be kept alive until the corresponding object dies. > > > > > > The only way to support this feature is by allocating > > > a buffer for just this purpose (on the fly and only if > > > needed to prevent excessive memory load). The other > > > options of adding new magic parser markers or switching > > > to more generic one all have one downside: you need to > > > change existing code which is in conflict with the idea > > > we started out with. > > > > Agreed. I think this was our thinking when Greg & I introduced t#. > > My own preference would be to allocate a whole string object, not > > just a buffer; this could then also be used for the .encode() method > > using the default encoding. > > Good point. 
I'll change <defencbuf> to <defencstr>, a Python > string object created on request. > > > > So, again, the question is: do we want this magical > > > intergration or not ? Note that this is a design question, > > > not one of memory consumption... > > > > Yes, I want it. > > > > Note that this doesn't guarantee that all old extensions will work > > flawlessly when passed Unicode objects; but I think that it covers > > most cases where you could have a reasonable expectation that it > > works. > > > > (Hm, unfortunately many reasonable expectations seem to involve > > the current user's preferred encoding. :-( ) > > -- > Marc-Andre Lemburg --Guido van Rossum (home page: http://www.python.org/~guido/) From Mike.Da.Silva at uk.fid-intl.com Mon Nov 15 17:01:59 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Mon, 15 Nov 1999 16:01:59 -0000 Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Andy Robinson wrote: 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? A user defined chunking factor (suitably defaulted) would be useful for processing large files. 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. The problem here is that we need to decide whether we are Unicode-centric, or whether Unicode is just another supported encoding. 
If we are Unicode-centric, then all code-page translations will require static mapping tables between the appropriate Unicode character and the relevant code points in the other encoding. This would involve (worst case) 64k static tables for each supported encoding. Unfortunately this also precludes the use of algorithmic conversions and or sparse conversion tables because most of these transformations are relative to a source and target non-Unicode encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see CDRA), then we can mix and match approaches, and treat Unicode strings as just Unicode, and normal strings as being any arbitrary MBCS encoding. To guarantee the utmost interoperability and Unicode 3.0 (and beyond) compliance, we should probably assume that all core encodings are relative to Unicode as the pivot encoding. This should hopefully avoid any gotcha's with roundtrips between any two arbitrary native encodings. The downside is this will probably be slower than an optimised algorithmic transformation. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org <http://www.python.org> ? Should there be an optional package outside the main distribution? Ship with Unicode encodings in the core, the rest should be an add on package. If we are truly Unicode-centric, this gives us the most value in terms of accessing a Unicode character properties database, which will provide language neutral case folding, Hankaku <----> Zenkaku folding (Japan specific), and composition / normalisation between composed characters and their component nonspacing characters. Regards, Mike da Silva From captainrobbo at yahoo.com Mon Nov 15 17:18:13 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST) Subject: [Python-Dev] just say no... Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com> --- Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > Did you read Andy Robinson's case study? He > suggested that for certain encodings there may be > other things you can do that are more > user-friendly than raising an exception, depending > on the application. I am proposing to leave this a > detail of each specific translation. > There may even be translations that do the same thing > except they have a different behavior for > untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version > that replaces bad characters with '?'. I think this > is one of the powers of having an extensible set of > encodings. This would be a desirable option in almost every case. Default is an exception (I want to know my data is not clean), but an option to specify an error character. It is usually a question mark but Mike tells me that some encodings specify the error character to use. Example - I query a Sybase Unicode database containing European accents or Japanese. By default it will give me question marks. If I issue the command 'set char_convert utf8', then I see the lot (as garbage, but never mind). If it always errored whenever a query result contained unexpected data, it would be almost impossible to maintain the database. 
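A rough sketch of how that choice could surface on the decoding side of a codec; the function, the table format and the 'strict'/'replace' names are all invented for illustration:

    def decode_single_byte(data, table, errors="strict", error_char=u"?"):
        # 'table' maps byte values to Unicode characters; unmapped bytes
        # either raise or are replaced, depending on 'errors'.
        result = []
        for pos, byte in enumerate(data):
            if byte in table:
                result.append(table[byte])
            elif errors == "strict":
                raise ValueError("undecodable byte %r at position %d" % (byte, pos))
            else:
                result.append(error_char)
        return u"".join(result)

    ascii_table = dict((i, chr(i)) for i in range(128))
    decode_single_byte(b"abc\xff", ascii_table, errors="replace")   # -> u'abc?'

A further mode that also records the position of each bad byte, as suggested just below, would simply be another value of the errors argument.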
If I wrote my own codec class for a family of encodings, I'd give it an even wider variety of error-logging options - maybe a mode where it told me where in the file the dodgy characters were. We've already taken the key step by allowing codecs to be separate objects registered at run-time, implemented in either C or Python. This means that once again Python will have the most flexible solution around. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From jim at digicool.com Mon Nov 15 17:29:13 1999 From: jim at digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:29:13 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <383034D9.6E1E74D4@digicool.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? I would prefer to see a different interface altogether: PyObject *PyErr_StringFormat(errtype, format, buildformat, ...) So, you could generate an error like this: return PyErr_StringFormat(ErrorObject, "You had too many, %d, foos. The last one was %s", "iO", n, someObject) I implemented this in cPickle. See cPickle_ErrFormat. (Note that it always returns NULL.) Jim -- Jim Fulton mailto:jim at digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From bwarsaw at cnri.reston.va.us Mon Nov 15 17:54:10 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us> >>>>> "Guido" == Guido van Rossum <guido at cnri.reston.va.us> writes: Guido> Assuming that Linux and Solaris have vsnprintf(), can't we Guido> just use the configure script to detect it, and issue a Guido> warning blaming the platform for those platforms that don't Guido> have it? That seems much simpler (from a maintenance Guido> perspective) than carrying our own implementation around Guido> (even if we can borrow the Apache version). Mailman uses vsnprintf in it's C wrapper. There's a simple configure test... # Checks for library functions. AC_CHECK_FUNCS(vsnprintf) ...and for systems that don't have a vsnprintf, I modified a version from GNU screen. 
It may not have gone through the scrutiny of Apache's implementation, but for Mailman it was more important that it be GPL'd (not a Python requirement). -Barry From jim at digicool.com Mon Nov 15 17:56:38 1999 From: jim at digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:56:38 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us> Message-ID: <38303B46.F6AEEDF1@digicool.com> "Andrew M. Kuchling" wrote: > > Guido van Rossum writes: > >Assuming that Linux and Solaris have vsnprintf(), can't we just use > >the configure script to detect it, and issue a warning blaming the > >platform for those platforms that don't have it? That seems much > > But people using an already-installed Python binary won't see any such > configure-time warning, and won't find out about the potential > problem. Plus, how do people fix the problem on platforms that don't > have vsnprintf() -- switch to Solaris or Linux? Not much of a > solution. (vsnprintf() isn't ANSI C, though it's a common extension, > so platforms that lack it aren't really deficient.) > > Hmm... could we maybe use Python's existing (string % vars) machinery? > <think think> No, that seems to be hard, because it would want > PyObjects, and we can't know what Python types to convert the varargs > to, unless we parse the format string (at which point we may as well > get a vsnprintf() implementation. It's easy. You use two format strings. One a Python string format, and the other a Py_BuildValue format. See my other note. Jim -- Jim Fulton mailto:jim at digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From tismer at appliedbiometrics.com Mon Nov 15 18:02:20 1999 From: tismer at appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 18:02:20 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <38303C9C.42C5C830@appliedbiometrics.com> Guido van Rossum wrote: > > > I noticed this in PyErr_Format(exception, format, va_alist): > > > > char buffer[500]; /* Caller is responsible for limiting the format */ > > ... > > vsprintf(buffer, format, vargs); > > > > Making the caller responsible for this is error-prone. > > Agreed. The limit of 500 chars, while technically undocumented, is > part of the specs for PyErr_Format (which is currently wholly > undocumented). The current callers all have explicit precautions, but > of course I agree that this is a potential danger. All but one (checked them all): In ceval.c, function call_builtin, there is a possible security hole. If an extension module happens to create a very long type name (maybe just via a bug), we will crash. } PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", func->ob_type->tp_name); return NULL; } ciao - chris -- Christian Tismer :^) <mailto:tismer at appliedbiometrics.com> Applied Biometrics GmbH : Have a break! 
Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From guido at CNRI.Reston.VA.US Mon Nov 15 20:32:00 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 14:32:00 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100." <38303C9C.42C5C830@appliedbiometrics.com> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us> > All but one (checked them all): Thanks for checking. > In ceval.c, function call_builtin, there is a possible security hole. > If an extension module happens to create a very long type name > (maybe just via a bug), we will crash. > > } > PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", > func->ob_type->tp_name); > return NULL; > } I would think that an extension module with a name of nearly 500 characters would draw a lot of attention as being ridiculous. If there was a bug through which you could make tp_name point to such a long string, you could probably exploit that bug without having to use this particular PyErr_Format() statement. However, I agree it's better to be safe than sorry, so I've checked in a fix making it %.400s. --Guido van Rossum (home page: http://www.python.org/~guido/) From tismer at appliedbiometrics.com Mon Nov 15 20:41:14 1999 From: tismer at appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 20:41:14 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us> Message-ID: <383061DA.CA5CB373@appliedbiometrics.com> Guido van Rossum wrote: > > > All but one (checked them all): [ceval.c without limits] > I would think that an extension module with a name of nearly 500 > characters would draw a lot of attention as being ridiculous. If > there was a bug through which you could make tp_name point to such a > long string, you could probably exploit that bug without having to use > this particular PyErr_Format() statement. Of course this case is very unlikely. My primary intent was to create such a mess without an extension, and ExtensionClass seemed to be a candidate since it synthetizes a type name at runtime (!). This would have been dangerous since EC is in the heart of Zope. But, I could not get at this special case since EC always stands the class/instance checks and so this case can never happen :( The above lousy result was just to say *something* after no success. > However, I agree it's better to be safe than sorry, so I've checked in > a fix making it %.400s. cheap, consistent, fine - thanks - chris -- Christian Tismer :^) <mailto:tismer at appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From mal at lemburg.com Mon Nov 15 20:04:59 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Mon, 15 Nov 1999 20:04:59 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us> Message-ID: <3830595B.348E8CC7@lemburg.com> Guido van Rossum wrote: > > [Misunderstanding in the reasoning behind "t#" and "s#"] > > Thanks for not picking an argument. Multibyte encodings typically > have ASCII as a subset (in such a way that an ASCII string is > represented as itself in bytes). This is the characteristic that's > needed in my view. > > > It was my understanding that "t#" refers to single byte character > > data. That's where the above arguments were aiming at... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > designed to be passed cleanly through processing steps that handle > single-byte character data, as long as they are 8-bit clean and don't > do too much processing. Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did. > > Perhaps I'm missing something... > > The idea is that (1)/s# disallows any translation of the data, while > (2)/t# requires translation of the data to an ASCII superset (possibly > multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data > contains text and that if the text consists of only ASCII characters > they are represented as themselves. (1)/s# makes no such assumption. > > In terms of implementation, Unicode objects should translate > themselves to the default encoding for t# (if possible), but they > should make the native representation available for s#. > > For example, take an encryption engine. While it is defined in terms > of byte streams, there's no requirement that the bytes represent > characters -- they could be the bytes of a GIF file, an MP3 file, or a > gzipped tar file. If we pass Unicode to an encryption engine, we want > Unicode to come out at the other end, not UTF-8. (If we had wanted to > encrypt UTF-8, we should have fed it UTF-8.) > > > > Note that the definition of the 's' format was left alone -- as > > > before, it means you need an 8-bit text string not containing null > > > bytes. > > > > This definition should then be changed to "text string without > > null bytes" dropping the 8-bit reference. > > Aha, I think there's a confusion about what "8-bit" means. For me, a > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? > (As far as I know, C uses char* to represent multibyte characters.) > Maybe we should disambiguate it more explicitly? There should be some definition for the two markers and the ideas behind them in the API guide, I guess. > > Hmm, I would strongly object to making "s#" return the internal > > format. file.write() would then default to writing UTF-16 data > > instead of UTF-8 data. This could result in strange errors > > due to the UTF-16 format being endian dependent. > > But this was the whole design. file.write() needs to be changed to > use s# when the file is open in binary mode and t# when the file is > open in text mode. Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). 
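As a toy model only, the text/binary split described above would amount to something like the following, where "utf-8" stands in for the default encoding and "utf-16-le" for the internal format (both are assumptions of this sketch, not settled choices):

    DEFAULT_ENCODING = "utf-8"      # stand-in for the default encoding
    INTERNAL_FORMAT = "utf-16-le"   # stand-in for the internal format

    def write_unicode(fileobj, ustr, text_mode):
        # fileobj is assumed to be opened for writing bytes.
        if text_mode:
            data = ustr.encode(DEFAULT_ENCODING)   # the "t#" path
        else:
            data = ustr.encode(INTERNAL_FORMAT)    # the "s#" path
        fileobj.write(data)

The same Unicode string would thus produce two different byte streams depending only on the mode the file was opened in, which is exactly where the FAQ potential comes from.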
I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if being passed Unicode objects: this probably won't get us the "magical" Unicode integration we invision, since "t#" usage is not very wide spread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods... > > It would also break the symmetry between file.write(u) and > > unicode(file.read()), since the default encoding is not used as > > internal format for other reasons (see proposal). > > If the file is encoded using UTF-16 or UCS-2, you should open it in > binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the > app should read the first 2 bytes and check for a BOM and then decide > to choose bewteen 'utf-16-be' and 'utf-16-le'.) Right, that's the idea (there is a note on this in the Standard Codec section of the proposal). > > > Any of the following choices is acceptable (from the point of view of > > > not breaking the intended t# semantics; we can now start deciding > > > which we like best): > > > > I think we have already agreed on using UTF-8 for the default > > encoding. It has quite a few advantages. See > > > > http://czyborra.com/utf/ > > > > for a good overview of the pros and cons. > > Of course. I was just presenting the list as an argument that if > we changed our mind about the default encoding, t# should follow the > default encoding (and not pick an encoding by other means). Ok. > > > - utf-8 > > > - latin-1 > > > - ascii > > > - shift-jis > > > - lower byte of unicode ordinal > > > - some user- or os-specified multibyte encoding > > > > > > As far as t# is concerned, for encodings that don't encode all of > > > Unicode, untranslatable characters could be dealt with in any number > > > of ways (raise an exception, ignore, replace with '?', make best > > > effort, etc.). > > > > The usual Python way would be: raise an exception. This is what > > the proposal defines for Codecs in case an encoding/decoding > > mapping is not possible, BTW. (UTF-8 will always succeed on > > output.) > > Did you read Andy Robinson's case study? He suggested that for > certain encodings there may be other things you can do that are more > user-friendly than raising an exception, depending on the application. > I am proposing to leave this a detail of each specific translation. > There may even be translations that do the same thing except they have > a different behavior for untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version that replaces bad > characters with '?'. I think this is one of the powers of having an > extensible set of encodings. Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal. > > > Given the current context, it should probably be the same as the > > > default encoding -- i.e., utf-8. If we end up making the default > > > user-settable, we'll have to decide what to do with untranslatable > > > characters -- but that will probably be decided by the user too (it > > > would be a property of a specific translation specification). > > > > > > In any case, I feel that t# could receive a multi-byte encoding, > > > s# should receive raw binary data, and they should correspond to > > > getcharbuffer and getreadbuffer, respectively. > > > > Why would you want to have "s#" return the raw binary data for > > Unicode objects ? 
> > Because file.write() for a binary file, and other similar things > (e.g. the encryption engine example I mentioned above) must have > *some* way to get at the raw bits. What for ? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation because both encodings are in fact "internal" w/r to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed. > > Note that it is not mentioned anywhere that > > "s#" and "t#" do have to necessarily return different things > > (binary being a superset of text). I'd opt for "s#" and "t#" both > > returning UTF-8 data. This can be implemented by delegating the > > buffer slots to the <defencstr> object (see below). > > This would defeat the whole purpose of introducing t#. We might as > well drop t# then altogether if we adopt this. Well... yes ;-) > > > > Now Greg would chime in with the buffer interface and > > > > argue that it should make the underlying internal > > > > format accessible. This is a bad idea, IMHO, since you > > > > shouldn't really have to know what the internal data format > > > > is. > > > > > > This is for C code. Quite likely it *does* know what the internal > > > data format is! > > > > C code can use the PyUnicode_* APIs to access the data. I > > don't think that argument parsing is powerful enough to > > provide the C code with enough information about the data > > contents, e.g. it can only state the encoding length, not the > > string length. > > Typically, all the C code does is pass multibyte encoded strings on to > other library routines that know what to do to them, or simply give > them back unchanged at a later time. It is essential to know the > number of bytes, for memory allocation purposes. The number of > characters is totally immaterial (and multibyte-handling code knows > how to calculate the number of characters anyway). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Mon Nov 15 20:20:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:20:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Message-ID: <38305D17.60EC94D0@lemburg.com> Andy Robinson wrote: > > Some thoughts on the codecs... > > 1. Stream interface > At the moment a codec has dump and load methods which > read a (slice of a) stream into a string in memory and > vice versa. As the proposal notes, this could lead to > errors if you take a slice out of a stream. This is > not just due to character truncation; some Asian > encodings are modal and have shift-in and shift-out > sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit > pointless to me as the source (or target) is still a > Unicode string in memory. > > This is a real problem - a filter to convert big files > between two encodings should be possible without > knowledge of the particular encoding, as should one on > the input/output of some server. We can still give a > default implementation for single-byte encodings. > > What's a good API for real stream conversion? 
just > Codec.encodeStream(infile, outfile) ? or is it more > useful to feed the codec with data a chunk at a time? The idea was to use Unicode as intermediate for all encoding conversions. What you invision here are stream recoders. The can easily be implemented as an useful addition to the Codec subclasses, but I don't think that these have to go into the core. > 2. Data driven codecs > I really like codecs being objects, and believe we > could build support for a lot more encodings, a lot > sooner than is otherwise possible, by making them data > driven rather making each one compiled C code with > static mapping tables. What do people think about the > approach below? > > First of all, the ISO8859-1 series are straight > mappings to Unicode code points. So one Python script > could parse these files and build the mapping table, > and a very small data file could hold these encodings. > A compiled helper function analogous to > string.translate() could deal with most of them. The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table. Static C data has the advantage of being shareable at the OS level. You can of course implement Python based lookup tables, but these should be too large... > Secondly, the double-byte ones involve a mixture of > algorithms and data. The worst cases I know are modal > encodings which need a single-byte lookup table, a > double-byte lookup table, and have some very simple > rules about escape sequences in between them. A > simple state machine could still handle these (and the > single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally > data-driven set of rules. > > Third, we can massively compress the mapping tables > using a notation which just lists contiguous ranges; > and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but > with an extra 'smiley' at 0XFE32". In these cases, a > script can build a family of related codecs in an > auditable manner. These are all great ideas, but I think they unnecessarily complicate the proposal. > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or > 'everything that Unicode 3.0 has a mapping for'. The > latter is going to add some weight to the > distribution. What are people's feelings? Do we ship > any at all apart from the Unicode ones? Should new > encodings be downloadable from www.python.org? Should > there be an optional package outside the main > distribution? Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds. I would propose to only add some very basic encodings to the standard distribution, e.g. 
the ones mentioned under Standard Codecs in the proposal: 'utf-8': 8-bit variable length encoding 'utf-16': 16-bit variable length encoding (litte/big endian) 'utf-16-le': utf-16 but explicitly little endian 'utf-16-be': utf-16 but explicitly big endian 'ascii': 7-bit ASCII codepage 'latin-1': Latin-1 codepage 'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Pythin Lib 'jis' (a popular version XXX): Japanese character encoding 'unicode-escape': See Unicode Constructors for a definition 'native': Dump of the Internal Format used by Python Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Mon Nov 15 20:26:16 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:26:16 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <38305E58.28B20E24@lemburg.com> "Da Silva, Mike" wrote: > > Andy Robinson wrote: > -- > 1. Stream interface > At the moment a codec has dump and load methods which read a (slice of a) > stream into a string in memory and vice versa. As the proposal notes, this > could lead to errors if you take a slice out of a stream. This is not just > due to character truncation; some Asian encodings are modal and have > shift-in and shift-out sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit pointless to me as the > source (or target) is still a Unicode string in memory. > This is a real problem - a filter to convert big files between two encodings > should be possible without knowledge of the particular encoding, as should > one on the input/output of some server. We can still give a default > implementation for single-byte encodings. > What's a good API for real stream conversion? just > Codec.encodeStream(infile, outfile) ? or is it more useful to feed the > codec with data a chunk at a time? > -- > A user defined chunking factor (suitably defaulted) would be useful for > processing large files. > -- > 2. Data driven codecs > I really like codecs being objects, and believe we could build support for a > lot more encodings, a lot sooner than is otherwise possible, by making them > data driven rather making each one compiled C code with static mapping > tables. What do people think about the approach below? > First of all, the ISO8859-1 series are straight mappings to Unicode code > points. So one Python script could parse these files and build the mapping > table, and a very small data file could hold these encodings. A compiled > helper function analogous to string.translate() could deal with most of > them. > Secondly, the double-byte ones involve a mixture of algorithms and data. > The worst cases I know are modal encodings which need a single-byte lookup > table, a double-byte lookup table, and have some very simple rules about > escape sequences in between them. A simple state machine could still handle > these (and the single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally data-driven set of rules. 
> Third, we can massively compress the mapping tables using a notation which > just lists contiguous ranges; and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but with an extra > 'smiley' at 0XFE32". In these cases, a script can build a family of related > codecs in an auditable manner. > -- > The problem here is that we need to decide whether we are Unicode-centric, > or whether Unicode is just another supported encoding. If we are > Unicode-centric, then all code-page translations will require static mapping > tables between the appropriate Unicode character and the relevant code > points in the other encoding. This would involve (worst case) 64k static > tables for each supported encoding. Unfortunately this also precludes the > use of algorithmic conversions and or sparse conversion tables because most > of these transformations are relative to a source and target non-Unicode > encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see > CDRA), then we can mix and match approaches, and treat Unicode strings as > just Unicode, and normal strings as being any arbitrary MBCS encoding. > > To guarantee the utmost interoperability and Unicode 3.0 (and beyond) > compliance, we should probably assume that all core encodings are relative > to Unicode as the pivot encoding. This should hopefully avoid any gotcha's > with roundtrips between any two arbitrary native encodings. The downside is > this will probably be slower than an optimised algorithmic transformation. Optimizations should go into separate packages for direct EncodingA -> EncodingB conversions. I don't think we need them in the core. > -- > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or 'everything that > Unicode 3.0 has a mapping for'. The latter is going to add some weight to > the distribution. What are people's feelings? Do we ship any at all apart > from the Unicode ones? Should new encodings be downloadable from > www.python.org <http://www.python.org> ? Should there be an optional > package outside the main distribution? > -- > Ship with Unicode encodings in the core, the rest should be an add on > package. > > If we are truly Unicode-centric, this gives us the most value in terms of > accessing a Unicode character properties database, which will provide > language neutral case folding, Hankaku <----> Zenkaku folding (Japan > specific), and composition / normalisation between composed characters and > their component nonspacing characters. >From the proposal: """ Unicode Character Properties: ----------------------------- A separate module "unicodedata" should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file. Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc. Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt which takes up around 200kB. For this reason, the data should be stored in static C data. This enables compilation as shared module which the underlying OS can shared between processes (unlike normal Python code modules). XXX Define the interface... """ Special CJK packages can then access this data for the purposes you mentioned above. 
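To make the data-driven idea quoted above a little more concrete, here is a toy state machine for a modal encoding whose shift-out/shift-in bytes switch between a single-byte table and a double-byte table. Every table and byte value here is invented for the sketch; a real codec would load such rules from data files:

    SHIFT_OUT = 0x0E   # switch to double-byte mode
    SHIFT_IN = 0x0F    # switch back to single-byte mode

    def decode_modal(data, single_table, double_table):
        result = []
        double_mode = False
        i = 0
        while i < len(data):
            byte = data[i]
            if byte == SHIFT_OUT:
                double_mode = True
                i = i + 1
            elif byte == SHIFT_IN:
                double_mode = False
                i = i + 1
            elif double_mode:
                result.append(double_table[(data[i], data[i + 1])])
                i = i + 2
            else:
                result.append(single_table[byte])
                i = i + 1
        return u"".join(result)

    single = {0x41: u"A", 0x42: u"B"}
    double = {(0x21, 0x22): u"\u3042"}   # an invented mapping to a kana
    assert decode_modal(b"\x41\x0e\x21\x22\x0f\x42", single, double) == u"A\u3042B"

The single-byte encodings then fall out as the degenerate case with no shift bytes and an empty double-byte table, and the tables themselves could be generated by a script from the compressed range notation discussed above.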
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Mon Nov 15 22:37:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:37:28 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100." <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us> > Andy Robinson wrote: > > > > Some thoughts on the codecs... > > > > 1. Stream interface > > At the moment a codec has dump and load methods which > > read a (slice of a) stream into a string in memory and > > vice versa. As the proposal notes, this could lead to > > errors if you take a slice out of a stream. This is > > not just due to character truncation; some Asian > > encodings are modal and have shift-in and shift-out > > sequences as they move from Western single-byte > > characters to double-byte ones. It also seems a bit > > pointless to me as the source (or target) is still a > > Unicode string in memory. > > > > This is a real problem - a filter to convert big files > > between two encodings should be possible without > > knowledge of the particular encoding, as should one on > > the input/output of some server. We can still give a > > default implementation for single-byte encodings. > > > > What's a good API for real stream conversion? just > > Codec.encodeStream(infile, outfile) ? or is it more > > useful to feed the codec with data a chunk at a time? M.-A. Lemburg responds: > The idea was to use Unicode as intermediate for all > encoding conversions. > > What you invision here are stream recoders. The can > easily be implemented as an useful addition to the Codec > subclasses, but I don't think that these have to go > into the core. What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to efficient handle shift states. This is not exactly what Andy shows, but it's not what Marc's current spec has either. I had thought something more like what Java does: an output stream codec's constructor takes a writable file object and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequeuces. Similar for input stream codecs. Andy's file translation example could then be written as follows: # assuming variables input_file, input_encoding, output_file, # output_encoding, and constant BUFFER_SIZE f = open(input_file, "rb") f1 = unicodec.codecs[input_encoding].stream_reader(f) g = open(output_file, "wb") g1 = unicodec.codecs[output_encoding].stream_writer(f) while 1: buffer = f1.read(BUFFER_SIZE) if not buffer: break f2.write(buffer) f2.close() f1.close() Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.) > > 2. 
Data driven codecs > > I really like codecs being objects, and believe we > > could build support for a lot more encodings, a lot > > sooner than is otherwise possible, by making them data > > driven rather making each one compiled C code with > > static mapping tables. What do people think about the > > approach below? > > > > First of all, the ISO8859-1 series are straight > > mappings to Unicode code points. So one Python script > > could parse these files and build the mapping table, > > and a very small data file could hold these encodings. > > A compiled helper function analogous to > > string.translate() could deal with most of them. > > The problem with these large tables is that currently > Python modules are not shared among processes since > every process builds its own table. > > Static C data has the advantage of being shareable at > the OS level. Don't worry about it. 128K is too small to care, I think... > You can of course implement Python based lookup tables, > but these should be too large... > > > Secondly, the double-byte ones involve a mixture of > > algorithms and data. The worst cases I know are modal > > encodings which need a single-byte lookup table, a > > double-byte lookup table, and have some very simple > > rules about escape sequences in between them. A > > simple state machine could still handle these (and the > > single-byte mappings above become extra-simple special > > cases); I could imagine feeding it a totally > > data-driven set of rules. > > > > Third, we can massively compress the mapping tables > > using a notation which just lists contiguous ranges; > > and very often there are relationships between > > encodings. For example, "cpXYZ is just like cpXYY but > > with an extra 'smiley' at 0XFE32". In these cases, a > > script can build a family of related codecs in an > > auditable manner. > > These are all great ideas, but I think they unnecessarily > complicate the proposal. Agreed, let's leave the *implementation* of codecs out of the current efforts. However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvald's philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.) > > 3. What encodings to distribute? > > The only clean answers to this are 'almost none', or > > 'everything that Unicode 3.0 has a mapping for'. The > > latter is going to add some weight to the > > distribution. What are people's feelings? Do we ship > > any at all apart from the Unicode ones? Should new > > encodings be downloadable from www.python.org? Should > > there be an optional package outside the main > > distribution? > > Since Codecs can be registered at runtime, there is quite > some potential there for extension writers coding their > own fast codecs. E.g. one could use mxTextTools as codec > engine working at C speeds. (Do you think you'll be able to extort some money from HP for these? :-) > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > > Perhaps not even 'html-entities' (even though it would make > a cool replacement for cgi.escape()) and maybe we should > also place the JIS encoding into a separate Unicode package. I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?) For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation. And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here: | Python should provide a built-in constructor for Unicode strings which | is available through __builtins__: | | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ What do you mean by this notation? Since encoding names are not always legal Python identifiers (most contain hyphens), I don't understand what you really meant here. Do you mean to say that it has to be a keyword argument? I would disagree; and then I would have expected the notation [,encoding=<default encoding>]. | With the 'unicode-escape' encoding being defined as: | | u = u'<unicode-escape encoded Python string>' | | ? for single characters (and this includes all \XXX sequences except \uXXXX), | take the ordinal and interpret it as Unicode ordinal; | | ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX | instead, e.g. \u03C0 to represent the character Pi. I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-). Can you give some examples? Is u'\u0020' different from u'\x20' (a space)? Does '\u0020' (no u prefix) have a meaning? Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree. While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named Fran?ois might write a file containing this statement: print "Written by Fran?ois." # (There's a cedilla in there!) (He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.) If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage. Using the new Unicode strings, Fran?ois could change his program as follows: print unicode("Written by Fran?ois.", "latin-1") Assuming that Fran?ois sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). But when the Japanese user views Fran?ois' source file, he will again see garbage. 
If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis". What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; François will then have to type print u"Written by Fran\u00E7ois." but, knowing François, he probably won't like this solution very much (since he didn't like the \347 version either). --Guido van Rossum (home page: http://www.python.org/~guido/) From andy at robanal.demon.co.uk Mon Nov 15 22:41:21 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:41:21 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <38307984.12653394@post.demon.co.uk> On Mon, 15 Nov 1999 20:20:55 +0100, you wrote: >These are all great ideas, but I think they unnecessarily >complicate the proposal. However, to claim that Python is properly internationalized, we will need a large number of multi-byte encodings to be available. It's a large amount of work, it must be provably correct, and someone's going to have to do it. So if anyone with more C expertise than me - not hard :-) - is interested... I'm not suggesting putting my points in the Unicode proposal - in fact, I'm very happy we have a proposal which allows for extension, and lets us work on the encodings separately (and later). >Since Codecs can be registered at runtime, there is quite >some potential there for extension writers coding their >own fast codecs. E.g. one could use mxTextTools as codec >engine working at C speeds. Exactly my thoughts, although I was thinking of a more slimmed down and specialized one. The right tool might be usable for things like compression algorithms too. Separate project to the Unicode stuff, but if anyone is interested, talk to me. >I would propose to only add some very basic encodings to >the standard distribution, e.g. the ones mentioned under >Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (little/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Python Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core. Ditto with HTML, and maybe even escaped-unicode too. In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully. - Andy From guido at CNRI.Reston.VA.US Mon Nov 15 22:49:26 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:49:26 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT."
<38307984.12653394@post.demon.co.uk> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us> > In summary, the current discussion is clearly doing the right things, > but is only covering a small percentage of what needs to be done to > internationalize Python fully. Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture! Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.) --Guido van Rossum (home page: http://www.python.org/~guido/) From andy at robanal.demon.co.uk Mon Nov 15 22:58:34 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:58:34 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us> Message-ID: <3831806d.14422147@post.demon.co.uk> On Mon, 15 Nov 1999 16:49:26 -0500, you wrote: >> In summary, the current discussion is clearly doing the right things, >> but is only covering a small percentage of what needs to be done to >> internationalize Python fully. > >Agreed. So let's focus on defining interfaces that are correct and >convenient so others who want to add codecs won't have to fight our >architecture! > >Is the current architecture good enough so that the Japanese codecs >will fit in it? (I'm particularly worried about the stream codecs, >see my previous message.) > No, I don't think it is good enough. We need a stream codec, and as you said the string and file interfaces can be built out of that. You guys will know better than me what the best patterns for that are... - Andy From andy at robanal.demon.co.uk Mon Nov 15 23:30:53 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 22:30:53 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <383086da.16067684@post.demon.co.uk> On Mon, 15 Nov 1999 16:37:28 -0500, you wrote: ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(g) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > g1.write(buffer) > >g1.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) Perfect. I'd keep the string ones - easy to implement but a big convenience.
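For what it's worth, the string-level conversions really can be layered on top of the stream interface with cStringIO, along the lines Guido sketches above. A rough sketch follows -- the unicodec module, its codecs registry and the stream_reader/stream_writer methods are all still hypothetical at this point:

import cStringIO
import unicodec   # hypothetical: the module proposed in the draft, not yet written

def decode_string(data, encoding):
    # wrap the byte string in a file-like object and pull it through
    # the stream reader for the given encoding
    reader = unicodec.codecs[encoding].stream_reader(cStringIO.StringIO(data))
    return reader.read()              # -> Unicode object

def encode_string(u, encoding):
    out = cStringIO.StringIO()
    writer = unicodec.codecs[encoding].stream_writer(out)
    writer.write(u)                   # Unicode in ...
    writer.flush()                    # emit any pending shift sequences
    return out.getvalue()             # ... encoded bytes out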
The proposal also says: >For explicit handling of Unicode using files, the unicodec module >could provide stream wrappers which provide transparent >encoding/decoding for any open stream (file-like object): > > import unicodec > file = open('mytext.txt','rb') > ufile = unicodec.stream(file,'utf-16') > u = ufile.read() > ... > ufile.close() It seems to me that if we go for stream_reader, it replaces this bit of the proposal too - no need for unicodec to provide anything. If you want to have a convenience function there to save a line or two, you could have unicodec.open(filename, mode, encoding) which returned a stream_reader. - Andy From mal at lemburg.com Mon Nov 15 23:54:38 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 23:54:38 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <38308F2E.44B9C6BF@lemburg.com> [I'll get back on this tomorrow, just some quick notes here...] Guido van Rossum wrote: > > > Andy Robinson wrote: > > > > > > Some thoughts on the codecs... > > > > > > 1. Stream interface > > > At the moment a codec has dump and load methods which > > > read a (slice of a) stream into a string in memory and > > > vice versa. As the proposal notes, this could lead to > > > errors if you take a slice out of a stream. This is > > > not just due to character truncation; some Asian > > > encodings are modal and have shift-in and shift-out > > > sequences as they move from Western single-byte > > > characters to double-byte ones. It also seems a bit > > > pointless to me as the source (or target) is still a > > > Unicode string in memory. > > > > > > This is a real problem - a filter to convert big files > > > between two encodings should be possible without > > > knowledge of the particular encoding, as should one on > > > the input/output of some server. We can still give a > > > default implementation for single-byte encodings. > > > > > > What's a good API for real stream conversion? just > > > Codec.encodeStream(infile, outfile) ? or is it more > > > useful to feed the codec with data a chunk at a time? > > M.-A. Lemburg responds: > > > The idea was to use Unicode as intermediate for all > > encoding conversions. > > > > What you envision here are stream recoders. They can > > easily be implemented as a useful addition to the Codec > > subclasses, but I don't think that these have to go > > into the core. > > What I wanted was a codec API that acts somewhat like a buffered file; > the buffer makes it possible to efficiently handle shift states. This > is not exactly what Andy shows, but it's not what Marc's current spec > has either. > > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequences. Similar for input stream codecs. The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets. The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work.
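Just to make the intent concrete, such a wrapper might look roughly like this -- the method names, the registry lookup and the (non-existent) buffering are all placeholders, not settled API:

class UnicodeStream:
    # illustrative only: wraps any file-like object, decoding on read
    # and encoding on write via a codec obtained from the registry
    def __init__(self, stream, encoding):
        self.stream = stream
        self.codec = unicodec.registry[encoding]   # placeholder lookup

    def read(self, size=-1):
        data = self.stream.read(size)
        return self.codec.decode(data)             # bytes -> Unicode

    def write(self, u):
        self.stream.write(self.codec.encode(u))    # Unicode -> bytes

    def close(self):
        self.stream.close()

As the stream discussion quoted above shows, a naive read(size) like this one still has the truncation and shift-state problem for the stateful encodings, which is exactly why per-codec stream reader/writer objects are wanted underneath.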
>From the proposal: """ For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. XXX Specify the wrapper(s)... Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. """ > Andy's file translation example could then be written as follows: > > # assuming variables input_file, input_encoding, output_file, > # output_encoding, and constant BUFFER_SIZE > > f = open(input_file, "rb") > f1 = unicodec.codecs[input_encoding].stream_reader(f) > g = open(output_file, "wb") > g1 = unicodec.codecs[output_encoding].stream_writer(f) > > while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > > f2.close() > f1.close() > Note that we could possibly make these the only API that a codec needs > to provide; the string object <--> unicode object conversions can be > done using this and the cStringIO module. (On the other hand it seems > a common case that would be quite useful.) You wouldn't want to go via cStringIO for *every* encoding translation. The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects). > > > 2. Data driven codecs > > > I really like codecs being objects, and believe we > > > could build support for a lot more encodings, a lot > > > sooner than is otherwise possible, by making them data > > > driven rather making each one compiled C code with > > > static mapping tables. What do people think about the > > > approach below? > > > > > > First of all, the ISO8859-1 series are straight > > > mappings to Unicode code points. So one Python script > > > could parse these files and build the mapping table, > > > and a very small data file could hold these encodings. > > > A compiled helper function analogous to > > > string.translate() could deal with most of them. > > > > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... Huh ? 128K for every process using Python ? That quickly sums up to lots of megabytes lying around pretty much unused. > > You can of course implement Python based lookup tables, > > but these should be too large... > > > > > Secondly, the double-byte ones involve a mixture of > > > algorithms and data. The worst cases I know are modal > > > encodings which need a single-byte lookup table, a > > > double-byte lookup table, and have some very simple > > > rules about escape sequences in between them. A > > > simple state machine could still handle these (and the > > > single-byte mappings above become extra-simple special > > > cases); I could imagine feeding it a totally > > > data-driven set of rules. 
> > > > > > Third, we can massively compress the mapping tables > > > using a notation which just lists contiguous ranges; > > > and very often there are relationships between > > > encodings. For example, "cpXYZ is just like cpXYY but > > > with an extra 'smiley' at 0XFE32". In these cases, a > > > script can build a family of related codecs in an > > > auditable manner. > > > > These are all great ideas, but I think they unnecessarily > > complicate the proposal. > > Agreed, let's leave the *implementation* of codecs out of the current > efforts. > > However I want to make sure that the *interface* to codecs is defined > right, because changing it will be expensive. (This is Linus > Torvald's philosophy on drivers -- he doesn't care about bugs in > drivers, as they will get fixed; however he greatly cares about > defining the driver APIs correctly.) > > > > 3. What encodings to distribute? > > > The only clean answers to this are 'almost none', or > > > 'everything that Unicode 3.0 has a mapping for'. The > > > latter is going to add some weight to the > > > distribution. What are people's feelings? Do we ship > > > any at all apart from the Unicode ones? Should new > > > encodings be downloadable from www.python.org? Should > > > there be an optional package outside the main > > > distribution? > > > > Since Codecs can be registered at runtime, there is quite > > some potential there for extension writers coding their > > own fast codecs. E.g. one could use mxTextTools as codec > > engine working at C speeds. > > (Do you think you'll be able to extort some money from HP for these? :-) Don't know, it depends on what their specs look like. I use mxTextTools for fast HTML file processing. It uses a small Turing machine with some extra magic and is progammable via Python tuples. > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > > > Perhaps not even 'html-entities' (even though it would make > > a cool replacement for cgi.escape()) and maybe we should > > also place the JIS encoding into a separate Unicode package. > > I'd drop html-entities, it seems too cutesie. (And who uses these > anyway, outside browsers?) Ok. > For JIS (shift-JIS?) I hope that Andy can help us with some pointers > and validation. > > And unicode-escape: now that you mention it, this is a section of > the proposal that I don't understand. I quote it here: > > | Python should provide a built-in constructor for Unicode strings which > | is available through __builtins__: > | > | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ I meant this as optional second argument defaulting to whatever we define <default encoding> to mean, e.g. 'utf-8'. u = unicode("string","utf-8") == unicode("string") The <encoding name> argument must be a string identifying one of the registered codecs. 
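Spelled out (with 'utf-8' standing in for whatever <default encoding> ends up being), the intended call forms are simply:

u1 = unicode("abc")                      # decoded using the default encoding
u2 = unicode("abc", "utf-8")             # the same thing, written out
u3 = unicode("Fran\xe7ois", "latin-1")   # explicit codec, ordinary string argument

i.e. the encoding name is a plain positional string argument; the [,<encoding name>=<default encoding>] in the proposal is only notation for "optional, with a default", not keyword-argument syntax.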
> | With the 'unicode-escape' encoding being defined as: > | > | u = u'<unicode-escape encoded Python string>' > | > | ? for single characters (and this includes all \XXX sequences except \uXXXX), > | take the ordinal and interpret it as Unicode ordinal; > | > | ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX > | instead, e.g. \u03C0 to represent the character Pi. > > I've looked at this several times and I don't see the difference > between the two bullets. (Ironically, you are using a non-ASCII > character here that doesn't always display, depending on where I look > at your mail :-). The first bullet covers the normal Python string characters and escapes, e.g. \n and \267 (the center dot ;-), while the second explains how \uXXXX is interpreted. > Can you give some examples? > > Is u'\u0020' different from u'\x20' (a space)? No, they both map to the same Unicode ordinal. > Does '\u0020' (no u prefix) have a meaning? No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding: u = u'\u0020' == unicode(r'\u0020','unicode-escape') Note that writing \uXX is an error, e.g. u"\u12 " will cause cause a syntax error. Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ? > Also, I remember reading Tim Peters who suggested that a "raw unicode" > notation (ur"...") might be necessary, to encode regular expressions. > I tend to agree. This can be had via unicode(): u = unicode(r'\a\b\c\u0020','unicode-escaped') If that's too long, define a ur() function which wraps up the above line in a function. > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. For > example, a programmer named Fran?ois might write a file containing > this statement: > > print "Written by Fran?ois." # (There's a cedilla in there!) > > (He assumes his source character encoding is Latin-1, and he doesn't > want to have to type \347 when he can type a cedilla on his keyboard.) > > If his source file (or .pyc file!) is executed by a Japanese user, > this will probably print some garbage. > > Using the new Unicode strings, Fran?ois could change his program as > follows: > > print unicode("Written by Fran?ois.", "latin-1") > > Assuming that Fran?ois sets his sys.stdout to use Latin-1, while the > Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). > > But when the Japanese user views Fran?ois' source file, he will again > see garbage. If he uses a generic tool to translate latin-1 files to > shift-JIS (assuming shift-JIS has a cedilla character) the program > will no longer work correctly -- the string "latin-1" has to be > changed to "shift-jis". > > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; Fran?ois will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing Fran?ois, he probably won't like this solution very much > (since he didn't like the \347 version either). I think best is to leave it undefined... as with all files, only the programmer knows what format and encoding it contains, e.g. 
a Japanese programmer might want to use a shift-JIS editor to enter strings directly in shift-JIS via u = unicode("...shift-JIS encoded text...","shift-jis") Of course, this is not readable using an ASCII editor, but Python will continue to produce the intended string. NLS strings don't belong into program text anyway: i10n usually takes the gettext() approach to handle these issues. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy at robanal.demon.co.uk Tue Nov 16 01:09:28 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Tue, 16 Nov 1999 00:09:28 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> Message-ID: <3839a078.22625844@post.demon.co.uk> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: >[I'll get back on this tomorrow, just some quick notes here...] >The Codecs provide implementations for encoding and decoding, >they are not intended as complete wrappers for e.g. files or >sockets. > >The unicodec module will define a generic stream wrapper >(which is yet to be defined) for dealing with files, sockets, >etc. It will use the codec registry to do the actual codec >work. > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. > >The Codec interface defines two pairs of methods >on purpose: one which works internally (ie. directly between >strings and Unicode objects), and one which works externally >(directly between a stream and Unicode objects). That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc. So the codec should provide proper stream reading and writing services. Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility. - Andy From tim_one at email.msn.com Tue Nov 16 06:38:32 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:32 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <382AE7D9.147D58CB@lemburg.com> Message-ID: <000001bf2ff4$d36e2540$042d153f@tim> [MAL] > I wonder how we could add %-formatting to Unicode strings without > duplicating the PyString_Format() logic. > > First, do we need Unicode object %-formatting at all ? Sure -- in the end, all the world speaks Unicode natively and encodings become historical baggage. Granted I won't live that long, but I may last long enough to see encodings become almost purely an I/O hassle, with all computation done in Unicode. > Second, here is an emulation using strings and <default encoding> > that should give an idea of one could work with the different > encodings: > > s = '%s %i abc???' # a Latin-1 encoded string > t = (u,3) What's u? A Unicode object? Another Latin-1 string? A default-encoded string? 
How does the following know the difference? > # Convert Latin-1 s to a <default encoding> string via Unicode > s1 = unicode(s,'latin-1').encode() > > # The '%s' will now add u in <default encoding> > s2 = s1 % t > > # Finally, convert the <default encoding> encoded string to Unicode > u1 = unicode(s2) I don't expect this actually works: for example, change %s to %4s. Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to know that some (or all) characters in u consume multiple bytes, so can't extract "the right" number of bytes from u. I think % formating has to know the truth of what you're doing. > Note that .encode() defaults to the current setting of > <default encoding>. > > Provided u maps to Latin-1, an alternative would be: > > u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') More interesting is fmt % tuple where everything is Unicode; people can muck with Latin-1 directly today using regular strings, so the example above mostly shows artificial convolution. From tim_one at email.msn.com Tue Nov 16 06:38:40 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:40 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382BDD81.458D3125@lemburg.com> Message-ID: <000101bf2ff4$d636bb20$042d153f@tim> [MAL, on raw Unicode strings] > ... > Agreed... note that you could also write your own codec for just this > reason and then use: > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > Put that into a function called 'ur' and you have: > > u = ur('...\u4545...\...\...') > > which is not that far away from ur'...' w/r to cosmetics. Well, not quite. In general you need to pass raw strings: u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') ^ u = ur(r'...\u4545...\...\...') ^ else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion). From tim_one at email.msn.com Tue Nov 16 06:44:42 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:44:42 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim> [Tim, wonders why Perl and Tcl went w/ UTF-8 internally] [Greg Stein] > Probably for the exact reason that you stated in your messages: many > 8-bit (7-bit?) functions continue to work quite well when given a > UTF-8-encoded string. i.e. they didn't have to rewrite the entire > Perl/TCL interpreter to deal with a new string type. > > I'd guess it is a helluva lot easier for us to add a Python Type than > for Perl or TCL to whack around with new string types (since they use > strings so heavily). Sounds convincing to me! Bumped into an old thread on c.l.p.m. that suggested Perl was also worried about UCS-2's 64K code point limit. But I'm already on record as predicting we'll regret any decision <wink>. From tim_one at email.msn.com Tue Nov 16 06:52:12 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:52:12 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <000501bf2ff6$ba943a80$042d153f@tim> [Da Silva, Mike] > ... > 5. 
UTF-16 requires string operations that do not make assumptions > about nulls - this means re-implementing most of the C runtime > functions to work with unsigned shorts. Python strings are already null-friendly, so Python has already recoded everything it needs to get away from the no-null assumption; stropmodule.c is < 1,500 lines of code, and MAL can turn it into C++ template functions in his sleep <wink -- but stuff "like this" really is easier in C++>. From tim_one at email.msn.com Tue Nov 16 06:56:18 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:56:18 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com> Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim> [Andy Robinson] > ... > I presume no one is actually advocating dropping > ordinary Python strings, or the ability to do > rawdata = open('myfile.txt', 'rb').read() > without any transformations? If anyone has advocated either, they've successfully hidden it from me. Anyone? From tim_one at email.msn.com Tue Nov 16 07:09:04 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:09:04 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BF6C3.D79840EC@lemburg.com> Message-ID: <000701bf2ff9$15cecda0$042d153f@tim> [MAL] > BTW, wouldn't it be possible to take pcre and have it > use Py_Unicode instead of char ? [Of course, there would have to > be some extensions for character classes etc.] No, alas. The assumption that characters are 8 bits is ubiquitous, in both obvious and subtle ways. if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; From tim_one at email.msn.com Tue Nov 16 07:19:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:19:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> Message-ID: <000801bf2ffa$82273400$042d153f@tim> [MAL] > sys.bom should return the byte order mark (BOM) for the format used > internally. The unicodec module should provide symbols for all > possible values of this variable: > > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) > > BOM_LE: '\377\376' > (corresponds to Unicode 0x0000FFFE in UTF-16 > == illegal Unicode character) > > BOM4_BE: '\000\000\377\376' > (corresponds to Unicode 0x0000FEFF in UCS-4) Should be BOM4_BE: '\000\000\376\377' > BOM4_LE: '\376\377\000\000' > (corresponds to Unicode 0x0000FFFE in UCS-4) Should be BOM4_LE: '\377\376\000\000' From tim_one at email.msn.com Tue Nov 16 07:31:39 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:31:39 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim> [Fred L. Drake, Jr.] > ... > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Yet another use for a weak reference <0.5 wink>. 
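To make Fred's idea (and the memory worry) concrete, here is a Python-level caricature -- purely illustrative, since the real thing would live inside the C implementation of the Unicode object, and string objects cannot currently be the target of a weak reference anyway:

class CachedUnicode:
    # toy sketch, not proposed API
    def __init__(self, raw, encoding='utf-8'):
        self.value = unicode(raw, encoding)
        self._utf8 = None
        if encoding == 'utf-8':
            self._utf8 = raw          # strong reference: keeps 'raw' alive

    def utf8(self):
        if self._utf8 is None:        # re-encode only when we have to
            self._utf8 = self.value.encode('utf-8')
        return self._utf8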
From tim_one at email.msn.com Tue Nov 16 07:41:44 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:41:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim> [MAL] > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) [Greg Stein] > Are you sure about that interpretation? I thought the BOM characters > (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. I can't speak to MAL's degree of certainty <wink>, but he's right about this stuff. There is only one BOM character, U+FEFF, which is the zero-width no-break space. The byte-swapped form is not only reserved, it's guaranteed never to be assigned to a character. From tim_one at email.msn.com Tue Nov 16 08:47:06 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:06 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <000d01bf3006$c7823700$042d153f@tim> [Guido] > ... > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. > ... > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; Fran?ois will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing Fran?ois, he probably won't like this solution very much > (since he didn't like the \347 version either). So long as Python opens source files using libc text mode, it can't guarantee more than C does: the presence of any character other than tab, newline, and ASCII 32-126 inclusive renders the file contents undefined. Go beyond that, and you've got the same problem as mailers and browsers, and so also the same solution: open source files in binary mode, and add a pragma specifying the intended charset. As a practical matter, declare that Python source is Latin-1 for now, and declare any *system* that doesn't support that non-conforming <wink>. python-is-the-measure-of-all-things-ly y'rs - tim From tim_one at email.msn.com Tue Nov 16 08:47:08 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:08 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim> [Guido] >> Does '\u0020' (no u prefix) have a meaning? [MAL] > No, \uXXXX is only defined for u"" strings or strings that are > used to build Unicode objects with this encoding: I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals. > Note that writing \uXX is an error, e.g. u"\u12 " will cause > cause a syntax error. Although I believe your intent <wink> is that, just as today, '\u12' is not an error. > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > but instead '\x10' -- is this intended ? Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two. 
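A quick worked example of the two rules (the asserts hold under the later behaviour, which Python 2.0 standardised to exactly two hex digits):

# 2.0-and-later rule: \x takes exactly two hex digits.
assert '\x2010' == '\x20' + '10'    # three characters: ' ', '1', '0'
assert len('\x2010') == 3
# Under the older rule Tim describes, the same literal collapsed to the
# single character chr(0x10), because all but the last two hex digits
# were consumed and ignored.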
> This [raw Unicode strings] can be had via unicode(): > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > If that's too long, define a ur() function which wraps up the > above line in a function. As before, I think that's fine for now, but won't stand forever. From fredrik at pythonware.com Tue Nov 16 09:39:20 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:39:20 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequeuces. Similar for input stream codecs. note that the html/sgml/xml parsers generally support the feed/close protocol. to be able to use these codecs in that context, we need 1) codes written according to the "data consumer model", instead of the "stream" model. class myDecoder: def __init__(self, target): self.target = target self.state = ... def feed(self, data): ... extract as much data as possible ... self.target.feed(extracted data) def close(self): ... extract what's left ... self.target.feed(additional data) self.target.close() or 2) make threads mandatory, just like in Java. or 3) add light-weight threads (ala stackless python) to the interpreter... (I vote for alternative 3, but that's another story ;-) </F> From fredrik at pythonware.com Tue Nov 16 09:58:50 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:58:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com> Tim Peters <tim_one at email.msn.com> wrote: > (\b is supposed to be a word boundary assertion). in some places, that is. </F> Main Entry: reg?u?lar Pronunciation: 're-gy&-l&r, 're-g(&-)l&r 1 : belonging to a religious order 2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ... 3 a : ORDERLY, METHODICAL <regular habits> ... 4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ... From jack at oratrix.nl Tue Nov 16 12:05:55 1999 From: jack at oratrix.nl (Jack Jansen) Date: Tue, 16 Nov 1999 12:05:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com> Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets (their equivalents of latin-1) too, as documents in these encoding are pretty ubiquitous. But maybe these should only be added on the respective platforms. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Tue Nov 16 09:35:28 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 09:35:28 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <000e01bf3006$c8c11fa0$042d153f@tim> Message-ID: <38311750.22D17EC1@lemburg.com> Tim Peters wrote: > > [Guido] > >> Does '\u0020' (no u prefix) have a meaning? > > [MAL] > > No, \uXXXX is only defined for u"" strings or strings that are > > used to build Unicode objects with this encoding: > > I believe your intent is that '\u0020' be exactly those 6 characters, just > as today. That is, it does have a meaning, but its meaning differs between > Unicode string literals and regular string literals. Right. > > Note that writing \uXX is an error, e.g. u"\u12 " will cause > > cause a syntax error. > > Although I believe your intent <wink> is that, just as today, '\u12' is not > an error. Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an exception. > > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > > but instead '\x10' -- is this intended ? > > Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee > for not defining \x in a platform-independent way. Note that a Python \x > escape consumes *all* following hex characters, no matter how many -- and > ignores all but the last two. Strange definition... > > This [raw Unicode strings] can be had via unicode(): > > > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > > > If that's too long, define a ur() function which wraps up the > > above line in a function. > > As before, I think that's fine for now, but won't stand forever. If Guido agrees to ur"", I can put that into the proposal too -- it's just that things are starting to get a little crowded for a strawman proposal ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:50:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:50:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <383136F7.AB73A90@lemburg.com> Andy Robinson wrote: > > Leave JISXXX and the CJK stuff out. 
If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there > are lots of options about how to do it. The other ones are > algorithmic and can be small and fast and fit into the core. > > Ditto with HTML, and maybe even escaped-unicode too. So I can drop JIS ? [I won't be able to drop the escaped unicode codec because this is needed for u"" and ur"".] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:42:19 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:42:19 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <3831350B.8F69CB6D@lemburg.com> Tim Peters wrote: > > [MAL, on raw Unicode strings] > > ... > > Agreed... note that you could also write your own codec for just this > > reason and then use: > > > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > > > Put that into a function called 'ur' and you have: > > > > u = ur('...\u4545...\...\...') > > > > which is not that far away from ur'...' w/r to cosmetics. > > Well, not quite. In general you need to pass raw strings: > > u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') > ^ > u = ur(r'...\u4545...\...\...') > ^ > > else Python will replace all the other backslash sequences. This is a > crucial distinction at times; e.g., else \b in a Unicode regexp will expand > into a backspace character before the regexp processor ever sees it (\b is > supposed to be a word boundary assertion). Right. Here is a sample implementation of what I had in mind: """ Demo for 'unicode-escape' encoding. """ import struct,string,re pack_format = '>H' def convert_string(s): l = map(None,s) for i in range(len(l)): l[i] = struct.pack(pack_format,ord(l[i])) return l u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') def unicode_unescape(s): l = [] start = 0 while start < len(s): m = u_escape.search(s,start) if not m: l[len(l):] = convert_string(s[start:]) break m_start,m_end = m.span() if m_start > start: l[len(l):] = convert_string(s[start:m_start]) hexcode = m.group(1) #print hexcode,start,m_start if len(hexcode) != 4: raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode ordinal = string.atoi(hexcode,16) l.append(struct.pack(pack_format,ordinal)) start = m_end #print l return string.join(l,'') def hexstr(s,sep=''): return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:40:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:40:42 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? References: <000001bf2ff4$d36e2540$042d153f@tim> Message-ID: <383134AA.4B49D178@lemburg.com> Tim Peters wrote: > > [MAL] > > I wonder how we could add %-formatting to Unicode strings without > > duplicating the PyString_Format() logic. > > > > First, do we need Unicode object %-formatting at all ? > > Sure -- in the end, all the world speaks Unicode natively and encodings > become historical baggage. 
Granted I won't live that long, but I may last > long enough to see encodings become almost purely an I/O hassle, with all > computation done in Unicode. > > > Second, here is an emulation using strings and <default encoding> > > that should give an idea of one could work with the different > > encodings: > > > > s = '%s %i abc???' # a Latin-1 encoded string > > t = (u,3) > > What's u? A Unicode object? Another Latin-1 string? A default-encoded > string? How does the following know the difference? u refers to a Unicode object in the proposal. Sorry, forgot to mention that. > > # Convert Latin-1 s to a <default encoding> string via Unicode > > s1 = unicode(s,'latin-1').encode() > > > > # The '%s' will now add u in <default encoding> > > s2 = s1 % t > > > > # Finally, convert the <default encoding> encoded string to Unicode > > u1 = unicode(s2) > > I don't expect this actually works: for example, change %s to %4s. > Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to > know that some (or all) characters in u consume multiple bytes, so can't > extract "the right" number of bytes from u. I think % formating has to know > the truth of what you're doing. Hmm, guess you're right... format parameters should indeed refer to characters rather than number of encoding bytes. This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects. > > Note that .encode() defaults to the current setting of > > <default encoding>. > > > > Provided u maps to Latin-1, an alternative would be: > > > > u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') > > More interesting is fmt % tuple where everything is Unicode; people can muck > with Latin-1 directly today using regular strings, so the example above > mostly shows artificial convolution. ... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ? E.g. what would you get in these cases: u = u"%s %s" % (u"abc", "abc") Perhaps we need a new marker for "insert Unicode object here". -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:48:13 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:48:13 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk> Message-ID: <3831366D.8A09E194@lemburg.com> Andy Robinson wrote: > > On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: > > >[I'll get back on this tomorrow, just some quick notes here...] > >The Codecs provide implementations for encoding and decoding, > >they are not intended as complete wrappers for e.g. files or > >sockets. > > > >The unicodec module will define a generic stream wrapper > >(which is yet to be defined) for dealing with files, sockets, > >etc. It will use the codec registry to do the actual codec > >work. > > > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > > also assures that <mode> contains the 'b' character when needed. > > > >The Codec interface defines two pairs of methods > >on purpose: one which works internally (ie. 
directly between > >strings and Unicode objects), and one which works externally > >(directly between a stream and Unicode objects). > > That's the problem Guido and I are worried about. Your present API is > not enough to build stream encoders. The 'slurp it into a unicode > string in one go' approach fails for big files or for network > connections. And you just cannot build a generic stream reader/writer > by slicing it into strings. The solution must be specific to the > codec - only it knows how much to buffer, when to flip states etc. > > So the codec should provide proper stream reading and writing > services. I guess I'll have to rethink the Codec specs. Some leads: 1. introduce a new StreamCodec class which is designed for handling stream encoding and decoding (and supports state) 2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode imlementation would then instantiate whenever it needs to apply the conversion; since this is only needed for encodings maintaining state, the registery would only have to do the instantiation for these codecs and could use cached instances for stateless codecs. > Unicodec can then wrap those up in labour-saving ways - I'm not fussy > which but I like the one-line file-open utility. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Tue Nov 16 12:38:31 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 12:38:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> > I would propose to only add some very basic encodings to > the standard distribution, e.g. the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python since this is already very close, maybe we could adopt the naming guidelines from XML: In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names. Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way. (ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...). 
</F> From gstein at lyra.org Tue Nov 16 12:45:48 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911160344500.2535-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fredrik Lundh wrote: >... > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). +1 (as we'd say in Apache-land... :-) -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Tue Nov 16 13:04:47 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <3830595B.348E8CC7@lemburg.com> Message-ID: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > Guido van Rossum wrote: >... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > designed to be passed cleanly through processing steps that handle > > single-byte character data, as long as they are 8-bit clean and don't > > do too much processing. > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > "8-bit clean" as you obviously did. Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8-bits long. i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters. >... > > For example, take an encryption engine. While it is defined in terms > > of byte streams, there's no requirement that the bytes represent > > characters -- they could be the bytes of a GIF file, an MP3 file, or a > > gzipped tar file. If we pass Unicode to an encryption engine, we want > > Unicode to come out at the other end, not UTF-8. (If we had wanted to > > encrypt UTF-8, we should have fed it UTF-8.) Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, following by the raw data. Done. It's even portable. >... > > Aha, I think there's a confusion about what "8-bit" means. For me, a > > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format). > > (As far as I know, C uses char* to represent multibyte characters.) > > Maybe we should disambiguate it more explicitly? We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. 
Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters. > There should be some definition for the two markers and the > ideas behind them in the API guide, I guess. Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ] > > > Hmm, I would strongly object to making "s#" return the internal > > > format. file.write() would then default to writing UTF-16 data > > > instead of UTF-8 data. This could result in strange errors > > > due to the UTF-16 format being endian dependent. > > > > But this was the whole design. file.write() needs to be changed to > > use s# when the file is open in binary mode and t# when the file is > > open in text mode. Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ] >... > I still don't feel very comfortable about the fact that all > existing APIs using "s#" will suddenly receive UTF-16 data if > being passed Unicode objects: this probably won't get us the > "magical" Unicode integration we invision, since "t#" usage is not > very wide spread and character handling code will probably not > work well with UTF-16 encoded strings. I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation. >... > > Because file.write() for a binary file, and other similar things > > (e.g. the encryption engine example I mentioned above) must have > > *some* way to get at the raw bits. > > What for ? How about: "because I'm the application developer, and I say that I want the raw bytes in the file." > Any lossless encoding should do the trick... UTF-8 > is just as good as UTF-16 for binary files; plus it's more compact > for ASCII data. I don't really see a need to get explicitly > at the internal data representation because both encodings are > in fact "internal" w/r to Unicode objects. > > The only argument I can come up with is that using UTF-16 for > binary files could (possibly) eliminate the UTF-8 conversion step > which is otherwise always needed. The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is: open('file', 'w').write(u) If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is: open('file', 'w').write(encode(u, 'utf-8')) This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-) Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic". 
Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Tue Nov 16 13:09:17 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, Guido van Rossum wrote: >... > > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in. Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Tue Nov 16 13:18:19 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > So I can drop JIS ? [I won't be able to drop the > escaped unicode > codec because this is needed for u"" and ur"".] Drop Japanese from the core language. JIS0208 is a big character set with three popular encodings (Shift-JIS, EUC-JP and JIS), and a host of slight variations; it has 6879 characters, and there are a range of options a user might need to set for it to be useful. So let's assume for now this a separate package. There's a good chance I'll do it but it is not a small job. If you start statically linking in tables of 7000 characters for one Asian language, you'll have to do the lot. As for the single-byte Latin ones, a prototype Python module could be whipped up in a couple of evenings, and a tiny C function which does single-byte to double-byte mappings and vice versa could make it fast. We can have an extensible, data driven solution in no time without having to build it into the core. The way I see it, to claim that python has i18n, a serious effort is needed to ensure every major encoding in the world is available to Python users. But that's separate to the core languages. Your spec should only cover what is going to be hard-coded into Python. I'd like to see one paragraph in your spec stating that our architecture seperates the encodings themselves from the core language changes, and that getting them sorted is a logically separate (but important) project. Ideally, we could put together a separate proposal for the encoding library itself and run it by some world class experts in that field, but after yours is done. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From guido at CNRI.Reston.VA.US Tue Nov 16 14:28:42 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:28:42 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100." <383134AA.4B49D178@lemburg.com> References: <000001bf2ff4$d36e2540$042d153f@tim> <383134AA.4B49D178@lemburg.com> Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us> > ... hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? > > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") From guido at CNRI.Reston.VA.US Tue Nov 16 14:45:17 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:45:17 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST." <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us> > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite some places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length because they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters. > > What for ? > > How about: "because I'm the application developer, and I say that I want > the raw bytes in the file." Here I'm with you, man! > Greg Stein, http://www.lyra.org/ --Guido van Rossum (home page: http://www.python.org/~guido/) From gward at cnri.reston.va.us Tue Nov 16 15:10:33 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 09:10:33 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org>; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> Message-ID: <19991116091032.A4063@cnri.reston.va.us> On 16 November 1999, Greg Stein said: > This is the reason Python starts up so slow and has a large memory > footprint. There hasn't been any concern for moving stuff into shared data > pages. As a result, a process must map in a bunch of vmem pages, for no > other reason than to allocate Python structures in that memory and copy > constants in. > > Go start Perl 100 times, then do the same with Python. 
Python is > significantly slower. I've actually written a web app in PHP because > another one that I did in Python had slow response time. > [ yah: the Real Man Answer is to write a real/good mod_python. ] I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter: $ truss perl -e 1 2> perl.log $ truss python -c 1 2> python.log (This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.) The results are interesting, and useful despite the platform and version disparities. (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.) Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them. First, the size of log files (in lines), i.e. number of system calls: Solaris Linux IRIX[1] Perl 88 85 70 Python 425 316 257 [1] after chopping off the summary counts from the "par" output -- ie. these really are the number of system calls, not the number of lines in the log files Next, the number of "open" calls: Solaris Linux IRIX Perl 16 10 9 Python 107 71 48 (It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.) And the number of unsuccessful "open" calls: Solaris Linux IRIX Perl 6 1 3 Python 77 49 32 Number of "mmap" calls: Solaris Linux IRIX Perl 25 25 1 Python 36 24 1 ...nope, guess we can't blame mmap for any Perl/Python startup disparity. How about "brk": Solaris Linux IRIX Perl 6 11 12 Python 47 39 25 ...ok, looks like Greg's gripe about memory holds some water. Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces the startup overhead as measured by "number of system calls". Some quick timing experiments show a drastic speedup (in wall-clock time) by adding "-S": about 37% faster under Solaris, 56% faster under Linux, and 35% under IRIX. These figures should be taken with a large grain of salt, as the Linux and IRIX systems were fairly well loaded at the time, and the wall-clock results I measured had huge variance. Still, it gets the point across. Oh, also for the record, all timings were done like: perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }' because I wanted to guarantee no shell was involved in the Python startup. Greg -- Greg Ward - software developer gward at cnri.reston.va.us Corporation for National Research Initiatives 1895 Preston White Drive voice: +1-703-620-8990 Reston, Virginia, USA 20191-5434 fax: +1-703-620-0913 From mal at lemburg.com Tue Nov 16 12:33:07 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 12:33:07 +0100 Subject: [Python-Dev] Some thoughts on the codecs... 
References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <383140F3.EDDB307A@lemburg.com> Jack Jansen wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these encoding are pretty > ubiquitous. But maybe these should only be added on the respective platforms. Good idea. What code pages would that be ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 15:13:25 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:13:25 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.6 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> Message-ID: <38316685.7977448D@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates many things we have discussed lately, e.g. the buffer interface, "s#" vs. "t#", etc. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? Unicode objects support for %-formatting ? specifying StreamCodecs -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 13:54:51 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 13:54:51 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <3831541B.B242FFA9@lemburg.com> Fredrik Lundh wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. 
the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). >From the proposal: """ General Remarks: ---------------- ? Unicode encoding names should be lower case on output and case-insensitive on input (they will be converted to lower case by all APIs taking an encoding name as input). Encoding names should follow the name conventions as used by the Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'. """ Is there a naming scheme definition for these encoding names? (The quote you gave above doesn't really sound like a definition to me.) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 14:15:19 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:15:19 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> Message-ID: <383158E7.BC574A1F@lemburg.com> Andy Robinson wrote: > > --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > > So I can drop JIS ? [I won't be able to drop the > > escaped unicode > > codec because this is needed for u"" and ur"".] > > Drop Japanese from the core language. Done ... that one was easy ;-) > JIS0208 is a big character set with three popular > encodings (Shift-JIS, EUC-JP and JIS), and a host of > slight variations; it has 6879 characters, and there > are a range of options a user might need to set for it > to be useful. So let's assume for now this a separate > package. There's a good chance I'll do it but it is > not a small job. If you start statically linking in > tables of 7000 characters for one Asian language, > you'll have to do the lot. 
> > As for the single-byte Latin ones, a prototype Python > module could be whipped up in a couple of evenings, > and a tiny C function which does single-byte to > double-byte mappings and vice versa could make it > fast. We can have an extensible, data driven solution > in no time without having to build it into the core. Perhaps these helper function could be intergrated into the core to avoid compilation when adding a new codec. > The way I see it, to claim that python has i18n, a > serious effort is needed to ensure every major > encoding in the world is available to Python users. > But that's separate to the core languages. Your spec > should only cover what is going to be hard-coded into > Python. Right. > I'd like to see one paragraph in your spec stating > that our architecture seperates the encodings > themselves from the core language changes, and that > getting them sorted is a logically separate (but > important) project. Ideally, we could put together a > separate proposal for the encoding library itself and > run it by some world class experts in that field, but > after yours is done. I've added: All other encoding such as the CJK ones to support Asian scripts should be implemented in seperate packages which do not get included in the core Python distribution and are not a part of this proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 14:06:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:06:39 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <383156DF.2209053F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > > Guido van Rossum wrote: > >... > > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > > designed to be passed cleanly through processing steps that handle > > > single-byte character data, as long as they are 8-bit clean and don't > > > do too much processing. > > > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Tue Nov 16 15:35:39 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us> Greg Ward writes: >Next, the number of "open" calls: > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 Running 'python -v' explains this: amarok akuchlin>python -v # /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc # /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py import site # precompiled from /usr/local/lib/python1.5/site.pyc # /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py import os # precompiled from /usr/local/lib/python1.5/os.pyc import posix # builtin # /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc # /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py import stat # precompiled from /usr/local/lib/python1.5/stat.pyc # /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc Python 1.5.2 (#80, May 25 1999, 18:06:07) [GCC 2.8.1] on sunos5 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so And each import tries several different forms of the module name: stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4 I don't see how this is fixable, unless we strip down site.py, which drags in os, which drags in os.path and stat and UserDict. -- A.M. Kuchling http://starship.python.net/crew/amk/ I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but otherwise I'm just peachy. -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks" From guido at CNRI.Reston.VA.US Tue Nov 16 15:43:07 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 09:43:07 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100." <383156DF.2209053F@lemburg.com> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> <383156DF.2209053F@lemburg.com> Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us> > FYI, the next version of the proposal now says "s#" gives you > UTF-16 and "t#" returns UTF-8. File objects opened in text mode > will use "t#" and binary ones use "s#". Good. > I'll just use explicit u.encode('utf-8') calls if I want to write > UTF-8 to binary files -- perhaps everyone else should too ;-) You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just to LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Tue Nov 16 15:50:55 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST) Subject: [Python-Dev] just say no... 
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us> Tim Peters writes: > Yet another use for a weak reference <0.5 wink>. Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Tue Nov 16 16:12:43 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <3830595B.348E8CC7@lemburg.com> <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us> Greg Stein writes: > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ] And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Tue Nov 16 15:36:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:36:54 +0100 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <38316C06.8B0E1D7B@lemburg.com> Greg Ward wrote: > > > Go start Perl 100 times, then do the same with Python. Python is > > significantly slower. I've actually written a web app in PHP because > > another one that I did in Python had slow response time. > > [ yah: the Real Man Answer is to write a real/good mod_python. ] > > I don't think this is the only factor in startup overhead. Try looking > into the number of system calls for the trivial startup case of each > interpreter: > > $ truss perl -e 1 2> perl.log > $ truss python -c 1 2> python.log > > (This is on Solaris; I did the same thing on Linux with "strace", and on > IRIX with "par -s -SS". Dunno about other Unices.) The results are > interesting, and useful despite the platform and version disparities. > > (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on > Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, > using the Official CNRI Python Build by Barry, and the ditto Perl build > by me; the Linux system is starship, using whatever Perl and Python the > Starship Masters provide us with; the IRIX box is an elderly but > well-maintained SGI Challenge running IRIX 5.3.) > > Also, this is with an empty PYTHONPATH. The Solaris build of Python has > different prefix and exec_prefix, but on the Linux and IRIX builds, they > are the same. 
(I think this will reflect poorly on the Solaris > version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect > startup of the trivial "1" script, so I haven't paid attention to them. For kicks I've done a similar test with cgipython, the one file version of Python 1.5.2:

> First, the size of log files (in lines), i.e. number of system calls:
>
>                Solaris     Linux     IRIX[1]
>     Perl        88          85        70
>     Python     425         316       257
      cgipython  182
>
> [1] after chopping off the summary counts from the "par" output -- ie. these really are the number of system calls, not the number of lines in the log files
>
> Next, the number of "open" calls:
>
>                Solaris     Linux     IRIX
>     Perl        16          10         9
>     Python     107          71        48
      cgipython   33
>
> (It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.)
>
> And the number of unsuccessful "open" calls:
>
>                Solaris     Linux     IRIX
>     Perl         6           1         3
>     Python      77          49        32
      cgipython   28

Note that cgipython does search for sitecustomize.py.

> Number of "mmap" calls:
>
>                Solaris     Linux     IRIX
>     Perl        25          25         1
>     Python      36          24         1
      cgipython   13
>
> ...nope, guess we can't blame mmap for any Perl/Python startup disparity.
>
> How about "brk":
>
>                Solaris     Linux     IRIX
>     Perl         6          11        12
>     Python      47          39        25
      cgipython   41 (?)

So at least in theory, using cgipython for the intended purpose should gain some performance. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From mal at lemburg.com Tue Nov 16 17:00:58 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:00:58 +0100 Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <38317FBA.4F3D6B1F@lemburg.com> Here is a new proposal for the codec interface:

class Codec:

    def encode(self, u, slice=None):

        """ Return the Unicode object u encoded as Python string.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is encoded.

            The method may not store state in the Codec instance.
            Use StreamCodec for codecs which have to keep state in
            order to make encoding/decoding efficient.
        """
        ...

    def decode(self, s, slice=None):

        """ Return an equivalent Unicode object for the encoded Python
            string s.

            If slice is given (as slice object), only the sliced part
            of the Python string is decoded and returned as Unicode
            object. Note that this can cause the decoding algorithm
            to fail due to truncations in the encoding.

            The method may not store state in the Codec instance.
            Use StreamCodec for codecs which have to keep state in
            order to make encoding/decoding efficient.
        """
        ...

class StreamCodec(Codec):

    def __init__(self, stream=None, errors='strict'):

        """ Creates a StreamCodec instance.

            stream must be a file-like object open for reading and/or
            writing binary data, depending on the intended codec
            action, or None.

            The StreamCodec may implement different error handling
            schemes by providing the errors argument. The following
            values are known (they need not all be supported by
            StreamCodec subclasses):

              'strict'             - raise a UnicodeError (or a subclass)
              'ignore'             - ignore the character and continue
                                     with the next
              (a single character) - replace erroneous characters with
                                     the given character (may also be
                                     a Unicode character)
        """
        self.stream = stream

    def write(self, u, slice=None):

        """ Writes the Unicode object's contents encoded to
            self.stream.

            stream must be a file-like object open for writing binary
            data.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.
        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def read(self, length=None):

        """ Reads an encoded string from the stream and returns an
            equivalent Unicode object.

            If length is given, only length Unicode characters are
            returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
            available data is read and decoded.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

It is not required by the unicodec.register() API to provide a subclass of these base classes, only the given methods must be present; this allows writing Codecs as extension types. All Codecs must provide the .encode()/.decode() methods. Codecs having the .read() and/or .write() methods are considered to be StreamCodecs.

The Unicode implementation will by itself only use the stateless .encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating the appropriate [Stream]Codec.

-- Feel free to beat on this one ;-)

-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From mal at lemburg.com Tue Nov 16 17:08:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:08:49 +0100 Subject: [Python-Dev] just say no... References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> Message-ID: <38318191.11D93903@lemburg.com> "Fred L. Drake, Jr." wrote: > > Tim Peters writes: > > Yet another use for a weak reference <0.5 wink>. > > Those just keep popping up! I seem to recall Diane Hackborne > actually implemented these under the name "vref" long ago; perhaps > that's worth revisiting after all? (Not the implementation so much as > the idea.) I think to make it general would cost one PyObject* in > each object's structure, and some code in some constructors (maybe), > and all destructors, but not much. > Is this worth pursuing, or is it locked out of the core because of > the added space for the PyObject*? (Note that the concept isn't > necessarily useful for all object types -- numbers in particular -- > but it only makes sense to bother if it works for everything, even if > it's not very useful in some cases.) FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From fdrake at acm.org Tue Nov 16 17:14:06 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <38318191.11D93903@lemburg.com> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> <38318191.11D93903@lemburg.com> Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > FYI, there's mxProxy which implements a flavor of them. Look > in the standard places for mx stuff ;-) Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that).
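For concreteness, the behaviour such a facility would provide looks roughly like this; the sketch uses the weakref interface Python eventually grew, which is not available in the interpreter under discussion here:

    import weakref

    class Node:
        pass

    n = Node()
    r = weakref.ref(n)     # refers to n without raising its refcount
    assert r() is n        # dereference while the object is alive
    del n                  # drop the last strong reference
    assert r() is None     # the weak reference does not keep n alive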
I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From bwarsaw at cnri.reston.va.us Tue Nov 16 17:14:55 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: AMK> I don't see how this is fixable, unless we strip down AMK> site.py, which drags in os, which drags in os.path and stat AMK> and UserDict. One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar). -Barry From guido at CNRI.Reston.VA.US Tue Nov 16 17:20:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:20:28 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100." <38317FBA.4F3D6B1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us> > It is not required by the unicodec.register() API to provide a > subclass of these base class, only the given methods must be present; > this allows writing Codecs as extensions types. All Codecs must > provide the .encode()/.decode() methods. Codecs having the .read() > and/or .write() methods are considered to be StreamCodecs. > > The Unicode implementation will by itself only use the > stateless .encode() and .decode() methods. > > All other conversion have to be done by explicitly instantiating > the appropriate [Stream]Codec. Looks okay, although I'd like someone to implement a simple shift-state-based stream codec to check this out further. I have some questions about the constructor. You seem to imply that instantiating the class without arguments creates a codec without state. That's fine. When given a stream argument, shouldn't the direction of the stream be given as an additional argument, so the proper state for encoding or decoding can be set up? I can see that for an implementation it might be more convenient to have separate classes for encoders and decoders -- certainly the state being kept is very different. Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. 
This has some advantages (although we might want to define some compatibility so it can also feed directly into a file). Perhaps someone should go ahead and implement prototype codecs using either paradigm and then write some simple apps, so we can make a better decision. In any case I think the specs codec registry API aren't on the critical path, integration of /F's basic unicode object is the first thing we need. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Tue Nov 16 17:27:53 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:27:53 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST." <14385.33535.23316.286575@anthem.cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us> > >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: > > AMK> I don't see how this is fixable, unless we strip down > AMK> site.py, which drags in os, which drags in os.path and stat > AMK> and UserDict. > > One approach might be to support loading modules out of jar files (or > whatever) using Greg imputils. We could put the bootstrap .pyc files > in this jar and teach Python to import from it first. Python > installations could even craft their own modules.jar file to include > whatever modules they are willing to "hard code". This, with -S might > make Python start up much faster, at the small cost of some > flexibility (which could be regained with a c.l. switch or other > mechanism to bypass modules.jar). A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference. The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users... But if those can be solved, this could greatly reduce the number of stats and unsuccessful opens, without having to resort to jar files. --Guido van Rossum (home page: http://www.python.org/~guido/) From gmcm at hypernet.com Tue Nov 16 17:56:19 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 11:56:19 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <1269351119-9152905@hypernet.com> Barry A. Warsaw writes: > One approach might be to support loading modules out of jar files > (or whatever) using Greg imputils. We could put the bootstrap > .pyc files in this jar and teach Python to import from it first. > Python installations could even craft their own modules.jar file > to include whatever modules they are willing to "hard code". > This, with -S might make Python start up much faster, at the > small cost of some flexibility (which could be regained with a > c.l. switch or other mechanism to bypass modules.jar). Couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix, (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over. but-since-its-WIndows-it-must-be-tainted-ly y'rs - Gordon From guido at CNRI.Reston.VA.US Tue Nov 16 18:00:15 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 12:00:15 -0500 Subject: [Python-Dev] Python 1.6 status Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us> Greg Stein recently reminded me that he was holding off on 1.6 patches because he was under the impression that I wasn't accepting them yet. The situation is rather more complicated than that. There are a great deal of things that need to be done, and for many of them I'd be most happy to receive patches! For other things, however, I'm still in the requirements analysis phase, and patches might be premature (e.g., I want to redesign the import mechanisms, and while I like some of the prototypes that have been posted, I'm not ready to commit to any specific implementation). How do you know for which things I'm ready for patches? Ask me. I've tried to make lists before, and there are probably some hints in the TODO FAQ wizard as well as in the "requests" section of the Python Bugs List. Greg also suggested that I might receive more patches if I opened up the CVS tree for checkins by certain valued contributors. On the one hand I'm reluctant to do that (I feel I have a pretty good track record of checking in patches that are mailed to me, assuming I agree with them) but on the other hand there might be something to say for this, because it gives contributors more of a sense of belonging to the inner core. Of course, checkin privileges don't mean you can check in anything you like -- as in the Apache world, changes must be discussed and approved by the group, and I would like to have a veto. However once a change is approved, it's much easier if the contributor can check the code in without having to go through me all the time. A drawback may be that some people will make very forceful requests to be given checkin privileges, only to never use them; just like there are some members of python-dev who have never contributed. I definitely want to limit the number of privileged contributors to a very small number (e.g. 10-15). One additional detail is the legal side -- contributors will have to sign some kind of legal document similar to the current (wetsign.html) release form, but guiding all future contributions. I'll have to discuss this with CNRI's legal team. Greg, I understand you have checkin privileges for Apache. What is the procedure there for handing out those privileges? What is the procedure for using them? (E.g. if you made a bogus change to part of Apache you're not supposed to work on, what happens?) 
I'm hoping for several kind of responses to this email: - uncontroversial patches - questions about whether specific issues are sufficiently settled to start coding a patch - discussion threads opening up some issues that haven't been settled yet (like the current, very productive, thread in i18n) - posts summarizing issues that were settled long ago in the past, requesting reverification that the issue is still settled - suggestions for new issues that maybe ought to be settled in 1.6 - requests for checkin privileges, preferably with a specific issue or area of expertise for which the requestor will take responsibility --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 16 18:11:48 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST) Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us> Guido van Rossum writes: >I'm hoping for several kind of responses to this email: My list of things to do for 1.6 is: * Translate re.py to C and switch to the latest PCRE 2 codebase (mostly done, perhaps ready for public review in a week or so). * Go through the O'Reilly POSIX book and draw up a list of missing POSIX functions that aren't available in the posix module. This was sparked by Greg Ward showing me a Perl daemonize() function he'd written, and I realized that some of the functions it used weren't available in Python at all. (setsid() was one of them, I think.) * A while back I got approval to add the mmapfile module to the core. The outstanding issue there is that the constructor has a different interface on Unix and Windows platforms. On Windows: mm = mmapfile.mmapfile("filename", "tag name", <mapsize>) On Unix, it looks like the mmap() function: mm = mmapfile.mmapfile(<filedesc>, <mapsize>, <flags> (like MAP_SHARED), <prot> (like PROT_READ, PROT_READWRITE) ) Can we reconcile these interfaces, have two different function names, or what? >- suggestions for new issues that maybe ought to be settled in 1.6 Perhaps we should figure out what new capabilities, if any, should be added in 1.6. Fred has mentioned weak references, and there are other possibilities such as ExtensionClass. -- A.M. Kuchling http://starship.python.net/crew/amk/ Society, my dear, is like salt water, good to swim in but hard to swallow. -- Arthur Stringer, _The Silver Poppy_ From beazley at cs.uchicago.edu Tue Nov 16 18:24:24 1999 From: beazley at cs.uchicago.edu (David Beazley) Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST) Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu> Andrew M. Kuchling writes: > Guido van Rossum writes: > >I'm hoping for several kind of responses to this email: > > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) > I second this! This was one of the things I noticed when doing the Essential Reference Book. 
Assuming no one has done it already, I wouldn't mind volunteering to take a crack at it. Cheers, Dave From fdrake at acm.org Tue Nov 16 18:25:02 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us> Guido van Rossum writes: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). I think one or the other can be used, and then a wrapper that converts to the other interface. Perhaps the encoders should provide feed(), and a file-like wrapper can convert write() to feed(). It could also be done the other way; I'm not sure if it matters which is "normal." (Or perhaps feed() was badly named and should be write()? The general intent was a little different, I think, but an output file is very much a stream consumer.) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From akuchlin at mems-exchange.org Tue Nov 16 18:32:41 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST) Subject: [Python-Dev] mmapfile module In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us> Guido van Rossum writes: >Hm, this seems to require a higher-level Python module to hide the >differences. Maybe the Unix version could also use a filename? I >would think that mmap'ed files should always be backed by a file (not >by a pipe, socket etc.). Or is there an issue with secure creation of >temp files? This is a question for a separate thread. Hmm... I don't know of any way to use mmap() on non-file things, either; there are odd special cases, like using MAP_ANONYMOUS on /dev/zero to allocate memory, but that's still using a file. On the other hand, there may be some special case where you need to do that. We could add a fileno() method to get the file descriptor, but I don't know if that's useful to Windows. (Is Sam Rushing, the original author of the Win32 mmapfile, on this list?) What do we do about the tagname, which is a Win32 argument that has no Unix counterpart -- I'm not even sure what its function is. -- A.M. Kuchling http://starship.python.net/crew/amk/ I had it in me to be the Pierce Brosnan of my generation. -- Vincent Me's past career plans in EGYPT #1 From mal at lemburg.com Tue Nov 16 18:53:46 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 18:53:46 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <38319A2A.4385D2E7@lemburg.com> Guido van Rossum wrote: > > > It is not required by the unicodec.register() API to provide a > > subclass of these base class, only the given methods must be present; > > this allows writing Codecs as extensions types. All Codecs must > > provide the .encode()/.decode() methods. 
Codecs having the .read() > > and/or .write() methods are considered to be StreamCodecs. > > > > The Unicode implementation will by itself only use the > > stateless .encode() and .decode() methods. > > > > All other conversion have to be done by explicitly instantiating > > the appropriate [Stream]Codec. > > Looks okay, although I'd like someone to implement a simple > shift-state-based stream codec to check this out further. > > I have some questions about the constructor. You seem to imply > that instantiating the class without arguments creates a codec without > state. That's fine. When given a stream argument, shouldn't the > direction of the stream be given as an additional argument, so the > proper state for encoding or decoding can be set up? I can see that > for an implementation it might be more convenient to have separate > classes for encoders and decoders -- certainly the state being kept is > very different. Wouldn't it be possible to have the read/write methods set up the state when called for the first time ? Note that I wrote ".read() and/or .write() methods" in the proposal on purpose: you can of course implement Codecs which only implement one of them, i.e. Readers and Writers. The registry doesn't care about them anyway :-) Then, if you use a Reader for writing, it will result in an AttributeError... > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). AFAIK, .feed() and .finalize() (or .close() etc.) have a different backgound: you add data in chunks and then process it at some final stage rather than for each feed. This is often more efficient. With respest to codecs this would mean, that you buffer the output in memory, first doing only preliminary operations on the feeds and then apply some final logic to the buffer at the time .finalize() is called. We could define a StreamCodec subclass for this kind of operation. > Perhaps someone should go ahead and implement prototype codecs using > either paradigm and then write some simple apps, so we can make a > better decision. > > In any case I think the specs codec registry API aren't on the > critical path, integration of /F's basic unicode object is the first > thing we need. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward at cnri.reston.va.us Tue Nov 16 18:54:06 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 12:54:06 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> Message-ID: <19991116125405.B4063@cnri.reston.va.us> On 16 November 1999, Guido van Rossum said: > A completely different approach (which, incidentally, HP has lobbied > for before; and which has been implemented by Sjoerd Mullender for one > particular application) would be to cache a mapping from module names > to filenames in a dbm file. For Sjoerd's app (which imported hundreds > of modules) this made a huge difference. Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening? > The problem is that it's > hard to deal with issues like updating the cache while sharing it with > other processes and even other users... Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm... Greg From petrilli at amber.org Tue Nov 16 19:04:46 1999 From: petrilli at amber.org (Christopher Petrilli) Date: Tue, 16 Nov 1999 13:04:46 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us> Message-ID: <19991116130446.A3068@trump.amber.org> Greg Ward [gward at cnri.reston.va.us] wrote: > On 16 November 1999, Guido van Rossum said: > > A completely different approach (which, incidentally, HP has lobbied > > for before; and which has been implemented by Sjoerd Mullender for one > > particular application) would be to cache a mapping from module names > > to filenames in a dbm file. For Sjoerd's app (which imported hundreds > > of modules) this made a huge difference. > > Hey, this could be a big win for Zope startup. Dunno how much of that > 20-30 sec startup overhead is due to loading modules, but I'm sure it's > a sizeable percentage. Any Zope-heads listening? Wow, that's a huge start up that I've personally never seen. I can't imagine... even loading the Oracle libraries dynamically, which are HUGE (2Mb or so), it's only a couple seconds. > > The problem is that it's > > hard to deal with issues like updating the cache while sharing it with > > other processes and even other users... > > Probably not a concern in the case of Zope: one installation, one > process, only gets started when it's explicitly shut down and > restarted. HmmmMMMMmmm... This doesn't reslve a lot of other users of Python howver... and Zope would always benefit, especially when you're running multiple instances on th same machine... would perhaps share more code. 
Chris -- | Christopher Petrilli | petrilli at amber.org From gmcm at hypernet.com Tue Nov 16 19:04:41 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 13:04:41 -0500 Subject: [Python-Dev] mmapfile module In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us> References: <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <1269347016-9399681@hypernet.com> Andrew M. Kuchling wrote: > Hmm... I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. On > the other hand, there may be some special case where you need to > do that. We could add a fileno() method to get the file > descriptor, but I don't know if that's useful to Windows. (Is > Sam Rushing, the original author of the Win32 mmapfile, on this > list?) > > What do we do about the tagname, which is a Win32 argument that > has no Unix counterpart -- I'm not even sure what its function > is. On Windows, a mmap is always backed by disk (swap space), but is not necessarily associated with a (user-land) file. The tagname is like the "name" associated with a semaphore; two processes opening the same tagname get shared memory. Fileno (in the c runtime sense) would be useless on Windows. As with all Win32 resources, there's a "handle", which is analagous. But different enough, it seems to me, to confound any attempts at a common API. Another fundamental difference (IIRC) is that Windows mmap's can be resized on the fly. - Gordon From guido at CNRI.Reston.VA.US Tue Nov 16 19:09:43 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 13:09:43 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100." <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us> > > I have some questions about the constructor. You seem to imply > > that instantiating the class without arguments creates a codec without > > state. That's fine. When given a stream argument, shouldn't the > > direction of the stream be given as an additional argument, so the > > proper state for encoding or decoding can be set up? I can see that > > for an implementation it might be more convenient to have separate > > classes for encoders and decoders -- certainly the state being kept is > > very different. > > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? Hm, I'd rather be explicit. We don't do this for files either. > Note that I wrote ".read() and/or .write() methods" in the proposal > on purpose: you can of course implement Codecs which only implement > one of them, i.e. Readers and Writers. The registry doesn't care > about them anyway :-) > > Then, if you use a Reader for writing, it will result in an > AttributeError... > > > Also, I don't want to ignore the alternative interface that was > > suggested by /F. It uses feed() similar to htmllib c.s. This has > > some advantages (although we might want to define some compatibility > > so it can also feed directly into a file). > > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more > efficient. 
> > With respect to codecs this would mean, that you buffer the output in memory, first doing only preliminary operations on the feeds and then apply some final logic to the buffer at the time .finalize() is called. This is part of the purpose, yes. > We could define a StreamCodec subclass for this kind of operation. The difference is that to decode from a file, your proposed interface is to call read() on the codec which will in turn call read() on the stream. In /F's version, I call read() on the stream (getting multibyte encoded data), feed() that to the codec, which in turn calls feed() to some other back end -- perhaps another codec which in turn feed()s its converted data to another file, perhaps an XML parser. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Tue Nov 16 19:16:42 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? That slows things down; the constructor should handle initialization. Perhaps what gets registered should be: encoding function, decoding function, stream encoder factory (can be a class), stream decoder factory (again, can be a class). These can be encapsulated either before or after hitting the registry, and can be None. The registry can provide default implementations from what is provided (stream handlers from the functions, or functions from the stream handlers) as required. Ideally, I should be able to write a module with four well-known entry points and then provide the module object itself as the registration entry. Or I could construct a new object that has the right interface and register that if it made more sense for the encoding. > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more Many of the classes that provide feed() do as much work as possible as data is fed into them (see htmllib.HTMLParser); this structure is commonly used to support asynchronous operation. > With respect to codecs this would mean, that you buffer the > output in memory, first doing only preliminary operations on > the feeds and then apply some final logic to the buffer at > the time .finalize() is called. That depends on the encoding. I'd expect it to feed encoded data to a sink as quickly as it could and let the target decide what needs to happen. If buffering is needed, the target could be a StringIO or whatever. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fredrik at pythonware.com Tue Nov 16 20:32:21 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:32:21 +0100 Subject: [Python-Dev] mmapfile module References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us> Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com> > Hmm...
I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. but that's not always the case -- OSF/1 supports truly anonymous mappings, for example. in fact, it bombs if you use ANONYMOUS with a file handle: $ man mmap ... If MAP_ANONYMOUS is set in the flags parameter: + A new memory region is created and initialized to all zeros. This memory region can be shared only with descendents of the current pro- cess. + If the filedes parameter is not -1, the mmap() function fails. ... (btw, doing anonymous maps isn't exactly an odd special case under this operating system; it's the only memory- allocation mechanism provided by the kernel...) </F> From fredrik at pythonware.com Tue Nov 16 20:33:52 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:33:52 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some > compatibility so it can also feed directly into a file). seeing this made me switch on my brain for a moment, and recall how things are done in PIL (which is, as I've bragged about before, another library with an internal format, and many possible external encodings). among other things, PIL lets you read and write images to both ordinary files and arbitrary file objects, but it also lets you incrementally decode images by feeding it chunks of data (through ImageFile.Parser). and it's fast -- it has to be, since images tends to contain lots of pixels... anyway, here's what I came up with (code will follow, if someone's interested). -------------------------------------------------------------------- A PIL-like Unicode Codec Proposal -------------------------------------------------------------------- In the PIL model, the codecs are called with a piece of data, and returns the result to the caller. The codecs maintain internal state when needed. class decoder: def decode(self, s, offset=0): # decode as much data as we possibly can from the # given string. if there's not enough data in the # input string to form a full character, return # what we've got this far (this might be an empty # string). def flush(self): # flush the decoding buffers. this should usually # return None, unless the fact that knowing that the # input stream has ended means that the state can be # interpreted in a meaningful way. however, if the # state indicates that there last character was not # finished, this method should raise a UnicodeError # exception. class encoder: def encode(self, u, offset=0, buffersize=0): # encode data from the given offset in the input # unicode string into a buffer of the given size # (or slightly larger, if required to proceed). # if the buffer size is 0, the decoder is free # to pick a suitable size itself (if at all # possible, it should make it large enough to # encode the entire input string). returns a # 2-tuple containing the encoded data, and the # number of characters consumed by this call. def flush(self): # flush the encoding buffers. returns an ordinary # string (which may be empty), or None. 
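(As an illustration of the kind of state this interface allows a codec to keep -- this example is not part of the proposal itself -- a toy decoder that turns pairs of hex digits into characters, buffering an odd trailing digit between calls, might look like this:)

    import string

    class hex_decoder:
        # toy codec: two hex digits per character; an odd trailing
        # digit is kept in self.buffer until more data arrives
        def __init__(self):
            self.buffer = ""
        def decode(self, s, offset=0):
            data = self.buffer + s[offset:]
            chars = []
            i = 0
            while i + 1 < len(data):
                chars.append(chr(string.atoi(data[i:i+2], 16)))
                i = i + 2
            self.buffer = data[i:]            # zero or one pending digit
            return string.join(chars, "")
        def flush(self):
            if self.buffer:
                raise UnicodeError("last character not complete")

    # d = hex_decoder()
    # d.decode("414")   returns "A" (one digit left in the buffer)
    # d.decode("2")     returns "B"
    # d.flush()         returns None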
Note that a codec instance can be used for a single string; the codec registry should hold codec factories, not codec instances. In addition, you may use a single type or class to implement both interfaces at once. -------------------------------------------------------------------- Use Cases -------------------------------------------------------------------- A null decoder: class decoder: def decode(self, s, offset=0): return s[offset:] def flush(self): pass A null encoder: class encoder: def encode(self, s, offset=0, buffersize=0): if buffersize: s = s[offset:offset+buffersize] else: s = s[offset:] return s, len(s) def flush(self): pass Decoding a string: def decode(s, encoding) c = registry.getdecoder(encoding) u = c.decode(s) t = c.flush() if not t: return u return u + t # not very common Encoding a string: def encode(u, encoding) c = registry.getencoder(encoding) p = [] o = 0 while o < len(u): s, n = c.encode(u, o) p.append(s) o = o + n if len(p) == 1: return p[0] return string.join(p, "") # not very common Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example). --- end of proposal From fredrik at pythonware.com Tue Nov 16 20:37:40 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:37:40 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com> > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) $ python Python 1.5.2 (#1, Aug 23 1999, 14:42:39) [GCC 2.7.2.3] on linux2 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> import os >>> os.setsid <built-in function setsid> </F> From mhammond at skippinet.com.au Tue Nov 16 22:54:15 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 08:54:15 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat> [Andy writes:] > Leave JISXXX and the CJK stuff out. If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there [Then Marc relpies:] > 2. give more information to the unicodec registry: > one could register classes instead of instances which the Unicode [Jack chimes in with:] > I would suggest adding the Dos, Windows and Macintosh > standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these > encoding are pretty > ubiquitous. But maybe these should only be added on the > respective platforms. [And the conversation twisted around to Greg noting:] > Next, the number of "open" calls: > > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 This is leading me to conclude that our "codec registry" should be the file system, and Python modules. Would it be possible to define a "standard package" called "encodings", and when we need an encoding, we simply attempt to load a module from that package? The key benefits I see are: * No need to load modules simply to register a codec (which would make the number of open calls even higher, and the startup time even slower.) 
This makes it truly demand-loading of the codecs, rather than explicit load-and-register. * Making language specific distributions becomes simple - simply select a different set of modules from the "encodings" directory. The Python source distribution has them all, but (say) the Windows binary installer selects only a few. The Japanese binary installer for Windows installs a few more. * Installing new codecs becomes trivial - no need to hack site.py etc - simply copy the new "codec module" to the encodings directory and you are done. * No serious problem for GMcM's installer nor for freeze We would probably need to assume that certain codes exist for _all_ platforms and language - but this is no different to assuming that "exceptions.py" also exists for all platforms. Is this worthy of consideration? Mark. From andy at robanal.demon.co.uk Wed Nov 17 01:14:06 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:14:06 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Message-ID: <3836f28c.4929177@post.demon.co.uk> On Tue, 16 Nov 1999 09:39:20 +0100, you wrote: >1) codes written according to the "data > consumer model", instead of the "stream" > model. > > class myDecoder: > def __init__(self, target): > self.target = target > self.state = ... > def feed(self, data): > ... extract as much data as possible ... > self.target.feed(extracted data) > def close(self): > ... extract what's left ... > self.target.feed(additional data) > self.target.close() > Apart from feed() instead of write(), how is that different from a Java-like Stream writer as Guido suggested? He said: >Andy's file translation example could then be written as follows: > ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(f) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > >f2.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) - Andy From gstein at lyra.org Wed Nov 17 03:03:21 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST) Subject: [Python-Dev] shared data In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <Pine.LNX.4.10.9911161756290.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Gordon McMillan wrote: > Barry A. Warsaw writes: > > One approach might be to support loading modules out of jar files > > (or whatever) using Greg imputils. We could put the bootstrap > > .pyc files in this jar and teach Python to import from it first. > > Python installations could even craft their own modules.jar file > > to include whatever modules they are willing to "hard code". > > This, with -S might make Python start up much faster, at the > > small cost of some flexibility (which could be regained with a > > c.l. switch or other mechanism to bypass modules.jar). 
> > Couple hundred Windows users have been doing this for > months (http://starship.python.net/crew/gmcm/install.html). > The .pyz files are cross-platform, although the "embedding" > app would have to be redone for *nix, (and all the embedding > really does is keep Python from hunting all over your disk). > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. I've got a patch from Jim Ahlstrom to provide a "standardized" library file. I've got to review and fold that thing in (I'll post here when that is done). As Gordon states: yes, the startup time is considerably improved. The DBM approach is interesting. That could definitely be used thru an imputils Importer; it would be quite interesting to try that out. (Note that the library style approach would be even harder to deal with updates, relative to what Sjoerd saw with the DBM approach; I would guess that the "right" approach is to rebuild the library from scratch and atomically replace the thing (but that would bust people with open references...)) Certainly something to look at. Cheers, -g p.s. I also want to try mmap'ing a library and creating code objects that use PyBufferObjects (rather than PyStringObjects) that refer to portions of the mmap. Presuming the mmap is shared, there "should" be a large reduction in heap usage. Question is that I don't know the proportion of code bytes to other heap usage caused by loading a .pyc. p.p.s. I also want to try the buffer approach for frozen code. -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 03:29:42 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911161821230.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). These can be encapsulated either > before or after hitting the registry, and can be None. The registry I'm with Fred here; he beat me to the punch (and his email is better than what I'd write anyhow :-). I'd like to see the API be *functions* rather than a particular class specification. If the spec is going to say "do not alter/store state", then a function makes much more sense than a method on an object. Of course, bound method objects could be registered. This might occur if you have a general JIS encode/decoder but need to instantiate it a little differently for each JIS variant. (Andy also mentioned something about "options" in JIS encoding/decoding) > and provide default implementations from what is provided (stream > handlers from the functions, or functions from the stream handlers) as > required. Excellent idea... "I'll provide the encode/decode functions, but I don't have a spiffy algorithm for streaming -- please provide a stream wrapper for my functions." > Ideally, I should be able to write a module with four well-known > entry points and then provide the module object itself as the > registration entry. Or I could construct a new object that has the > right interface and register that if it made more sense for the > encoding. 
Mark's idea about throwing these things into a package for on-demand registrations is much better than a "register-beforehand" model. When the module is loaded from the package, it calls a registration function to insert its 4-tuple of registration data. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 03:40:07 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Mark Hammond wrote: >... > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: >... > Is this worthy of consideration? Absolutely! You will need to provide a way for a module (in the "codec" package) to state *beforehand* that it should be loaded for the X, Y, and Z encodings. This might be in terms of little "info" files that get dropped into the package. The __init__.py module scans the directory for the info files and loads them to build an encoding => module-name mapping. The alternative would be to have stub modules like: iso-8859-1.py: import unicodec def encode_1(...) ... def encode_2(...) ... ... unicodec.register('iso-8859-1', encode_1, decode_1) unicodec.register('iso-8859-2', encode_2, decode_2) ... iso-8859-2.py: import iso-8859-1 I believe that encoding names are legitimate file names, but they aren't necessarily Python identifiers. That kind of bungs up "import codec.iso-8859-1". The codec package would need to programmatically import the modules. Clients should not be directly importing the modules, so I don't see a difficult here. [ if we do decide to allow clients access to the modules, then maybe they have to arrive through a "helper" module that has a nice name, or the codec package provides a "module = code.load('iso-8859-1')" idiom. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Wed Nov 17 03:57:48 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 13:57:48 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat> > You will need to provide a way for a module (in the "codec" > package) to > state *beforehand* that it should be loaded for the X, Y, and ... > The alternative would be to have stub modules like: Actually, I was thinking even more radically - drop the codec registry all together, and use modules with "well-known" names (a slight precedent, but Python isnt adverse to well-known names in general) eg: iso-8859-1.py: import unicodec def encode(...): ... def decode(...): ... iso-8859-2.py: from iso-8859-1 import * The codec registry then is trivial, and effectively does not exist (cant get much more trivial than something that doesnt exist :-): def getencoder(encoding): mod = __import__( "encodings." + encoding ) return getattr(mod, "encode") > I believe that encoding names are legitimate file names, but > they aren't > necessarily Python identifiers. That kind of bungs up "import > codec.iso-8859-1". 
Agreed - clients should never need to import them, and codecs that wish to import other codes could use "__import__" Of course, I am not adverse to the idea of a registry as well and having the modules manually register themselves - but it doesnt seem to buy much, and the logic for getting a codec becomes more complex - ie, it needs to determine the module to import, then look in the registry - if it needs to determine the module anyway, why not just get it from the module and be done with it? Mark. From andy at robanal.demon.co.uk Wed Nov 17 01:18:22 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:18:22 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <3837f379.5166829@post.demon.co.uk> On Wed, 17 Nov 1999 08:54:15 +1100, you wrote: >This is leading me to conclude that our "codec registry" should be the >file system, and Python modules. > >Would it be possible to define a "standard package" called >"encodings", and when we need an encoding, we simply attempt to load a >module from that package? The key benefits I see are: [snip] >Is this worthy of consideration? Exactly what I am aiming for. The real icing on the cake would be a small state machine or some helper functions in C which made it possible to write fast codecs in pure Python, but that can come a bit later when we have examples up and running. - Andy From andy at robanal.demon.co.uk Wed Nov 17 01:08:01 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:08:01 GMT Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim> References: <000601bf2ff7$4d8a4c80$042d153f@tim> Message-ID: <3834f142.4599884@post.demon.co.uk> On Tue, 16 Nov 1999 00:56:18 -0500, you wrote: >[Andy Robinson] >> ... >> I presume no one is actually advocating dropping >> ordinary Python strings, or the ability to do >> rawdata = open('myfile.txt', 'rb').read() >> without any transformations? > >If anyone has advocated either, they've successfully hidden it from me. >Anyone? Well, I hear statements looking forward to when all string-handling is done in Unicode internally. This scares the hell out of me - it is what VB does and that bit us badly on simple stream operations. For encoding work, you will always need raw strings, and often need Unicode ones. - Andy From tim_one at email.msn.com Wed Nov 17 08:33:06 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 02:33:06 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <383134AA.4B49D178@lemburg.com> Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim> [MAL] > ... > This means a new PyUnicode_Format() implementation mapping > Unicode format objects to Unicode objects. It's a bitch, isn't it <0.5 wink>? I hope they're paying you a lot for this! > ... hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? Anything other than taking the Unicode characters as-is would be incomprehensible. I mean, it's a Unicode format string sucking up Unicode strings -- what else could possibly make *sense*? > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") That u"abc" gets substituted as-is seems screamingly necessary to me. I'm more baffled about what "abc" should do. I didn't understand the t#/s# etc arguments, and how those do or don't relate to what str() does. 
On the face of it, the idea that a gazillion and one distinct encodings all get lumped into "a string object" without remembering their nature makes about as much sense as if Python were to treat all instances of all user-defined classes as being of a single InstanceType type <wink> -- except in the latter case you at least get a __class__ attribute to find your way home again. As an ignorant user, I would hope that u"%s" % string had enough sense to know what string's encoding is all on its own, and promote it correctly to Unicode by magic. > Perhaps we need a new marker for "insert Unicode object here". %s means string, and at this level a Unicode object *is* "a string". If this isn't obvious, it's likely because we're too clever about what non-Unicode string objects do in this context. From captainrobbo at yahoo.com Wed Nov 17 08:53:53 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> --- Mark Hammond <mhammond at skippinet.com.au> wrote: > Actually, I was thinking even more radically - drop > the codec registry > all together, and use modules with "well-known" > names (a slight > precedent, but Python isnt adverse to well-known > names in general) > > eg: > iso-8859-1.py: > > import unicodec > def encode(...): > ... > def decode(...): > ... > > iso-8859-2.py: > from iso-8859-1 import * > This is the simplest if each codec really is likely to be implemented in a separate module. But just look at the data! All the iso-8859 encodings need identical functionality, and just have a different mapping table with 256 elements. It would be trivial to implement these in one module. And the wide variety of Japanese encodings (mostly corporate or historical variants of the same character set) are again best treated from one code base with a bunch of mapping tables and routines to generate the variants - basically one can store the deltas. So the choice is between possibly having a lot of almost-dummy modules, or having Python modules which generate and register a logical family of encodings. I may have some time next week and will try to code up a few so we can pound on something. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Wed Nov 17 08:58:23 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST) Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com> --- Tim Peters <tim_one at email.msn.com> wrote: > I'm more baffled about what "abc" should do. I > didn't understand the t#/s# > etc arguments, and how those do or don't relate to > what str() does. On the > face of it, the idea that a gazillion and one > distinct encodings all get > lumped into "a string object" without remembering > their nature makes about > as much sense as if Python were to treat all > instances of all user-defined > classes as being of a single InstanceType type > <wink> -- except in the > latter case you at least get a __class__ attribute > to find your way home > again. Well said. 
When the core stuff is done, I'm going to implement a set of "TypedString" helper routines which will remember what they are encoded in and won't let you abuse them by concatenating or otherwise mixing different encodings. If you are consciously working with multi-encoding data, this higher level of abstraction is really useful. But I reckon that can be done in pure Python (just overload '%;, '+' etc. with some encoding checks). - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Wed Nov 17 11:03:59 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:03:59 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf30d3$cb2cb240$a42d153f@tim> Message-ID: <38327D8F.7A5352E6@lemburg.com> Tim Peters wrote: > > [MAL] > > ...demo script... > > It looks like > > r'\\u0000' > > will get translated into a 2-character Unicode string. Right... > That's probably not > good, if for no other reason than that Java would not do this (it would > create the obvious 7-character Unicode string), and having something that > looks like a Java escape that doesn't *work* like the Java escape will be > confusing as heck for JPython users. Keeping track of even-vs-odd number of > backslashes can't be done with a regexp search, but is easy if the code is > simple <wink>: > ...Tim's version of the demo... Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme: With the 'unicode-escape' encoding being defined as: ? all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061). ? all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF. ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u. Examples: u'abc' -> U+0061 U+0062 U+0063 u'\u1234' -> U+1234 u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c Now how should we define ur"abc\u1234\n" ... ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one at email.msn.com Wed Nov 17 10:31:27 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:31:27 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <000801bf30de$85bea500$a42d153f@tim> [Guido] > ... > I'm hoping for several kind of responses to this email: > ... > - requests for checkin privileges, preferably with a specific issue > or area of expertise for which the requestor will take responsibility. I'm specifically requesting not to have checkin privileges. So there. I see two problems: 1. When patches go thru you, you at least eyeball them. This catches bugs and design errors early. 2. For a multi-platform app, few people have adequate resources for testing; e.g., I can test under an obsolete version of Win95, and NT if I have to, but that's it. 
You may not actually do better testing than that, but having patches go thru you allows me the comfort of believing you do <wink>. From mal at lemburg.com Wed Nov 17 11:11:05 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:11:05 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <38327F39.AA381647@lemburg.com> Mark Hammond wrote: > > This is leading me to conclude that our "codec registry" should be the > file system, and Python modules. > > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: > > * No need to load modules simply to register a codec (which would make > the number of open calls even higher, and the startup time even > slower.) This makes it truly demand-loading of the codecs, rather > than explicit load-and-register. > > * Making language specific distributions becomes simple - simply > select a different set of modules from the "encodings" directory. The > Python source distribution has them all, but (say) the Windows binary > installer selects only a few. The Japanese binary installer for > Windows installs a few more. > > * Installing new codecs becomes trivial - no need to hack site.py > etc - simply copy the new "codec module" to the encodings directory > and you are done. > > * No serious problem for GMcM's installer nor for freeze > > We would probably need to assume that certain codes exist for _all_ > platforms and language - but this is no different to assuming that > "exceptions.py" also exists for all platforms. > > Is this worthy of consideration? Why not... using the new registry scheme I proposed in the thread "Codecs and StreamCodecs" you could implement this via factory_functions and lazy imports (with the encoding name folded to make up a proper Python identifier, e.g. hyphens get converted to '' and spaces to '_'). I'd suggest grouping encodings: [encodings] [iso} [iso88591] [iso88592] [jis] ... [cyrillic] ... [misc] The unicodec registry could then query encodings.get(encoding,action) and the package would take care of the rest. Note that the "walk-me-up-scotty" import patch would probably be nice in this situation too, e.g. to reach the modules in [misc] or in higher levels such the ones in [iso] from [iso88591]. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 17 10:29:34 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:29:34 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Message-ID: <3832757E.B9503606@lemburg.com> Fredrik Lundh wrote: > > -------------------------------------------------------------------- > A PIL-like Unicode Codec Proposal > -------------------------------------------------------------------- > > In the PIL model, the codecs are called with a piece of data, and > returns the result to the caller. The codecs maintain internal state > when needed. > > class decoder: > > def decode(self, s, offset=0): > # decode as much data as we possibly can from the > # given string. 
if there's not enough data in the > # input string to form a full character, return > # what we've got this far (this might be an empty > # string). > > def flush(self): > # flush the decoding buffers. this should usually > # return None, unless the fact that knowing that the > # input stream has ended means that the state can be > # interpreted in a meaningful way. however, if the > # state indicates that there last character was not > # finished, this method should raise a UnicodeError > # exception. Could you explain for reason for having a .flush() method and what it should return. Note that the .decode method is not so much different from my Codec.decode method except that it uses a single offset where my version uses a slice (the offset is probably the better variant, because it avoids data truncation). > class encoder: > > def encode(self, u, offset=0, buffersize=0): > # encode data from the given offset in the input > # unicode string into a buffer of the given size > # (or slightly larger, if required to proceed). > # if the buffer size is 0, the decoder is free > # to pick a suitable size itself (if at all > # possible, it should make it large enough to > # encode the entire input string). returns a > # 2-tuple containing the encoded data, and the > # number of characters consumed by this call. Dito. > def flush(self): > # flush the encoding buffers. returns an ordinary > # string (which may be empty), or None. > > Note that a codec instance can be used for a single string; the codec > registry should hold codec factories, not codec instances. In > addition, you may use a single type or class to implement both > interfaces at once. Perhaps I'm missing something, but how would you define stream codecs using this interface ? > Implementing stream codecs is left as an exercise (see the zlib > material in the eff-bot guide for a decoder example). ...? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 17 10:55:05 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:55:05 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <38327B79.2415786B@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). Guido proposed the factory approach too, though not seperated into these 4 APIs (note that your proposal looks very much like what I had in the early version of my proposal). Anyway, I think that factory functions are the way to go, because they offer more flexibility w/r to reusing already instantiated codecs, importing modules on-the-fly as was suggested in another thread (thereby making codec module import lazy) or mapping encoder and decoder requests all to one class. So here's a new registry approach: unicodec.register(encoding,factory_function,action) with encoding - name of the supported encoding, e.g. 
Shift_JIS factory_function - a function that returns an object or function ready to be used for action action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read' The factory_function API depends on the implementation of the codec. The returned object's interface on the value of action: Codecs: ------- obj = factory_function_for_<action>(errors='strict') 'encode': obj(u,slice=None) -> Python string 'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed) factory_functions are free to return simple function objects for stateless encodings. StreamCodecs: ------------- obj = factory_function_for_<action>(stream,errors='strict') obj should provide access to all methods defined for the stream object, overriding these: 'stream write': obj.write(u,slice=None) -> bytes written to stream obj.flush() -> ??? 'stream read': obj.read(chunksize=0) -> (Unicode object, bytes read) obj.flush() -> ??? errors is defined like in my Codec spec. The codecs are expected to use this argument to handle error conditions. I'm not sure what Fredrik intended with the .flush() methods, so the definition is still open. I would expect it to do some finalization of state. Perhaps we need another set of actions for the .feed()/.close() approach... As in earlier version of the proposal: The registry should provide default implementations for missing action factory_functions using the other registered functions, e.g. 'stream write' can be emulated using 'encode' and 'stream read' using 'decode'. The same probably holds for feed approach. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one at email.msn.com Wed Nov 17 09:14:38 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:14:38 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <3831350B.8F69CB6D@lemburg.com> Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim> [MAL] > ... > Here is a sample implementation of what I had in mind: > > """ Demo for 'unicode-escape' encoding. > """ > import struct,string,re > > pack_format = '>H' > > def convert_string(s): > > l = map(None,s) > for i in range(len(l)): > l[i] = struct.pack(pack_format,ord(l[i])) > return l > > u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') > > def unicode_unescape(s): > > l = [] > start = 0 > while start < len(s): > m = u_escape.search(s,start) > if not m: > l[len(l):] = convert_string(s[start:]) > break > m_start,m_end = m.span() > if m_start > start: > l[len(l):] = convert_string(s[start:m_start]) > hexcode = m.group(1) > #print hexcode,start,m_start > if len(hexcode) != 4: > raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode > ordinal = string.atoi(hexcode,16) > l.append(struct.pack(pack_format,ordinal)) > start = m_end > #print l > return string.join(l,'') > > def hexstr(s,sep=''): > > return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % > ord(x),s),sep) It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. 
Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>: def unicode_unescape(s): from string import atoi import array i, n = 0, len(s) result = array.array('H') # unsigned short, native order while i < n: ch = s[i] i = i+1 if ch != "\\": result.append(ord(ch)) continue if i == n: raise ValueError("string ends with lone backslash") ch = s[i] i = i+1 if ch != "u": result.append(ord("\\")) result.append(ord(ch)) continue hexchars = s[i:i+4] if len(hexchars) != 4: raise ValueError("\\u escape at end not followed by " "at least 4 characters") i = i+4 for ch in hexchars: if ch not in "01234567890abcdefABCDEF": raise ValueError("\\u" + hexchars + " contains " "non-hex characters") result.append(atoi(hexchars, 16)) # print result return result.tostring() From tim_one at email.msn.com Wed Nov 17 09:47:48 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:48 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <383156DF.2209053F@lemburg.com> Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim> [MAL] > FYI, the next version of the proposal ... > File objects opened in text mode will use "t#" and binary ones use "s#". Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>. From tim_one at email.msn.com Wed Nov 17 09:47:46 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:46 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <383140F3.EDDB307A@lemburg.com> Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim> [Jack Jansen] > I would suggest adding the Dos, Windows and Macintosh standard > 8-bit charsets (their equivalents of latin-1) too, as documents > in these encoding are pretty ubiquitous. But maybe these should > only be added on the respective platforms. [MAL] > Good idea. What code pages would that be ? I'm not clear on what's being suggested; e.g., Windows supports *many* different "code pages". CP 1252 is default in the U.S., and is an extension of Latin-1. See e.g. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT which appears to be up-to-date (has 0x80 as the euro symbol, Unicode U+20AC -- although whether your version of U.S. Windows actually has this depends on whether you installed the service pack that added it!). See ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT for the closest DOS got. From tim_one at email.msn.com Wed Nov 17 10:05:21 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:05:21 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us> Message-ID: <000601bf30da$e069d820$a42d153f@tim> [Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred] > Yes, but still not in the core. So we have two general examples > (vrefs and mxProxy) and there's WeakDict (or something like that). I > think there really needs to be a core facility for this. This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. 
Buy him lunch tomorrow and get him excited <wink>. From tim_one at email.msn.com Wed Nov 17 10:10:24 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:10:24 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <000701bf30db$94d4ac40$a42d153f@tim> [Gordon McMillan] > ... > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. That's truly remarkable (he says while waiting for the Inbox Repair Tool to finish repairing his 50Mb Outlook mail file ...)! > but-since-its-WIndows-it-must-be-tainted-ly y'rs Indeed -- if it runs on Windows, it's a worthless piece o' crap <wink>. From fredrik at pythonware.com Wed Nov 17 12:00:10 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:00:10 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> M.-A. Lemburg <mal at lemburg.com> wrote: > > def flush(self): > > # flush the decoding buffers. this should usually > > # return None, unless the fact that knowing that the > > # input stream has ended means that the state can be > > # interpreted in a meaningful way. however, if the > > # state indicates that there last character was not > > # finished, this method should raise a UnicodeError > > # exception. > > Could you explain for reason for having a .flush() method > and what it should return. in most cases, it should either return None, or raise a UnicodeError exception: >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> # yes, that's a valid Swedish sentence ;-) >>> s = u.encode("utf-8") >>> d = decoder("utf-8") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() UnicodeError: last character not complete on the other hand, there are situations where it might actually return a string. consider a "HTML entity decoder" which uses the following pattern to match a character entity: "&\w+;?" (note that the trailing semicolon is optional). >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> s = u.encode("html-entities") >>> d = decoder("html-entities") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() "ö" > Perhaps I'm missing something, but how would you define > stream codecs using this interface ? input: read chunks of data, decode, and keep extra data in a local buffer. output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...) > > Implementing stream codecs is left as an exercise (see the zlib > material in the eff-bot guide for a decoder example). everybody should have a copy of the eff-bot guide ;-) (but alright, I plan to post a complete utf-8 implementation in a not too distant future). </F> From gstein at lyra.org Wed Nov 17 11:57:36 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38327F39.AA381647@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >...
> I'd suggest grouping encodings: > > [encodings] > [iso} > [iso88591] > [iso88592] > [jis] > ... > [cyrillic] > ... > [misc] WHY?!?! This is taking a simple solution and making it complicated. I see no benefit to the creating yet-another-level-of-hierarchy. Why should they be grouped? Leave the modules just under "encodings" and be done with it. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 12:14:01 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38327B79.2415786B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >... > Anyway, I think that factory functions are the way to go, > because they offer more flexibility w/r to reusing already > instantiated codecs, importing modules on-the-fly as was > suggested in another thread (thereby making codec module > import lazy) or mapping encoder and decoder requests all > to one class. Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO). > So here's a new registry approach: > > unicodec.register(encoding,factory_function,action) > > with > encoding - name of the supported encoding, e.g. Shift_JIS > factory_function - a function that returns an object > or function ready to be used for action > action - a string stating the supported action: > 'encode' > 'decode' > 'stream write' > 'stream read' This action thing is subject to error. *if* you're wanting to go this route, then have: unicodec.register_encode(...) unicodec.register_decode(...) unicodec.register_stream_write(...) unicodec.register_stream_read(...) They are equivalent. Guido has also told me in the past that he dislikes parameters that alter semantics -- preferring different functions instead. (this is why there are a good number of PyBufferObject interfaces; I had fewer to start with) This suggested approach is also quite a bit more wordy/annoying than Fred's alternative: unicode.register('iso-8859-1', encoder, decoder, None, None) And don't say "future compatibility allows us to add new actions." Well, those same future changes can add new registration functions or additional parameters to the single register() function. Not that I'm advocating it, but register() could also take a single parameter: if a class, then instantiate it and call methods for each action; if an instance, then just call methods for each action. [ and the third/original variety: a function object as the first param is the actual hook, and params 2 thru 4 (each are optional, or just the stream funcs?) are the other hook functions ] > The factory_function API depends on the implementation of > the codec. The returned object's interface on the value of action: > > Codecs: > ------- > > obj = factory_function_for_<action>(errors='strict') Where does this "errors" value come from? How does a user alter that value? Without an ability to change this, I see no reason for a factory. [ and no: don't tell me it is a thread-state value :-) ] On the other hand: presuming the "errors" thing is valid, *then* I see a need for a factory. Truly... I dislike factories. IMO, they just add code/complexity in many cases where the functionality isn't needed. 
But that's just me :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Wed Nov 17 12:17:00 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST) Subject: [Python-Dev] Rosette i18n API Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com> There is a very capable C++ library at http://rosette.basistech.com/ It is well worth looking at the things this API actually lets you do for ideas on patterns. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From gstein at lyra.org Wed Nov 17 12:21:18 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: > [MAL] > > FYI, the next version of the proposal ... > > File objects opened in text mode will use "t#" and binary ones use "s#". > > Am I the only one who sees magical distinctions between text and binary mode > as a Really Bad Idea? I wouldn't have guessed the Unix natives here would > quietly acquiesce to importing a bit of Windows madness <wink>. It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 17 12:36:32 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:36:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). so where do you put the state? how do you reset the state between strings? how do you handle incremental decoding/encoding? etc. (I suggest taking another look at PIL's codec design. it solves all these problems with a minimum of code, and it works -- people have been hammering on PIL for years...) </F> From gstein at lyra.org Wed Nov 17 12:34:30 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Fredrik Lundh wrote: > Greg Stein <gstein at lyra.org> wrote: > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > so where do you put the state? 
encode() is not supposed to retain state. It is supposed to do a complete translation. It is not a stream thingy, which may have received partial characters. > how do you reset the state between > strings? There is none :-) > how do you handle incremental > decoding/encoding? Streams. -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 17 12:46:01 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:46:01 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > - suggestions for new issues that maybe ought to be settled in 1.6 three things: imputil, imputil, imputil </F> From fredrik at pythonware.com Wed Nov 17 12:51:33 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:51:33 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > > so where do you put the state? > > encode() is not supposed to retain state. It is supposed to do a complete > translation. It is not a stream thingy, which may have received partial > characters. > > > how do you handle incremental > > decoding/encoding? > > Streams. hmm. why have two different mechanisms when you can do the same thing with one? </F> From gstein at lyra.org Wed Nov 17 14:01:47 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST) Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Guido van Rossum wrote: >... > Greg, I understand you have checkin privileges for Apache. What is > the procedure there for handing out those privileges? What is the > procedure for using them? (E.g. if you made a bogus change to part of > Apache you're not supposed to work on, what happens?) Somebody proposes that a person is added to the list of people with checkin privileges. If nobody else in the group vetoes that, then they're in (their system doesn't require continual participation by each member, so it can only operate at a veto level, rather than a unanimous assent). It is basically determined on the basis of merit -- has the person been active (on the Apache developer's mailing list) and has the person contributed something significant? Further, by providing commit access, will they further the goals of Apache? And, of course, does their temperament seem to fit in with the other group members? I can make any change that I'd like. However, there are about 20 other people who can easily revert or alter my changes if they're bogus. There are no programmatic restrictions.... You could say it is based on mutual respect and a social contract of behavior. Large changes should be discussed before committing to CVS. Bug fixes, doc enhancements, minor functional improvements, etc, all follow a commit-then-review process. I just check the thing in. Others see the diff (emailed to the checkins mailing list (this is different from Python-checkins which only says what files are changed, rather than providing the diff)) and can comment on the change, make their own changes, etc. To be concrete: I added the Expat code that now appears in Apache 1.3.9. 
Before doing so, I queried the group. There were some issues that I dealt with before finally commiting Expat to the CVS repository. On another occasion, I added a new API to Apache; again, I proposed it first, got an "all OK" and committed it. I've done a couple bug fixes which I just checked in. [ "all OK" means three +1 votes and no vetoes. everybody has veto ability (but the responsibility to explain why and to remove their veto when their concerns are addressed). ] On many occasions, I've reviewed the diffs that were posted to the checkins list, and made comments back to the author. I've caught a few problems this way. For Apache 2.0, even large changes are commit-then-review at this point. At some point, it will switch over to review-then-commit and the project will start moving towards stabilization/release. (bug fixes and stuff will always remain commit-then-review) I'll note that the process works very well given that diffs are emailed. I doubt that it would be effective if people had to fetch CVS diffs themselves. Your note also implies "areas of ownership". This doesn't really exist within Apache. There aren't even "primary authors" or things like that. I have the ability/rights to change any portions: from the low-level networking, to the documentation, to the server-side include processing. Of coures, if I'm going to make a big change, then I'll be posting a patch for review first, and whoever has worked in that area in the past may/will/should comment. Cheers, -g -- Greg Stein, http://www.lyra.org/ From guido at CNRI.Reston.VA.US Wed Nov 17 14:32:05 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:32:05 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST." <000801bf30de$85bea500$a42d153f@tim> References: <000801bf30de$85bea500$a42d153f@tim> Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us> > I'm specifically requesting not to have checkin privileges. So there. I will force nobody to use checkin privileges. However I see that for some contributors, checkin privileges will save me and them time. > I see two problems: > > 1. When patches go thru you, you at least eyeball them. This catches bugs > and design errors early. I will still eyeball them -- only after the fact. Since checkins are pretty public, being slapped on the wrist for a bad checkin is a pretty big embarrassment, so few contributors will check in buggy code more than once. Moreover, there will be more eyeballs. > 2. For a multi-platform app, few people have adequate resources for testing; > e.g., I can test under an obsolete version of Win95, and NT if I have to, > but that's it. You may not actually do better testing than that, but having > patches go thru you allows me the comfort of believing you do <wink>. I expect that the same mechanisms will apply. I have access to Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to check portability after things have been checked in. And again, there will be more testers. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:34:23 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:34:23 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST." 
<19991117075353.16046.rocketmail@web606.mail.yahoo.com> References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us> > This is the simplest if each codec really is likely to > be implemented in a separate module. But just look at > the data! All the iso-8859 encodings need identical > functionality, and just have a different mapping table > with 256 elements. It would be trivial to implement > these in one module. And the wide variety of Japanese > encodings (mostly corporate or historical variants of > the same character set) are again best treated from > one code base with a bunch of mapping tables and > routines to generate the variants - basically one can > store the deltas. > > So the choice is between possibly having a lot of > almost-dummy modules, or having Python modules which > generate and register a logical family of encodings. > > I may have some time next week and will try to code up > a few so we can pound on something. I see no problem with having a lot of near-dummy modules if it simplifies the architecture. You can still do code sharing. Files are cheap; APIs are expensive. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:38:35 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:38:35 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST." <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us> > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Agreed. Tim Peters once remarked that Python likes shallow encodings (or perhaps that *I* like them :-). This is one such case where I would strongly urge for the simplicity of a shallow hierarchy. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:43:44 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:43:44 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST." <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us> > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). Unless there are certain cases where factories are useful. But let's read on... > > action - a string stating the supported action: > > 'encode' > > 'decode' > > 'stream write' > > 'stream read' > > This action thing is subject to error. *if* you're wanting to go this > route, then have: > > unicodec.register_encode(...) > unicodec.register_decode(...) > unicodec.register_stream_write(...) > unicodec.register_stream_read(...) > > They are equivalent. Guido has also told me in the past that he dislikes > parameters that alter semantics -- preferring different functions instead. Yes, indeed! (But weren't we going to do away with the whole registry idea in favor of an encodings package?) 
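A minimal sketch of what such a package-based lookup might look like (the package name "encodings", the hyphen-to-underscore normalization and the per-module attribute names are illustrative assumptions here, not a settled API):

_codec_cache = {}

def lookup(encoding):
    # Normalize the encoding name to a module name, e.g. 'ISO-8859-1'
    # becomes 'iso_8859_1', and import encodings.<name> lazily.
    modname = encoding.lower().replace('-', '_').replace(' ', '_')
    try:
        return _codec_cache[modname]
    except KeyError:
        module = __import__('encodings.' + modname, None, None, [modname])
        # Pick up whatever well-known names the module chooses to define;
        # missing entries simply stay None.
        codecs = (getattr(module, 'encode', None),
                  getattr(module, 'decode', None),
                  getattr(module, 'stream_writer', None),
                  getattr(module, 'stream_reader', None))
        _codec_cache[modname] = codecs
        return codecs

The factory-versus-function question discussed below is orthogonal to this: the entries in the tuple could just as well be factories.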
> Not that I'm advocating it, but register() could also take a single > parameter: if a class, then instantiate it and call methods for each > action; if an instance, then just call methods for each action. Nah, that's bad -- a class is just a factory, and once you are allowing classes it's really good to also allowing factory functions. > [ and the third/original variety: a function object as the first param is > the actual hook, and params 2 thru 4 (each are optional, or just the > stream funcs?) are the other hook functions ] Fine too. They should all be optional. > > obj = factory_function_for_<action>(errors='strict') > > Where does this "errors" value come from? How does a user alter that > value? Without an ability to change this, I see no reason for a factory. > [ and no: don't tell me it is a thread-state value :-) ] > > On the other hand: presuming the "errors" thing is valid, *then* I see a > need for a factory. The idea is that various places that take an encoding name can also take a codec instance. So the user can call the factory function / class constructor. > Truly... I dislike factories. IMO, they just add code/complexity in many > cases where the functionality isn't needed. But that's just me :-) Get over it... In a sense, every Python class is a factory for its own instances! I think you must be confusing Python with Java or C++. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:56:56 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:56:56 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST." <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us> > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then they're > in (their system doesn't require continual participation by each member, > so it can only operate at a veto level, rather than a unanimous assent). > It is basically determined on the basis of merit -- has the person been > active (on the Apache developer's mailing list) and has the person > contributed something significant? Further, by providing commit access, > will they further the goals of Apache? And, of course, does their > temperament seem to fit in with the other group members? This makes sense, but I have one concern: if somebody who isn't liked very much (say a capable hacker who is a real troublemaker) asks for privileges, would people veto this? I'd be reluctant to go on record as veto'ing a particular person. (E.g. there are a few troublemakers in c.l.py, and I would never want them to join python-dev let alone give them commit privileges, but I'm not sure if I would want to discuss this on a publicly archived mailing list -- or even on a privately archived mailing list, given that the number of members might be in the hundreds. [...stuff I like...] > I'll note that the process works very well given that diffs are emailed. I > doubt that it would be effective if people had to fetch CVS diffs > themselves. That's a great idea; I'll see if we can do that to our checkin email, regardless of whether we hand out commit privileges. > Your note also implies "areas of ownership". This doesn't really exist > within Apache. 
There aren't even "primary authors" or things like that. I > have the ability/rights to change any portions: from the low-level > networking, to the documentation, to the server-side include processing. But that's Apache, which is explicitly run as a collective. In Python, I definitely want to have ownership of certain sections of the code. But I agree that this doesn't need to be formalized by access control lists; the social process you describe sounds like it will work just fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Wed Nov 17 15:44:25 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST) Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <000601bf30da$e069d820$a42d153f@tim> References: <14385.33486.855802.187739@weyr.cnri.reston.va.us> <000601bf30da$e069d820$a42d153f@tim> Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Tim Peters writes: > about it. Guido hasn't shown visible interest, and nobody has been willing > to fight him to the death over it. So it languishes. Buy him lunch > tomorrow and get him excited <wink>. Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From tim_one at email.msn.com Thu Nov 18 04:21:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:16 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <38327D8F.7A5352E6@lemburg.com> Message-ID: <000101bf3173$f9805340$c0a0143f@tim> [MAL] > Guido and I have decided to turn \uXXXX into a standard > escape sequence with no further magic applied. \uXXXX will > only be expanded in u"" strings. Does that exclude ur"" strings? Not arguing either way, just don't know what all this means. > Here's the new scheme: > > With the 'unicode-escape' encoding being defined as: > > ? all non-escape characters represent themselves as a Unicode ordinal > (e.g. 'a' -> U+0061). Same as before (scream if that's wrong). > ? all existing defined Python escape sequences are interpreted as > Unicode ordinals; Same as before (ditto). > note that \xXXXX can represent all Unicode ordinals, This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition? > and \OOO (octal) can represent Unicode ordinals up to U+01FF. Same as before (ditto). > ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > error to have fewer than 4 digits after \u. Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes. > Examples: > > u'abc' -> U+0061 U+0062 U+0063 > u'\u1234' -> U+1234 > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c The last example is damaged (U+05c isn't legit). Other than that, these look the same as before. > Now how should we define ur"abc\u1234\n" ... ? If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. 
If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it). From tim_one at email.msn.com Thu Nov 18 04:21:19 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:19 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim> [MAL] > File objects opened in text mode will use "t#" and binary > ones use "s#". [Greg Stein] > ... > The real annoying thing would be to assume that opening a file as 'r' > means that I *meant* text mode and to start using "t#". Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode. > In actuality, I typically open files that way since I do most of my > coding on Linux. If I now have to pay attention to things and open it > as 'rb', then I'll be pissed. > > And the change in behavior and bugs that interpreting 'r' as text would > introduce? Ack! 'r' is already intepreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be. From tim_one at email.msn.com Thu Nov 18 04:23:00 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:23:00 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us> Message-ID: <000301bf3174$37b465c0$c0a0143f@tim> [Guido] > I will force nobody to use checkin privileges. That almost went without saying <wink>. > However I see that for some contributors, checkin privileges will > save me and them time. Then it's Good! Provided it doesn't hurt language stability. I agree that changing the system to mail out diffs addresses what I was worried about there. From tim_one at email.msn.com Thu Nov 18 04:31:38 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:31:38 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us> Message-ID: <000401bf3175$6c089660$c0a0143f@tim> [Greg] > ... > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then ? they're in ... [Guido] > This makes sense, but I have one concern: if somebody who isn't liked > very much (say a capable hacker who is a real troublemaker) asks for > privileges, would people veto this? It seems that a key point in Greg's description is that people don't propose *themselves* for checkin. They have to talk someone else into proposing them. That should keep Endang out of the running for a few years <wink>. After that, I care more about their code than their personalities. If the stuff they check in is good, fine; if it's not, lock 'em out for direct cause. > I'd be reluctant to go on record as veto'ing a particular person. 
Secret Ballot run off a web page -- although not so secret you can't see who voted for what <wink>. From tim_one at email.msn.com Thu Nov 18 04:37:18 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:37:18 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim> [Fred L. Drake, Jr.] > Guido has asked me to pursue this topic [weak refs], so I'll be > checking out available implementations and seeing if any are > adoptable or if something different is needed to be fully general > and well-integrated. Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate). From gstein at lyra.org Thu Nov 18 09:09:24 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <Pine.LNX.4.10.9911180008020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: >... > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Thu Nov 18 09:15:04 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:15:04 +0100 Subject: [Python-Dev] just say no... References: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <3833B588.1E31F01B@lemburg.com> Tim Peters wrote: > > [MAL] > > File objects opened in text mode will use "t#" and binary > > ones use "s#". > > [Greg Stein] > > ... > > The real annoying thing would be to assume that opening a file as 'r' > > means that I *meant* text mode and to start using "t#". > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > either; a lone "r" has always meant text mode. Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardeless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance. > > In actuality, I typically open files that way since I do most of my > > coding on Linux. If I now have to pay attention to things and open it > > as 'rb', then I'll be pissed. > > > > And the change in behavior and bugs that interpreting 'r' as text would > > introduce? Ack! > > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. 
Only other buffer compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself arguing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
From mal at lemburg.com Thu Nov 18 09:59:21 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:59:21 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Message-ID: <3833BFE9.6FD118B1@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > - suggestions for new issues that maybe ought to be settled in 1.6 > > three things: imputil, imputil, imputil But please don't add the current version as default importer... its strategy is way too slow for real life apps (yes, I've tested this: imports typically take twice as long as with the builtin importer). I'd opt for an import manager which provides a useful API for import hooks to register themselves with. What we really need is not yet another complete reimplementation of what the builtin importer does, but rather a more detailed exposure of the various import aspects: finding modules and loading modules. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
From mal at lemburg.com Thu Nov 18 09:50:36 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:50:36 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> Message-ID: <3833BDDC.7CD2CC1F@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg <mal at lemburg.com> wrote: > > > def flush(self): > > > # flush the decoding buffers. this should usually > > > # return None, unless the fact that knowing that the > > > # input stream has ended means that the state can be > > > # interpreted in a meaningful way. however, if the > > > # state indicates that the last character was not > > > # finished, this method should raise a UnicodeError > > > # exception. > > > > Could you explain the reason for having a .flush() method > > and what it should return. > > in most cases, it should either return None, or > raise a UnicodeError exception: > > >>> u = unicode("å i åa ä e ö", "iso-latin-1") > >>> # yes, that's a valid Swedish sentence ;-) > >>> s = u.encode("utf-8") > >>> d = decoder("utf-8") > >>> d.decode(s[:-1]) > "å i åa ä e " > >>> d.flush() > UnicodeError: last character not complete > > on the other hand, there are situations where it > might actually return a string. consider a "HTML > entity decoder" which uses the following pattern > to match a character entity: "&\w+;?" (note that > the trailing semicolon is optional). > > >>> u = unicode("å i åa ä e ö", "iso-latin-1") > >>> s = u.encode("html-entities") > >>> d = decoder("html-entities") > >>> d.decode(s[:-1]) > "å i åa ä e " > >>> d.flush() > "ö" Ah, ok. So the .flush() method checks for proper string endings and then either returns the remaining input or raises an error. > > Perhaps I'm missing something, but how would you define > > stream codecs using this interface ? > > input: read chunks of data, decode, and > keep extra data in a local buffer. > > output: encode data into suitable chunks, > and write to the output stream (that's why > there's a buffersize argument to encode -- > if someone writes a 10mb unicode string to > an encoded stream, python shouldn't allocate > an extra 10-30 megabytes just to be able to > encode the darn thing...) So the stream codecs would be wrappers around the string codecs. Have you read my latest version of the Codec interface ? Wouldn't that be a reasonable approach ? Note that I have integrated your ideas into the new API -- it's basically only missing the .flush() methods, which I can add now that I know what you meant. > > > Implementing stream codecs is left as an exercise (see the zlib > > > material in the eff-bot guide for a decoder example). > > everybody should have a copy of the eff-bot guide ;-) Sure, but the format, the format... make it printed and add a CD and you would probably have a good selling book there ;-) > (but alright, I plan to post a complete utf-8 implementation > in a not too distant future). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
From mal at lemburg.com Thu Nov 18 09:16:48 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:16:48 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <3833B5F0.FA4620AD@lemburg.com> Greg Stein wrote: > > On Wed, 17 Nov 1999, M.-A. Lemburg wrote: > >... > > I'd suggest grouping encodings: > > > > [encodings] > > [iso] > > [iso88591] > > [iso88592] > > [jis] > > ... > > [cyrillic] > > ... > > [misc] > > WHY?!?! > > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Nevermind, was just an idea... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
From mal at lemburg.com Thu Nov 18 09:43:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:43:31 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> <199911171343.IAA03636@kaluha.cnri.reston.va.us> Message-ID: <3833BC33.66E134F@lemburg.com> Guido van Rossum wrote: > > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > Unless there are certain cases where factories are useful. But let's > read on... > > > > action - a string stating the supported action: > > > 'encode' > > > 'decode' > > > 'stream write' > > > 'stream read' > > > > This action thing is subject to error. *if* you're wanting to go this > > route, then have: > > > > unicodec.register_encode(...) > > unicodec.register_decode(...) > > unicodec.register_stream_write(...) > > unicodec.register_stream_read(...) > > > > They are equivalent.
Guido has also told me in the past that he dislikes > > parameters that alter semantics -- preferring different functions instead. > > Yes, indeed! Ok. > (But weren't we going to do away with the whole registry > idea in favor of an encodings package?) One way or another, the Unicode implementation will have to access a dictionary containing references to the codecs for a particular encoding. You won't get around registering these at some point... be it in a lazy way, on-the-fly or by some other means. What we could do is implement the lookup like this: 1. call encodings.lookup_<action>(encoding) and use the return value for the conversion 2. if all fails, cop out with an error Step 1. would do all the import magic and then register the found codecs in some dictionary for faster access (perhaps this could be done in a way that is directly available to the Unicode implementation, e.g. in a global internal dictionary -- the one I originally had in mind for the unicodec registry). > > Not that I'm advocating it, but register() could also take a single > > parameter: if a class, then instantiate it and call methods for each > > action; if an instance, then just call methods for each action. > > Nah, that's bad -- a class is just a factory, and once you are > allowing classes it's really good to also allowing factory functions. > > > [ and the third/original variety: a function object as the first param is > > the actual hook, and params 2 thru 4 (each are optional, or just the > > stream funcs?) are the other hook functions ] > > Fine too. They should all be optional. Ok. > > > obj = factory_function_for_<action>(errors='strict') > > > > Where does this "errors" value come from? How does a user alter that > > value? Without an ability to change this, I see no reason for a factory. > > [ and no: don't tell me it is a thread-state value :-) ] > > > > On the other hand: presuming the "errors" thing is valid, *then* I see a > > need for a factory. > > The idea is that various places that take an encoding name can also > take a codec instance. So the user can call the factory function / > class constructor. Right. The argument is reachable via: Codec = encodings.lookup_encode('utf-8') codec = Codec(errors='?') s = codec(u"abc????") s would then equal 'abc??'. -- Should I go ahead then and change the registry business to the new strategy (via the encodings package in the above sense) ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Thu Nov 18 11:57:44 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 18 Nov 1999 21:57:44 +1100 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833BC33.66E134F@lemburg.com> Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat> [Guido] > > (But weren't we going to do away with the whole registry > > idea in favor of an encodings package?) > [MAL] > One way or another, the Unicode implementation will have to > access a dictionary containing references to the codecs for > a particular encoding. You won't get around registering these > at some point... be it in a lazy way, on-the-fly or by some > other means. What is wrong with my idea of using well-known-names from the encoding module? The dict then is "encodings.<encoding-name>.__dict__". All encodings "just work" because the leverage from the Python module system. 
Unless Im missing something, there is no need for any extra registry at all. I guess it would actually resolve to 2 dict lookups, but thats OK surely? Mark. From mal at lemburg.com Thu Nov 18 10:39:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 10:39:30 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> Message-ID: <3833C952.C6F154B1@lemburg.com> Tim Peters wrote: > > [MAL] > > Guido and I have decided to turn \uXXXX into a standard > > escape sequence with no further magic applied. \uXXXX will > > only be expanded in u"" strings. > > Does that exclude ur"" strings? Not arguing either way, just don't know > what all this means. > > > Here's the new scheme: > > > > With the 'unicode-escape' encoding being defined as: > > > > ? all non-escape characters represent themselves as a Unicode ordinal > > (e.g. 'a' -> U+0061). > > Same as before (scream if that's wrong). > > > ? all existing defined Python escape sequences are interpreted as > > Unicode ordinals; > > Same as before (ditto). > > > note that \xXXXX can represent all Unicode ordinals, > > This means that the definition of \xXXXX has changed, then -- as you pointed > out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new > \x definition apply only in u"" strings, or in "" strings too? What is the > new \x definition? Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX). > > and \OOO (octal) can represent Unicode ordinals up to U+01FF. > > Same as before (ditto). > > > ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > > error to have fewer than 4 digits after \u. > > Same as before (ditto). > > IOW, I don't see anything that's changed other than an unspecified new > treatment of \x escapes, and possibly that ur"" strings don't expand \u > escapes. The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan). > > Examples: > > > > u'abc' -> U+0061 U+0062 U+0063 > > u'\u1234' -> U+1234 > > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c > > The last example is damaged (U+05c isn't legit). Other than that, these > look the same as before. Corrected; thanks. > > Now how should we define ur"abc\u1234\n" ... ? > > If strings carried an encoding tag with them, the obvious answer is that > this acts exactly like r"abc\u1234\n" acts today except gets a > "unicode-escaped" encoding tag instead of a "[whatever the default is > today]" encoding tag. > > If strings don't carry an encoding tag with them, you're in a bit of a > pickle: you'll have to convert it to a regular string or a Unicode string, > but in either case have no way to communicate that it may need further > processing; i.e., no way to distinguish it from a regular or Unicode string > produced by any other mechanism. The code I posted yesterday remains my > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > fiddling with backslashes just enough to get the \u escapes expanded, in the > same way Java's (conceptual) preprocessor does it). They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. 
scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 12:41:32 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 12:41:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> Message-ID: <3833E5EC.AAFE5016@lemburg.com> Mark Hammond wrote: > > [Guido] > > > (But weren't we going to do away with the whole registry > > > idea in favor of an encodings package?) > > > [MAL] > > One way or another, the Unicode implementation will have to > > access a dictionary containing references to the codecs for > > a particular encoding. You won't get around registering these > > at some point... be it in a lazy way, on-the-fly or by some > > other means. > > What is wrong with my idea of using well-known-names from the encoding > module? The dict then is "encodings.<encoding-name>.__dict__". All > encodings "just work" because the leverage from the Python module > system. Unless Im missing something, there is no need for any extra > registry at all. I guess it would actually resolve to 2 dict lookups, > but thats OK surely? The problem is that the encoding names are not Python identifiers, e.g. iso-8859-1 is allowed as identifier. This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary. I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package... BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good. PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gmcm at hypernet.com Thu Nov 18 15:19:48 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 09:19:48 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <3833BFE9.6FD118B1@lemburg.com> Message-ID: <1269187709-18981857@hypernet.com> Marc-Andre wrote: > Fredrik Lundh wrote: > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > three things: imputil, imputil, imputil > > But please don't add the current version as default importer... > its strategy is way too slow for real life apps (yes, I've tested > this: imports typically take twice as long as with the builtin > importer). 
I think imputil's emulation of the builtin importer is more of a demonstration than a serious implementation. As for speed, it depends on the test. > I'd opt for an import manager which provides a useful API for > import hooks to register themselves with. I think that rather than blindly chain themselves together, there should be a simple minded manager. This could let the programmer prioritize them. > What we really need > is not yet another complete reimplementation of what the > builtin importer does, but rather a more detailed exposure of > the various import aspects: finding modules and loading modules. The first clause I sort of agree with - the current implementation is a fine implementation of a filesystem directory based importer. I strongly disagree with the second clause. The current import hooks are just such a detailed exposure; and they are incomprehensible and unmanagable. I guess you want to tweak the "finding" part of the builtin import mechanism. But that's no reason to ask all importers to break themselves up into "find" and "load" pieces. It's a reason to ask that the standard importer be, in some sense, "subclassable" (ie, expose hooks, or perhaps be an extension class like thingie). - Gordon From jim at interet.com Thu Nov 18 15:39:20 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 09:39:20 -0500 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> Message-ID: <38340F98.212F61@interet.com> Gordon McMillan wrote: > > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. IMHO the current import mechanism is good for developers who must work on the library code in the directory tree, but a disaster for sysadmins who must distribute Python applications either internally to a number of machines or commercially. What we need is a standard Python library file like a Java "Jar" file. Imputil can support this as 130 lines of Python. I have also written one in C. I like the imputil approach, but if we want to add a library importer to import.c, I volunteer to write it. I don't want to just add more complicated and unmanageable hooks which people will all use different ways and just add to the confusion. It is easy to install packages by just making them into a library file and throwing it into a directory. So why aren't we doing it? Jim Ahlstrom From guido at CNRI.Reston.VA.US Thu Nov 18 16:30:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:30:28 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST." 
<1269187709-18981857@hypernet.com> References: <1269187709-18981857@hypernet.com> Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us> Gordon McMillan wrote: > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. Agreed. I like some of imputil's features, but I think the API need to be redesigned. > > I'd opt for an import manager which provides a useful API for > > import hooks to register themselves with. > > I think that rather than blindly chain themselves together, there > should be a simple minded manager. This could let the > programmer prioritize them. Indeed. (A list of importers has been suggested, to replace the list of directories currently used.) > > What we really need > > is not yet another complete reimplementation of what the > > builtin importer does, but rather a more detailed exposure of > > the various import aspects: finding modules and loading modules. > > The first clause I sort of agree with - the current > implementation is a fine implementation of a filesystem > directory based importer. > > I strongly disagree with the second clause. The current import > hooks are just such a detailed exposure; and they are > incomprehensible and unmanagable. Based on how many people have successfully written import hooks, I have to agree. :-( > I guess you want to tweak the "finding" part of the builtin > import mechanism. But that's no reason to ask all importers > to break themselves up into "find" and "load" pieces. It's a > reason to ask that the standard importer be, in some sense, > "subclassable" (ie, expose hooks, or perhaps be an extension > class like thingie). Agreed. Subclassing is a good way towards flexibility. And Jim Ahlstrom writes: > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. Unfortunately, you're right. :-( > What we need is a standard Python library file like a Java "Jar" > file. Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want to > add a library importer to import.c, I volunteer to write it. Please volunteer to design or at least review the grand architecture -- see below. > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. You're so right! > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Rhetorical question. :-) So here's a challenge: redesign the import API from scratch. Let me start with some requirements. 
Compatibility issues: --------------------- - the core API may be incompatible, as long as compatibility layers can be provided in pure Python - support for rexec functionality - support for freeze functionality - load .py/.pyc/.pyo files and shared libraries from files - support for packages - sys.path and sys.modules should still exist; sys.path might have a slightly different meaning - $PYTHONPATH and $PYTHONHOME should still be supported (I wouldn't mind a splitting up of importdl.c into several platform-specific files, one of which is chosen by the configure script; but that's a bit of a separate issue.) New features: ------------- - Integrated support for Greg Ward's distribution utilities (i.e. a module prepared by the distutil tools should install painlessly) - Good support for prospective authors of "all-in-one" packaging tool authors like Gordon McMillan's win32 installer or /F's squish. (But I *don't* require backwards compatibility for existing tools.) - Standard import from zip or jar files, in two ways: (1) an entry on sys.path can be a zip/jar file instead of a directory; its contents will be searched for modules or packages (2) a file in a directory that's on sys.path can be a zip/jar file; its contents will be considered as a package (note that this is different from (1)!) I don't particularly care about supporting all zip compression schemes; if Java gets away with only supporting gzip compression in jar files, so can we. - Easy ways to subclass or augment the import mechanism along different dimensions. For example, while none of the following features should be part of the core implementation, it should be easy to add any or all: - support for a new compression scheme to the zip importer - support for a new archive format, e.g. tar - a hook to import from URLs or other data sources (e.g. a "module server" imported in CORBA) (this needn't be supported through $PYTHONPATH though) - a hook that imports from compressed .py or .pyc/.pyo files - a hook to auto-generate .py files from other filename extensions (as currently implemented by ILU) - a cache for file locations in directories/archives, to improve startup time - a completely different source of imported modules, e.g. for an embedded system or PalmOS (which has no traditional filesystem) - Note that different kinds of hooks should (ideally, and within reason) properly combine, as follows: if I write a hook to recognize .spam files and automatically translate them into .py files, and you write a hook to support a new archive format, then if both hooks are installed together, it should be possible to find a .spam file in an archive and do the right thing, without any extra action. Right? - It should be possible to write hooks in C/C++ as well as Python - Applications embedding Python may supply their own implementations, default search path, etc., but don't have to if they want to piggyback on an existing Python installation (even though the latter is fraught with risk, it's cheaper and easier to understand). Implementation: --------------- - There must clearly be some code in C that can import certain essential modules (to solve the chicken-or-egg problem), but I don't mind if the majority of the implementation is written in Python. Using Python makes it easy to subclass. - In order to support importing from zip/jar files using compression, we'd at least need the zlib extension module and hence libz itself, which may not be available everywhere. 
- I suppose that the bootstrap is solved using a mechanism very similar to what freeze currently used (other solutions seem to be platform dependent). - I also want to still support importing *everything* from the filesystem, if only for development. (It's hard enough to deal with the fact that exceptions.py is needed during Py_Initialize(); I want to be able to hack on the import code written in Python without having to rebuild the executable all the time. Let's first complete the requirements gathering. Are these requirements reasonable? Will they make an implementation too complex? Am I missing anything? Finally, to what extent does this impact the desire for dealing differently with the Python bytecode compiler (e.g. supporting optimizers written in Python)? And does it affect the desire to implement the read-eval-print loop (the >>> prompt) in Python? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 16:37:49 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:37:49 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100." <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us> > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.) > This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. But it could be enough to register a package where to look for encodings (in addition to the system package). Or there could be a registry for encoding search functions. (See the import discussion.) > I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any > user interaction. There would only have to be an API for > looking up an encoding published by the encodings package for > the Unicode implementation to use. The magic behind that API > is left to the encodings package... I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do. > BTW, nothing's wrong with your idea :-) In fact, I like it > a lot because it keeps the encoding modules out of the > top-level scope which is good. Yes. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, > e.g. stream ciphers or pickle mechanisms. The step in that > direction is not a big one: we'd only have to drop the specification > of the Unicode object in the spec and replace it with an arbitrary > object. Of course, this will still have to be a Unicode object > for use by the Unicode implementation. This is a step towards Java's architecture of stackable streams. 
But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Thu Nov 18 16:52:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 16:52:26 +0100 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com> Message-ID: <383420BA.EF8A6AC5@lemburg.com> [imputil and friends] "James C. Ahlstrom" wrote: > > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. What we > need is a standard Python library file like a Java "Jar" file. > Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want > to add a library importer to import.c, I volunteer to write it. > > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. > > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Perhaps we ought to rethink the strategy under a different light: what are the real requirement we have for Python imports ? Perhaps the outcome is only the addition of say one or two features and those can probably easily be added to the builtin system... then we can just forget about the whole import hook dilema for quite a while (AFAIK, this is how we got packages into the core -- people weren't happy with the import hook). Well, just an idea... I have other threads to follow :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Thu Nov 18 17:01:47 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. This isn't a substantial problem. Try this on for size (probably not too different from what everyone is already thinking, but let's make it clear). This could be in encodings/__init__.py; I've tried to be really clear on the names. (No testing, only partially complete.) 
------------------------------------------------------------------------ import string import sys try: from cStringIO import StringIO except ImportError: from StringIO import StringIO class EncodingError(Exception): def __init__(self, encoding, error): self.encoding = encoding self.strerror = "%s %s" % (error, `encoding`) self.error = error Exception.__init__(self, encoding, error) _registry = {} def registerEncoding(encoding, encode=None, decode=None, make_stream_encoder=None, make_stream_decoder=None): encoding = encoding.lower() if _registry.has_key(encoding): info = _registry[encoding] else: info = _registry[encoding] = Codec(encoding) info._update(encode, decode, make_stream_encoder, make_stream_decoder) def getCodec(encoding): encoding = encoding.lower() if _registry.has_key(encoding): return _registry[encoding] # load the module modname = "encodings." + encoding.replace("-", "_") try: __import__(modname) except ImportError: raise EncodingError("unknown uncoding " + `encoding`) # if the module registered, use the codec as-is: if _registry.has_key(encoding): return _registry[encoding] # nothing registered, use well-known names module = sys.modules[modname] codec = _registry[encoding] = Codec(encoding) encode = getattr(module, "encode", None) decode = getattr(module, "decode", None) make_stream_encoder = getattr(module, "make_stream_encoder", None) make_stream_decoder = getattr(module, "make_stream_decoder", None) codec._update(encode, decode, make_stream_encoder, make_stream_decoder) class Codec: __encode = None __decode = None __stream_encoder_factory = None __stream_decoder_factory = None def __init__(self, name): self.name = name def encode(self, u): if self.__stream_encoder_factory: sio = StringIO() encoder = self.__stream_encoder_factory(sio) encoder.write(u) encoder.flush() return sio.getvalue() else: raise EncodingError("no encoder available for " + `self.name`) # similar for decode()... def make_stream_encoder(self, target): if self.__stream_encoder_factory: return self.__stream_encoder_factory(target) elif self.__encode: return DefaultStreamEncoder(target, self.__encode) else: raise EncodingError("no encoder available for " + `self.name`) # similar for make_stream_decoder()... def _update(self, encode, decode, make_stream_encoder, make_stream_decoder): self.__encode = encode or self.__encode self.__decode = decode or self.__decode self.__stream_encoder_factory = ( make_stream_encoder or self.__stream_encoder_factory) self.__stream_decoder_factory = ( make_stream_decoder or self.__stream_decoder_factory) ------------------------------------------------------------------------ > I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any No problem at all; we just need to make sure the right magic is there for the "normal" case. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, File formats are different from text encodings, so let's keep them separate. Yes, a registry can be a good approach whenever the various things being registered are sufficiently similar semantically, but the behavior of the registry/lookup can be very different for each type of thing. Let's not over-generalize. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Thu Nov 18 17:02:45 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) 
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us> Er, I should note that the sample code I just sent makes use of string methods. ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Thu Nov 18 17:23:09 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 17:23:09 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> Message-ID: <383427ED.45A01BBB@lemburg.com> Guido van Rossum wrote: > > > The problem is that the encoding names are not Python identifiers, > > e.g. iso-8859-1 is allowed as identifier. > > This is easily taken care of by translating each string of consecutive > non-identifier-characters to an underscore, so this would import the > iso_8859_1.py module. (I also noticed in an earlier post that the > official name for Shift_JIS has an underscore, while most other > encodings use hyphens.) Right. That's one way of doing it. > > This and > > the fact that applications may want to ship their own codecs (which > > do not get installed under the system wide encodings package) > > make the registry necessary. > > But it could be enough to register a package where to look for > encodings (in addition to the system package). > > Or there could be a registry for encoding search functions. (See the > import discussion.) Like a path of search functions ? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be, but even a few hundred function call will slow down the Unicode implementation quite a bit. The implementation could proceed as follows: def lookup(encoding): codecs = _internal_dict.get(encoding,None) if codecs: return codecs for query in sys.encoders: codecs = query(encoding) if codecs: break else: raise UnicodeError,'unkown encoding: %s' % encoding _internal_dict[encoding] = codecs return codecs For simplicity, codecs should be a tuple (encoder,decoder, stream_writer,stream_reader) of factory functions. ...that is if we can agree on these 4 APIs :-) Here are my current versions: ----------------------------------------------------------------------- class Codec: """ Defines the interface for stateless encoders/decoders. """ def __init__(self,errors='strict'): """ Creates a Codec instance. The Codec may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.errors = errors def encode(self,u,slice=None): """ Return the Unicode object u encoded as Python string. If slice is given (as slice object), only the sliced part of the Unicode object is encoded. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... def decode(self,s,offset=0): """ Decodes data from the Python string s and returns a tuple (Unicode object, bytes consumed). If offset is given, the decoding process starts at s[offset]. 
It defaults to 0. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... StreamWriter and StreamReader define the interface for stateful encoders/decoders: class StreamWriter(Codec): def __init__(self,stream,errors='strict'): """ Creates a StreamWriter instance. stream must be a file-like object open for writing (binary) data. The StreamWriter may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def write(self,u,slice=None): """ Writes the Unicode object's contents encoded to self.stream and returns the number of bytes written. If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def flush(self): """ Flushed the codec buffers used for keeping state. Returns values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used. """ pass class StreamReader(Codec): def __init__(self,stream,errors='strict'): """ Creates a StreamReader instance. stream must be a file-like object open for reading (binary) data. The StreamReader may implement different error handling schemes by providing the errors argument. These parameters are defined: 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def read(self,chunksize=0): """ Decodes data from the stream self.stream and returns a tuple (Unicode object, bytes consumed). chunksize indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value 0 indicates to read and decode as much as possible. The chunksize is intended to prevent having to decode huge files in one step. """ ... the base class should provide a default implementation of this method using self.decode ... def flush(self): """ Flushed the codec buffers used for keeping state. Returns values are not defined. Implementations are free to return None, raise an exception (in case there is pending data in the buffers which could not be decoded) or return any remaining data from the state buffers used. """ In addition to the above methods, the StreamWriter and StreamReader instances should also provide access to all other methods defined for the stream object. Stream codecs are free to combine the StreamWriter and StreamReader interfaces into one class. ----------------------------------------------------------------------- > > I don't see a problem with the registry though -- the encodings > > package can take care of the registration process without any > > user interaction. There would only have to be an API for > > looking up an encoding published by the encodings package for > > the Unicode implementation to use. The magic behind that API > > is left to the encodings package... 
> > I think that the collection of encodings will eventually grow large > enough to make it a requirement to avoid doing work proportional to > the number of supported encodings at startup (or even when an encoding > is referenced for the first time). Any "lazy" mechanism (of which > module search is an example) will do. Right. The list of search functions should provide this kind of lazyness. It also provides ways to implement other strategies to look for codecs, e.g. PIL could provide such a search function for its codecs, mxCrypto for the included ciphers, etc. > > BTW, nothing's wrong with your idea :-) In fact, I like it > > a lot because it keeps the encoding modules out of the > > top-level scope which is good. > > Yes. > > > PS: we could probably even take the whole codec idea one step > > further and also allow other input/output formats to be registered, > > e.g. stream ciphers or pickle mechanisms. The step in that > > direction is not a big one: we'd only have to drop the specification > > of the Unicode object in the spec and replace it with an arbitrary > > object. Of course, this will still have to be a Unicode object > > for use by the Unicode implementation. > > This is a step towards Java's architecture of stackable streams. > > But I'm always in favor of tackling what we know we need before > tackling the most generalized version of the problem. Well, I just wanted to mention the possibility... might be something to look into next year. I find it rather thrilling to be able to create encrypted streams by just hooking together a few stream codecs... f = open('myfile.txt','w') CipherWriter = sys.codec('rc5-cipher')[3] sf = StreamWriter(f,key='xxxxxxxx') UTF8Writer = sys.codec('utf-8')[3] sfx = UTF8Writer(sf) sfx.write('asdfasdfasdfasdf') sfx.close() Hmm, we should probably define the additional constructor arguments to be keyword arguments... writers/readers other than Unicode ones will probably need different kinds of parameters (such as the key in the above example). Ahem, ...I'm getting distracted here :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From bwarsaw at cnri.reston.va.us Thu Nov 18 17:23:41 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <14388.8997.703108.401808@weyr.cnri.reston.va.us> Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us> >>>>> "Fred" == Fred L Drake, Jr <fdrake at acm.org> writes: Fred> Er, I should note that the sample code I just sent makes Fred> use of string methods. ;) Yay! From guido at CNRI.Reston.VA.US Thu Nov 18 17:37:08 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:37:08 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100." <383427ED.45A01BBB@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us> > Like a path of search functions ? Not a bad idea... I will still > want the internal dict for caching purposes though. 
I'm not sure > how often these encodings will be, but even a few hundred function > call will slow down the Unicode implementation quite a bit. Of course. (It's like sys.modules caching the results of an import). [...] > def flush(self): > > """ Flushed the codec buffers used for keeping state. > > Returns values are not defined. Implementations are free to > return None, raise an exception (in case there is pending > data in the buffers which could not be decoded) or > return any remaining data from the state buffers used. > > """ I don't know where this came from, but a flush() should work like flush() on a file. It doesn't return a value, it just sends any remaining data to the underlying stream (for output). For input it shouldn't be supported at all. The idea is that flush() should do the same to the encoder state that close() followed by a reopen() would do. Well, more or less. But if the process were to be killed right after a flush(), the data written to disk should be a complete encoding, and not have a lingering shift state. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 17:59:06 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:59:06 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100." <3833BDDC.7CD2CC1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> <3833BDDC.7CD2CC1F@lemburg.com> Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us> [Responding to some lingering mails] [/F] > > >>> u = unicode("? i ?a ? e ?", "iso-latin-1") > > >>> s = u.encode("html-entities") > > >>> d = decoder("html-entities") > > >>> d.decode(s[:-1]) > > "? i ?a ? e " > > >>> d.flush() > > "?" [MAL] > Ah, ok. So the .flush() method checks for proper > string endings and then either returns the remaining > input or raises an error. No, please. See my previous post on flush(). > > input: read chunks of data, decode, and > > keep extra data in a local buffer. > > > > output: encode data into suitable chunks, > > and write to the output stream (that's why > > there's a buffersize argument to encode -- > > if someone writes a 10mb unicode string to > > an encoded stream, python shouldn't allocate > > an extra 10-30 megabytes just to be able to > > encode the darn thing...) > > So the stream codecs would be wrappers around the > string codecs. No -- the other way around. Think of the stream encoder as a little FSM engine that you feed with unicode characters and which sends bytes to the backend stream. When a unicode character comes in that requires a particular shift state, and the FSM isn't in that shift state, it emits the escape sequence to enter that shift state first. It should use standard buffered writes to the output stream; i.e. one call to feed the encoder could cause several calls to write() on the output stream, or vice versa (if you fed the encoder a single character it might keep it in its own buffer). That's all up to the codec implementation. The flush() forces the FSM into the "neutral" shift state, possibly writing an escape sequence to leave the current shift state, and empties the internal buffer. The string codec CONCEPTUALLY uses the stream codec to a cStringIO object, using flush() to force the final output. 
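For illustration only (this sketch is not code from the thread), a string-level encode() built on such a stream writer could look roughly like the following; it assumes only that stream_writer_factory returns an object with the write() and flush() methods described above:

    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO

    def encode_via_stream(u, stream_writer_factory):
        sio = StringIO()                     # in-memory backend stream
        writer = stream_writer_factory(sio)  # the little FSM engine
        writer.write(u)                      # may buffer or emit shift/escape sequences
        writer.flush()                       # force the neutral shift state
        return sio.getvalue()                # the complete encoding
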
However the implementation may take a shortcut. For stateless encodings the stream codec may call on the string codec, but that's all an implementation issue. For input, things are slightly different (you don't know how much encoded data you must read to give you N Unicode characters, so you may have to make a guess and hold on to some data that you read unnecessarily -- either in encoded form or in Unicode form, at the discretion of the implementation. Using seek() on the input stream is forbidden (it could be a pipe or socket). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 18:11:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 12:11:51 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100." <3833C952.C6F154B1@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us> > > > Now how should we define ur"abc\u1234\n" ... ? > > > > If strings carried an encoding tag with them, the obvious answer is that > > this acts exactly like r"abc\u1234\n" acts today except gets a > > "unicode-escaped" encoding tag instead of a "[whatever the default is > > today]" encoding tag. > > > > If strings don't carry an encoding tag with them, you're in a bit of a > > pickle: you'll have to convert it to a regular string or a Unicode string, > > but in either case have no way to communicate that it may need further > > processing; i.e., no way to distinguish it from a regular or Unicode string > > produced by any other mechanism. The code I posted yesterday remains my > > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > > fiddling with backslashes just enough to get the \u escapes expanded, in the > > same way Java's (conceptual) preprocessor does it). > > They don't have such tags... so I guess we're in trouble ;-) > > I guess to make ur"" have a meaning at all, we'd need to go > the Java preprocessor way here, i.e. scan the string *only* > for \uXXXX sequences, decode these and convert the rest as-is > to Unicode ordinals. > > Would that be ok ? Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip at mojam.com Thu Nov 18 18:34:51 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <14388.14523.158050.594595@dolphin.mojam.com> >> FYI, the next version of the proposal ... File objects opened in >> text mode will use "t#" and binary ones use "s#". Tim> Am I the only one who sees magical distinctions between text and Tim> binary mode as a Really Bad Idea? No. Tim> I wouldn't have guessed the Unix natives here would quietly Tim> acquiesce to importing a bit of Windows madness <wink>. We figured you and Guido would come to our rescue... 
;-) Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mal at lemburg.com Thu Nov 18 19:15:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:15:54 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> Message-ID: <3834425A.8E9C3B7E@lemburg.com> FYI, I've uploaded a new version of the proposal which includes new codec APIs, a new codec search mechanism and some minor fixes here and there. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? Unicode objects support for %-formatting ? Design of the internal C API and the Python API for the Unicode character properties database -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 19:32:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:32:49 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <38344651.960878A2@lemburg.com> Guido van Rossum wrote: > > > I guess to make ur"" have a meaning at all, we'd need to go > > the Java preprocessor way here, i.e. scan the string *only* > > for \uXXXX sequences, decode these and convert the rest as-is > > to Unicode ordinals. > > > > Would that be ok ? > > Read Tim's code (posted about 40 messages ago in this list). I did, but wasn't sure whether he was argueing for going the Java way... > Like Java, it interprets \u.... when the number of backslashes is odd, > but not when it's even. So \\u.... returns exactly that, while > \\\u.... returns two backslashes and a unicode character. > > This is nice and can be done regardless of whether we are going to > interpret other \ escapes or not. So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 19:38:41 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:38:41 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <383447B1.1B7B594C@lemburg.com> Would this definition be fine ? """ u = ur'<raw-unicode-escape encoded Python string>' The 'raw-unicode-escape' encoding is defined as follows: ? \uXXXX sequence represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd ? all other characters represent themselves as Unicode ordinal (e.g. 
'b' -> U+0062) """ -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Thu Nov 18 19:46:35 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:46:35 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST." <14388.14523.158050.594595@dolphin.mojam.com> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> <14388.14523.158050.594595@dolphin.mojam.com> Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us> > >> FYI, the next version of the proposal ... File objects opened in > >> text mode will use "t#" and binary ones use "s#". > > Tim> Am I the only one who sees magical distinctions between text and > Tim> binary mode as a Really Bad Idea? > > No. > > Tim> I wouldn't have guessed the Unix natives here would quietly > Tim> acquiesce to importing a bit of Windows madness <wink>. > > We figured you and Guido would come to our rescue... ;-) Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 19:50:20 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:50:20 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100." <38344651.960878A2@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <38344651.960878A2@lemburg.com> Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us> > > Like Java, it interprets \u.... when the number of backslashes is odd, > > but not when it's even. So \\u.... returns exactly that, while > > \\\u.... returns two backslashes and a unicode character. > > > > This is nice and can be done regardless of whether we are going to > > interpret other \ escapes or not. > > So I'll take that as: this is what we want in Python too :-) I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 19:57:36 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:57:36 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100." <383447B1.1B7B594C@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <383447B1.1B7B594C@lemburg.com> Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us> > Would this definition be fine ? > """ > > u = ur'<raw-unicode-escape encoded Python string>' > > The 'raw-unicode-escape' encoding is defined as follows: > > ? \uXXXX sequence represent the U+XXXX Unicode character if and > only if the number of leading backslashes is odd > > ? 
all other characters represent themselves as Unicode ordinal > (e.g. 'b' -> U+0062) > > """ Yes. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip at mojam.com Thu Nov 18 20:09:46 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST) Subject: [Python-Dev] Unicode Proposal: Version 0.7 In-Reply-To: <3834425A.8E9C3B7E@lemburg.com> References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <14388.20218.294814.234327@dolphin.mojam.com> I haven't been following this discussion closely at all, and have no previous experience with Unicode, so please pardon a couple stupid questions from the peanut gallery: 1. What does U+0061 mean (other than 'a')? That is, what is U? 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter description. Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2? Seems like I would do something like u1 = unicode(s, encoding=e1) f = open("somefile", "wb") u2 = unicode(u1, encoding=e2) f.write(u2) Is that how it would be done? Does this question even make sense? 3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)? Thx, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim at interet.com Thu Nov 18 20:23:53 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 14:23:53 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <38345249.4AFD91DA@interet.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Yes. > Are these > requirements reasonable? Will they make an implementation too > complex? I think you can get 90% of where you want to be with something much simpler. And the simpler implementation will be useful in the 100% solution, so it is not wasted time. How about if we just design a Python archive file format; provide code in the core (in Python or C) to import from it; provide a Python program to create archive files; and provide a Standard Directory to put archives in so they can be found quickly. For extensibility and control, we add functions to the imp module. Detailed comments follow: > Compatibility issues: > --------------------- > [list of current features...] Easily met by keeping the current C code. > > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) > > - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) These tools go well beyond just an archive file format, but hopefully a file format will help. Greg and Gordon should be able to control the format so it meets their needs. We need a standard format. 
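For illustration only (not part of the proposal itself), the "Python program to create archive files" mentioned above could be as small as the following sketch; it assumes the archive is a plain zip file with compiled modules stored flat at the top level, and it assumes a zipfile-style module is available, which the current distribution does not ship:

    import os
    import zipfile

    def build_archive(libdir, archive_path):
        # Store every compiled module found in libdir flat at the top of
        # the archive, so lookup is a plain name, not a directory search.
        zf = zipfile.ZipFile(archive_path, "w")
        for name in os.listdir(libdir):
            if name[-4:] == ".pyc":
                zf.write(os.path.join(libdir, name), name)
        zf.close()
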
> - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages > > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) I don't like sys.path at all. It is currently part of the problem. I suggest that archive files MUST be put into a known directory. On Windows this is the directory of the executable, sys.executable. On Unix this $PREFIX plus version, namely "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). Other platforms can have different rules. We should also have the ability to append archive files to the executable or a shared library assuming the OS allows this (Windows and Linux do allow it). This is the first location searched, nails the archive to the interpreter, insulates us from an erroneous sys.path, and enables single-file Python programs. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. We don't need compression. The whole ./Lib is 1.2 Meg, and if we compress it to zero we save a Meg. Irrelevant. Installers provide compression anyway so when Python programs are shipped, they will be compressed then. Problems are that Python does not ship with compression, we will have to add it, we will have to support it and its current method of compression forever, and it adds complexity. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > [ List of new features including hooks...] Sigh, this proposal does not provide for this. It seems like a job for imputil. But if the file format and import code is available from the imp module, it can be used as part of the solution. > - support for a new compression scheme to the zip importer I guess compression should be easy to add if Python ships with a compression module. > - a cache for file locations in directories/archives, to improve > startup time If the Python library is available as an archive, I think startup will be greatly improved anyway. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. Yes. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. That's a good reason to omit compression. At least for now. > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). Yes, except that we need to be careful to preserve the freeze feature for users. We don't want to take it over. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. 
Yes, we need a function in imp to turn archives off: import imp imp.archiveEnable(0) > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I don't think it impacts these at all. Jim Ahlstrom From guido at CNRI.Reston.VA.US Thu Nov 18 20:55:02 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 14:55:02 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST." <38345249.4AFD91DA@interet.com> References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us> > I think you can get 90% of where you want to be with something > much simpler. And the simpler implementation will be useful in > the 100% solution, so it is not wasted time. Agreed, but I'm not sure that it addresses the problems that started this thread. I can't really tell, since the message starting the thread just requested imputil, without saying which parts of it were needed. A followup claimed that imputil was a fine prototype but too slow for real work. I inferred that flexibility was requested. But maybe that was projection since that was on my own list. (I'm happy with the performance and find manipulating zip or jar files clumsy, so I'm not too concerned about all the nice things you can *do* with that flexibility. :-) > How about if we just design a Python archive file format; provide > code in the core (in Python or C) to import from it; provide a > Python program to create archive files; and provide a Standard > Directory to put archives in so they can be found quickly. For > extensibility and control, we add functions to the imp module. > Detailed comments follow: > These tools go well beyond just an archive file format, but hopefully > a file format will help. Greg and Gordon should be able to control the > format so it meets their needs. We need a standard format. I think the standard format should be a subclass of zip or jar (which is itself a subclass of zip). We have already written (at CNRI, as yet unreleased) the necessary Python tools to manipulate zip archives; moreover 3rd party tools are abundantly available, both on Unix and on Windows (as well as in Java). Zip files also lend themselves to self-extracting archives and similar things, because the file index is at the end, so I think that Greg & Gordon should be happy. > I don't like sys.path at all. It is currently part of the problem. Eh? That's the first thing I hear something bad about it. Maybe that's because you live on Windows -- on Unix, search paths are ubiquitous. > I suggest that archive files MUST be put into a known directory. Why? Maybe this works on Windows; on Unix this is asking for trouble because it prevents users from augmenting the installation provided by the sysadmin. Even on newer Windows versions, users without admin perms may not be allowed to add files to that privileged directory. > On Windows this is the directory of the executable, sys.executable. > On Unix this $PREFIX plus version, namely > "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). > Other platforms can have different rules. 
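To make the idea concrete (this sketch is not the CNRI tools mentioned above), importing a module straight out of such a zip archive could look roughly like this; it assumes a zipfile-style reader and a flat archive of .py sources, and it ignores packages, .pyc files and error handling:

    import imp
    import sys
    import zipfile

    def import_from_zip(archive_path, modname):
        zf = zipfile.ZipFile(archive_path)
        source = zf.read(modname + ".py") + "\n"   # flat archive: top-level names only
        module = imp.new_module(modname)
        module.__file__ = archive_path + "/" + modname + ".py"
        sys.modules[modname] = module              # register first, as import does
        exec compile(source, module.__file__, "exec") in module.__dict__
        return module
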
> > We should also have the ability to append archive files to the > executable or a shared library assuming the OS allows this > (Windows and Linux do allow it). This is the first location > searched, nails the archive to the interpreter, insulates us > from an erroneous sys.path, and enables single-file Python programs. OK for the executable. I'm not sure what the point is of appending an archive to the shared library? Anyway, does it matter (on Windows) if you add it to python16.dll or to python.exe? > We don't need compression. The whole ./Lib is 1.2 Meg, and if we > compress > it to zero we save a Meg. Irrelevant. Installers provide compression > anyway so when Python programs are shipped, they will be compressed > then. > > Problems are that Python does not ship with compression, we will > have to add it, we will have to support it and its current method > of compression forever, and it adds complexity. OK, OK. I think most zip tools have a way to turn off the compression. (Anyway, it's a matter of more I/O time vs. more CPU time; hardare for both is getting better faster than we can tweak the code :-) > Sigh, this proposal does not provide for this. It seems > like a job for imputil. But if the file format and import code > is available from the imp module, it can be used as part of the > solution. Well, the question is really if we want flexibility or archive files. I care more about the flexibility. If we get a clear vote for archive files, I see no problem with implementing that first. > If the Python library is available as an archive, I think > startup will be greatly improved anyway. Really? I know about all the system calls it makes, but I don't really see much of a delay -- I have a prompt in well under 0.1 second. --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Thu Nov 18 23:03:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <3833B588.1E31F01B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > [MAL] > > > File objects opened in text mode will use "t#" and binary > > > ones use "s#". > > > > [Greg Stein] > > > ... > > > The real annoying thing would be to assume that opening a file as 'r' > > > means that I *meant* text mode and to start using "t#". > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > either; a lone "r" has always meant text mode. > > Em, I think you've got something wrong here: "t#" refers to the > parsing marker used for writing data to files opened in text mode. Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker. >... > I guess you won't notice any difference: strings define both > interfaces ("s#" and "t#") to mean the same thing. Only other > buffer compatible types may now fail to write to text files > -- which is not so bad, because it forces the programmer to > rethink what he really intended when opening the file in text > mode. It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down. > Besides, if you are writing portable scripts you should pay > close attention to "r" vs. "rb" anyway. I'm not writing portable scripts. I mentioned that once before. 
I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that repond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introduce text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Thu Nov 18 23:15:43 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> References: <3833B588.1E31F01B@lemburg.com> <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <14388.31375.296388.973848@dolphin.mojam.com> Greg> I'm not writing portable scripts. I mentioned that once before. I Greg> don't want a difference between 'r' and 'rb' on my Linux box. It Greg> was never there before, I'm lazy, and I don't want to see it added Greg> :-). ... Greg> I just don't like introduce text/binary to *nix platforms where it Greg> didn't exist before. I'll vote with Greg, Guido's cross-platform conversion not withstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other harder to identify and correct portability issues than binary file access anyway. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability. porta-nanny.py-anyone?-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim at interet.com Thu Nov 18 23:40:05 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 17:40:05 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us> Message-ID: <38348045.BB95F783@interet.com> Guido van Rossum wrote: > I think the standard format should be a subclass of zip or jar (which > is itself a subclass of zip). We have already written (at CNRI, as > yet unreleased) the necessary Python tools to manipulate zip archives; > moreover 3rd party tools are abundantly available, both on Unix and on > Windows (as well as in Java). Zip files also lend themselves to > self-extracting archives and similar things, because the file index is > at the end, so I think that Greg & Gordon should be happy. Think about multiple packages in multiple zip files. The zip files store file directories. That means we would need a sys.zippath to search the zip files. I don't want another PYTHONPATH phenomenon. Greg Stein and I once discussed this (and Gordon I think). They argued that the directories should be flattened. That is, think of all directories which can be reached on PYTHONPATH. Throw away all initial paths. The resultant archive has *.pyc at the top level, as well as package directories only. The search path is "." in every archive file. No directory information is stored, only module names, some with dots. > > I don't like sys.path at all. It is currently part of the problem. > > Eh? 
That's the first thing I hear something bad about it. Maybe > that's because you live on Windows -- on Unix, search paths are > ubiquitous. On windows, just print sys.path. It is junk. A commercial distribution has to "just work", and it fails if a second installation (by someone else) changes PYTHONPATH to suit their app. I am trying to get to "just works", no excuses, no complications. > > I suggest that archive files MUST be put into a known directory. > > Why? Maybe this works on Windows; on Unix this is asking for trouble > because it prevents users from augmenting the installation provided by > the sysadmin. Even on newer Windows versions, users without admin > perms may not be allowed to add files to that privileged directory. It works on Windows because programs install themselves in their own subdirectories, and can put files there instead of /windows/system32. This holds true for Windows 2000 also. A Unix-style installation to /windows/system32 would (may?) require "administrator" privilege. On Unix you are right. I didn't think of that because I am the Unix sysadmin here, so I can put things where I want. The Windows solution doesn't fit with Unix, because executables go in a ./bin directory and putting library files there is a no-no. Hmmmm... This needs more thought. Anyone else have ideas?? > > We should also have the ability to append archive files to the > > executable or a shared library assuming the OS allows this > > OK for the executable. I'm not sure what the point is of appending an > archive to the shared library? Anyway, does it matter (on Windows) if > you add it to python16.dll or to python.exe? The point of using python16.dll is to append the Python library to it, and append to python.exe (or use files) for everything else. That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading to 1.7 means replacing only one file, and there is no wasted storage in multiple Lib's. I am thinking of multiple Python programs in different directories. But maybe you are right. On Windows, if python.exe can be put in /windows/system32 then it really doesn't matter. > OK, OK. I think most zip tools have a way to turn off the > compression. (Anyway, it's a matter of more I/O time vs. more CPU > time; hardare for both is getting better faster than we can tweak the > code :-) Well, if Python now has its own compression that is built in and comes with it, then that is different. Maybe compression is OK. > Well, the question is really if we want flexibility or archive files. > I care more about the flexibility. If we get a clear vote for archive > files, I see no problem with implementing that first. I don't like flexibility, I like standardization and simplicity. Flexibility just encourages users to do the wrong thing. Everyone vote please. I don't have a solid feeling about what people want, only what they don't like. > > If the Python library is available as an archive, I think > > startup will be greatly improved anyway. > > Really? I know about all the system calls it makes, but I don't > really see much of a delay -- I have a prompt in well under 0.1 > second. So do I. I guess I was just echoing someone else's complaint. JimA From mal at lemburg.com Fri Nov 19 00:28:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 00:28:31 +0100 Subject: [Python-Dev] file modes (was: just say no...) References: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <38348B9F.A31B09C4@lemburg.com> Greg Stein wrote: > > On Thu, 18 Nov 1999, M.-A. 
Lemburg wrote: > > Tim Peters wrote: > > > [MAL] > > > > File objects opened in text mode will use "t#" and binary > > > > ones use "s#". > > > > > > [Greg Stein] > > > > ... > > > > The real annoying thing would be to assume that opening a file as 'r' > > > > means that I *meant* text mode and to start using "t#". > > > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > > either; a lone "r" has always meant text mode. > > > > Em, I think you've got something wrong here: "t#" refers to the > > parsing marker used for writing data to files opened in text mode. > > Nope. We've got it right :-) > > Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to > refer to the parse marker. Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-) > >... > > I guess you won't notice any difference: strings define both > > interfaces ("s#" and "t#") to mean the same thing. Only other > > buffer compatible types may now fail to write to text files > > -- which is not so bad, because it forces the programmer to > > rethink what he really intended when opening the file in text > > mode. > > It *is* bad if it breaks my existing programs in subtle ways that are a > bitch to track down. > > > Besides, if you are writing portable scripts you should pay > > close attention to "r" vs. "rb" anyway. > > I'm not writing portable scripts. I mentioned that once before. I don't > want a difference between 'r' and 'rb' on my Linux box. It was never there > before, I'm lazy, and I don't want to see it added :-). > > Honestly, I don't know offhand of any Python types that repond to "s#" and > "t#" in different ways, such that changing file.write would end up writing > something different (and thereby breaking existing code). > > I just don't like introduce text/binary to *nix platforms where it didn't > exist before. Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#" so you wont see any change in function or find subtle errors being introduced. If you are already using the buffer feature for e.g. array which also implement "s#" but don't support "t#" for obvious reasons you'll run into trouble, but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't published yet, there are probably very few users out there who would actually run into any problems. And even if they do, its a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 00:41:32 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Fri, 19 Nov 1999 00:41:32 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com> Message-ID: <38348EAC.82B41A4D@lemburg.com> Skip Montanaro wrote: > > I haven't been following this discussion closely at all, and have no > previous experience with Unicode, so please pardon a couple stupid questions > from the peanut gallery: > > 1. What does U+0061 mean (other than 'a')? That is, what is U? U+XXXX means Unicode character with ordinal hex number XXXX. It is basically just another way to say, hey I want the Unicode character at position 0xXXXX in the Unicode spec. > 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter > description. Given a Unicode object with encoding e1, how do I write > it to a file that is to be encoded with encoding e2? Seems like I > would do something like > > u1 = unicode(s, encoding=e1) > f = open("somefile", "wb") > u2 = unicode(u1, encoding=e2) > f.write(u2) > > Is that how it would be done? Does this question even make sense? The unicode() constructor converts all input to Unicode as basis for other conversions. In the above example, s would be converted to Unicode using the assumption that the bytes in s represent characters encoded using the encoding given in e1. The line with u2 would raise a TypeError, because u1 is not a string. To convert a Unicode object u1 to another encoding, you would have to call the .encode() method with the intended new encoding. The Unicode object will then take care of the conversion of its internal Unicode data into a string using the given encoding, e.g. you'd write: f.write(u1.encode(e2)) > 3. What will the impact be on programmers such as myself currently > living with blinders on (that is, writing in plain old 7-bit ASCII)? If you don't want your scripts to know about Unicode, nothing will really change. In case you do use e.g. Latin-1 characters in your scripts for strings, you are asked to include a pragma in the comment lines at the beginning of the script (so that programmers viewing your code using other encoding have a chance to figure out what you've written). Here's the text from the proposal: """ Note that you should provide some hint to the encoding you used to write your programs as pragma line in one the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read you source strings too. """ Other than that you can continue to use normal strings like you always have. Hope that clarifies things at least a bit, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Fri Nov 19 01:27:09 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 19 Nov 1999 11:27:09 +1100 Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <38348B9F.A31B09C4@lemburg.com> Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat> [MAL] > If you are already using the buffer feature for e.g. 
array which > also implement "s#" but don't support "t#" for obvious reasons > you'll run into trouble, but then: arrays are binary data, > so changing from text mode to binary mode is well worth the > effort even if you just consider it a nuisance. Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. eg, the logic for processing "t#" could check if the text buffer is supported, and if not move back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users dont bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from simply "portable among all platforms" to "portable among all platforms and objects". Mark. From gmcm at hypernet.com Fri Nov 19 03:23:44 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 21:23:44 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <38348045.BB95F783@interet.com> Message-ID: <1269144272-21594530@hypernet.com> [Guido] > > I think the standard format should be a subclass of zip or jar > > (which is itself a subclass of zip). We have already written > > (at CNRI, as yet unreleased) the necessary Python tools to > > manipulate zip archives; moreover 3rd party tools are > > abundantly available, both on Unix and on Windows (as well as > > in Java). Zip files also lend themselves to self-extracting > > archives and similar things, because the file index is at the > > end, so I think that Greg & Gordon should be happy. No problem (I created my own formats for relatively minor reasons). [JimA] > Think about multiple packages in multiple zip files. The zip > files store file directories. That means we would need a > sys.zippath to search the zip files. I don't want another > PYTHONPATH phenomenon. What if sys.path looked like: [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] > Greg Stein and I once discussed this (and Gordon I think). They > argued that the directories should be flattened. That is, think > of all directories which can be reached on PYTHONPATH. Throw > away all initial paths. The resultant archive has *.pyc at the > top level, as well as package directories only. The search path > is "." in every archive file. No directory information is > stored, only module names, some with dots. While I do flat archives (no dots, but that's a different story), there's no reason the archive couldn't be structured. Flat archives are definitely simpler. [JimA] > > > I don't like sys.path at all. It is currently part of the > > > problem. [Guido] > > Eh? That's the first thing I hear something bad about it. > > Maybe that's because you live on Windows -- on Unix, search > > paths are ubiquitous. > > On windows, just print sys.path. It is junk. A commercial > distribution has to "just work", and it fails if a second > installation (by someone else) changes PYTHONPATH to suit their > app. 
I am trying to get to "just works", no excuses, no > complications. Py_Initialize (); PyRun_SimpleString ("import sys; del sys.path[1:]"); Yeah, there's a hole there. Fixable if you could do a little pre- Py_Initialize twiddling. > > > I suggest that archive files MUST be put into a known > > > directory. No way. Hard code a directory? Overwrite someone else's Python "standalone"? Write to a C: partition that is deliberately sized to hold nothing but Windows? Make network installations impossible? > > Why? Maybe this works on Windows; on Unix this is asking for > > trouble because it prevents users from augmenting the > > installation provided by the sysadmin. Even on newer Windows > > versions, users without admin perms may not be allowed to add > > files to that privileged directory. > > It works on Windows because programs install themselves in their > own subdirectories, and can put files there instead of > /windows/system32. This holds true for Windows 2000 also. A > Unix-style installation to /windows/system32 would (may?) require > "administrator" privilege. There's nothing Unix-style about installing to /Windows/system32. 'Course *they* have symbolic links that actually work... > On Unix you are right. I didn't think of that because I am the > Unix sysadmin here, so I can put things where I want. The > Windows solution doesn't fit with Unix, because executables go in > a ./bin directory and putting library files there is a no-no. > Hmmmm... This needs more thought. Anyone else have ideas?? The official Windows solution is stuff in registry about app paths and such. Putting the dlls in the exe's directory is a workaround which works and is more managable than the official solution. > > > We should also have the ability to append archive files to > > > the executable or a shared library assuming the OS allows > > > this That's a handy trick on Windows, but it's got nothing to do with Python. > > Well, the question is really if we want flexibility or archive > > files. I care more about the flexibility. If we get a clear > > vote for archive files, I see no problem with implementing that > > first. > > I don't like flexibility, I like standardization and simplicity. > Flexibility just encourages users to do the wrong thing. I've noticed that the people who think there should only be one way to do things never agree on what it is. > Everyone vote please. I don't have a solid feeling about > what people want, only what they don't like. Flexibility. You can put Christian's favorite Einstein quote here too. > > > If the Python library is available as an archive, I think > > > startup will be greatly improved anyway. > > > > Really? I know about all the system calls it makes, but I > > don't really see much of a delay -- I have a prompt in well > > under 0.1 second. > > So do I. I guess I was just echoing someone else's complaint. Install some stuff. Deinstall some of it. Repeat (mixing up the order) until your registry and hard drive are shattered into tiny little fragments. It doesn't take long (there's lots of stuff a defragmenter can't touch once it's there). - Gordon From mal at lemburg.com Fri Nov 19 10:08:44 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:08:44 +0100 Subject: [Python-Dev] file modes (was: just say no...) References: <003401bf3224$d231be30$0501a8c0@bobcat> Message-ID: <3835139C.344F3EEE@lemburg.com> Mark Hammond wrote: > > [MAL] > > > If you are already using the buffer feature for e.g. 
array which > > also implement "s#" but don't support "t#" for obvious reasons > > you'll run into trouble, but then: arrays are binary data, > > so changing from text mode to binary mode is well worth the > > effort even if you just consider it a nuisance. > > Breaking existing code that works should be considered more than a > nuisance. Its an error that pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only effect platforms that don't differ between text and binary mode, e.g. Unix ones. > However, one answer would be to have "t#" _prefer_ to use the text > buffer, but not insist on it. eg, the logic for processing "t#" could > check if the text buffer is supported, and if not move back to the > blob buffer. I doubt that this is conform to what the buffer interface want's to reflect: if the getcharbuf slot is not implemented this means "I am not text". If you would write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'. > This should mean that all existing code still works, except for > objects that support both buffers to mean different things. AFAIK > there are no objects that qualify today, so it should work fine. Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this. > Unix users _will_ need to revisit their thinking about "text mode" vs > "binary mode" when writing these new objects (such as Unicode), but > IMO that is more than reasonable - Unix users dont bother qualifying > the open mode of their files, simply because it has no effect on their > files. If for certain objects or requirements there _is_ a > distinction, then new code can start to think these issues through. > "Portable File IO" will simply be extended from simply "portable among > all platforms" to "portable among all platforms and objects". Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 10:56:03 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:56:03 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us> Message-ID: <38351EB3.153FCDFC@lemburg.com> Guido van Rossum wrote: > > > Like a path of search functions ? Not a bad idea... I will still > > want the internal dict for caching purposes though. I'm not sure > > how often these encodings will be, but even a few hundred function > > call will slow down the Unicode implementation quite a bit. > > Of course. (It's like sys.modules caching the results of an import). I've fixed the "path of search functions" approach in the latest version of the spec. > [...] > > def flush(self): > > > > """ Flushed the codec buffers used for keeping state. > > > > Returns values are not defined. 
Implementations are free to > > return None, raise an exception (in case there is pending > > data in the buffers which could not be decoded) or > > return any remaining data from the state buffers used. > > > > """ > > I don't know where this came from, but a flush() should work like > flush() on a file. It came from Fredrik's proposal. > It doesn't return a value, it just sends any > remaining data to the underlying stream (for output). For input it > shouldn't be supported at all. > > The idea is that flush() should do the same to the encoder state that > close() followed by a reopen() would do. Well, more or less. But if > the process were to be killed right after a flush(), the data written > to disk should be a complete encoding, and not have a lingering shift > state. Ok. I've modified the API as follows: StreamWriter: def flush(self): """ Flushes and resets the codec buffers used for keeping state. Calling this method should ensure that the data on the output is put into a clean state, that allows appending of new fresh data without having to rescan the whole stream to recover state. """ pass StreamReader: def read(self,chunksize=0): """ Decodes data from the stream self.stream and returns a tuple (Unicode object, bytes consumed). chunksize indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value 0 indicates to read and decode as much as possible. The chunksize is intended to prevent having to decode huge files in one step. The method should use a greedy read strategy meaning that it should read as much data as is allowed within the definition of the encoding and the given chunksize, e.g. if optional encoding endings or state markers are available on the stream, these should be read too. """ ... the base class should provide a default implementation of this method using self.decode ... def reset(self): """ Resets the codec buffers used for keeping state. Note that no stream repositioning should take place. This method is primarely intended to recover from decoding errors. """ pass The .reset() method replaces the .flush() method on StreamReaders. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 10:22:48 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:22:48 +0100 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <383516E8.EE66B527@lemburg.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? Since you were asking: I would like functionality equivalent to my latest import patch for a slightly different lookup scheme for module import inside packages to become a core feature. 
If it becomes a core feature I promise to never again start threads about relative imports :-) Here's the summary again: """ [The patch] changes the default import mechanism to work like this: >>> import d # from directory a/b/c/ try a.b.c.d try a.b.d try a.d try d fail instead of just doing the current two-level lookup: >>> import d # from directory a/b/c/ try a.b.c.d try d fail As a result, relative imports referring to higher level packages work out of the box without any ugly underscores in the import name. Plus the whole scheme is pretty simple to explain and straightforward. """ You can find the patch attached to the message "Walking up the package hierarchy" in the python-dev mailing list archive. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Fri Nov 19 14:01:04 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <19991119130104.21726.rocketmail@ web605.yahoomail.com> --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > Guido van Rossum wrote: > > I don't know where this came from, but a flush() > should work like > > flush() on a file. > > It came from Fredrik's proposal. > > > It doesn't return a value, it just sends any > > remaining data to the underlying stream (for > output). For input it > > shouldn't be supported at all. > > > > The idea is that flush() should do the same to the > encoder state that > > close() followed by a reopen() would do. Well, > more or less. But if > > the process were to be killed right after a > flush(), the data written > > to disk should be a complete encoding, and not > have a lingering shift > > state. > This could be useful in real life. For example, iso-2022-jp has a 'single-byte-mode' and a 'double-byte-mode' with shift-sequences to separate them. The rule is that each line in the text file or email message or whatever must begin and end in single-byte mode. So I would take flush() to mean 'shift back to ASCII now'. Calling flush and reopen would thus "almost" get the same data across. I'm trying to think if it would be dangerous. Do web and ftp servers often call flush() in the middle of transmitting a block of text? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Fri Nov 19 14:33:50 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 19 Nov 1999 14:33:50 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <19991119130104.21726.rocketmail@ web605.yahoomail.com> Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com> Andy Robinson <captainrobbo at yahoo.com> wrote: > So I would take flush() to mean 'shift back to > ASCII now'. if we're still talking about my "just one codec, please" proposal, that's exactly what encoder.flush should do. while decoder.flush should raise an ex- ception if you're still in double byte mode (at least if running in 'strict' mode). > Calling flush and reopen would thus "almost" get the > same data across. > > I'm trying to think if it would be dangerous. 
Do web > and ftp servers often call flush() in the middle of > transmitting a block of text? again, if we're talking about my proposal, these flush methods are only called by the string or stream wrappers, never by the applications. see the original post for de- tails. </F> From gstein at lyra.org Fri Nov 19 14:29:50 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, Guido van Rossum wrote: > Gordon McMillan wrote: >... > > I think imputil's emulation of the builtin importer is more of a > > demonstration than a serious implementation. As for speed, it > > depends on the test. > > Agreed. I like some of imputil's features, but I think the API > need to be redesigned. It what ways? It sounds like you've applied some thought. Do you have any concrete ideas yet, or "just a feeling" :-) I'm working through some changes from JimA right now, and would welcome other suggestions. I think there may be some outstanding stuff from MAL, but I'm not sure (Marc?) >... > So here's a challenge: redesign the import API from scratch. I would suggest starting with imputil and altering as necessary. I'll use that viewpoint below. > Let me start with some requirements. > > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility layers > can be provided in pure Python Which APIs are you referring to? The "imp" module? The C functions? The __import__ and reload builtins? I'm guessing some of imp, the two builtins, and only one or two C functions. > - support for rexec functionality No problem. I can think of a number of ways to do this. > - support for freeze functionality No problem. A function in "imp" must be exposed to Python to support this within the imputil framework. > - load .py/.pyc/.pyo files and shared libraries from files No problem. Again, a function is needed for platform-specific loading of shared libraries. > - support for packages No problem. Demo's in current imputil. > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning I would suggest that both retain their *exact* meaning. We introduce sys.importers -- a list of importers to check, in sequence. The first importer on that list uses sys.path to look for and load modules. The second importer loads builtins and frozen code (i.e. modules not on sys.path). Users can insert/append new importers or alter sys.path as before. sys.modules continues to record name:module mappings. > - $PYTHONPATH and $PYTHONHOME should still be supported No problem. > (I wouldn't mind a splitting up of importdl.c into several > platform-specific files, one of which is chosen by the configure > script; but that's a bit of a separate issue.) Easy enough. The standard importer can select the appropriate platform-specific module/function to perform the load. i.e. these can move to Modules/ and be split into a module-per-platform. > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) I don't know the specific requirements/functionality that would be required here (does Greg? :-), but I can't imagine any problem with this. 
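The sys.importers idea, reduced to a rough Python sketch. The Importer interface here is a simplified, hypothetical reduction of imputil's single get_code() hook, and BuiltinImporter cheats by delegating to the existing __import__; treat it as an illustration of the chain, not the proposed API:

    import sys, types

    class Importer:
        def get_code(self, name):
            # Return a module or code object for "name", or None if this
            # importer does not handle it.
            return None

    class BuiltinImporter(Importer):
        def get_code(self, name):
            if name in sys.builtin_module_names:
                return __import__(name)      # builtins are cheap to load directly
            return None

    def import_via_chain(name, importers):
        # First importer that returns something wins; later ones never
        # see the request.
        if name in sys.modules:
            return sys.modules[name]
        for imp_obj in importers:
            result = imp_obj.get_code(name)
            if result is None:
                continue
            if not isinstance(result, types.ModuleType):
                module = types.ModuleType(name)
                exec(result, module.__dict__)    # result was a code object
                result = module
            sys.modules[name] = result
            return result
        raise ImportError("no importer could load " + name)

Under the ordering described above, sys.importers would start out as something like [SysPathImporter(), BuiltinImporter()], and applications could insert or append their own entries much as they manipulate sys.path today.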
> - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) Um. *No* problem. :-) > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages While this could easily be done, I might argue against it. Old apps/modules that process sys.path might get confused. If compatibility is not an issue, then "No problem." An alternative would be an Importer instance added to sys.importers that is configured for a specific archive (in other words, don't add the zip file to sys.path, add ZipImporter(file) to sys.importers). Another alternative is an Importer that looks at a "sys.py_archives" list. Or an Importer that has a py_archives instance attribute. > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) No problem. This will slow things down, as a stat() for *.zip and/or *.jar must be done, in addition to *.py, *.pyc, and *.pyo. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. I presume we would support whatever zlib gives us, and no more. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > - support for a new compression scheme to the zip importer Presuming ZipImporter is a class (derived from Importer), then this ability is wholly dependent upon the author of ZipImporter providing the hook. The Importer class is already designed for subclassing (and its interface is very narrow, which means delegation is also *very* easy; see imputil.FuncImporter). > - support for a new archive format, e.g. tar A cakewalk. Gordon, JimA, and myself each have archive formats. :-) > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) No problem at all. > - a hook that imports from compressed .py or .pyc/.pyo files No problem at all. > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) No problem at all. > - a cache for file locations in directories/archives, to improve > startup time No problem at all. > - a completely different source of imported modules, e.g. for an > embedded system or PalmOS (which has no traditional filesystem) No problem at all. In each of the above cases, the Importer.get_code() method just needs to grab the byte codes from the XYZ data source. That data source can be cmopressed, across a network, on-the-fly generated, or whatever. Each importer can certainly create a cache based on its concept of "location". In some cases, that would be a mapping from module name to filesystem path, or to a URL, or to a compiled-in, frozen module. 
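For the "ZipImporter(file) added to sys.importers" variant, a minimal sketch using today's zipfile module for illustration (the CNRI zip tools Guido mentions were not yet released). It follows the narrow get_code() interface described above, handles only plain .py members, and ignores packages and .pyc data:

    import zipfile

    class ZipImporter:
        def __init__(self, archive):
            self.zip = zipfile.ZipFile(archive)
            self.names = {}          # module name -> archive member, cached once
            for member in self.zip.namelist():
                if member.endswith('.py'):
                    modname = member[:-3].replace('/', '.')
                    self.names[modname] = member

        def get_code(self, name):
            member = self.names.get(name)
            if member is None:
                return None          # not ours -- let the next importer try
            source = self.zip.read(member)
            return compile(source, member, 'exec')

The per-importer cache mentioned above falls out naturally: the name-to-member mapping is built once when the archive is opened.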
> - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to recognize > .spam files and automatically translate them into .py files, and you > write a hook to support a new archive format, then if both hooks are > installed together, it should be possible to find a .spam file in an > archive and do the right thing, without any extra action. Right? Ack. Very, very difficult. The imputil scheme combines the concept of locating/loading into one step. There is only one "hook" in the imputil system. Its semantic is "map this name to a code/module object and return it; if you don't have it, then return None." Your compositing example is based on the capabilities of the find-then-load paradigm of the existing "ihooks.py". One module finds something (foo.spam) and the other module loads it (by generating a .py). All is not lost, however. I can easily envision the get_code() hook as allowing any kind of return type. If it isn't a code or module object, then another hook is called to transform it. [ actually, I'd design it similarly: a *series* of hooks would be called until somebody transforms the foo.spam into a code/module object. ] The compositing would be limited ony by the (Python-based) Importer classes. For example, my ZipImporter might expect to zip up .pyc files *only*. Obviously, you would want to alter this to support zipping any file, then use the suffic to determine what to do at unzip time. > - It should be possible to write hooks in C/C++ as well as Python Use FuncImporter to delegate to an extension module. This is one of the benefits of imputil's single/narrow interface. > - Applications embedding Python may supply their own implementations, > default search path, etc., but don't have to if they want to piggyback > on an existing Python installation (even though the latter is > fraught with risk, it's cheaper and easier to understand). An application would have full control over the contents of sys.importers. For a restricted execution app, it might install an Importer that loads files from *one* directory only which is configured from a specific Win32 Registry entry. That importer could also refuse to load shared modules. The BuiltinImporter would still be present (although the app would certainly omit all but the necessary builtins from the build). Frozen modules could be excluded. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. I posited once before that the cost of import is mostly I/O rather than CPU, so using Python should not be an issue. MAL demonstrated that a good design for the Importer classes is also required. Based on this, I'm a *strong* advocate of moving as much as possible into Python (to get Python's ease-of-coding with little relative cost). The (core) C code should be able to search a path for a module and import it. It does not require dynamic loading or packages. This will be used to import exceptions.py, then imputil.py, then site.py. The platform-specific module that perform dynamic-loading must be a statically linked module (in Modules/ ... it doesn't have to be in the Python/ directory). site.py can complete the bootstrap by setting up sys.importers with the appropriate Importer instances (this is where an application can define its own policy). 
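Under that bootstrap, the tail end of site.py might look roughly like this (the importer names are the ones used in this message; exactly how the __import__ hook gets installed is glossed over):

    import sys, imputil

    sys.importers = [
        imputil.SysPathImporter(),    # scans sys.path, as today
        imputil.BuiltinImporter(),    # builtin and frozen modules
    ]
    # Whatever owns the __import__ hook from here on simply walks
    # sys.importers in order.  An application that wants a different
    # policy replaces this list before importing anything else.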
sys.path was initially set by the import.c bootstrap code (from the compiled-in path and environment variables). Note that imputil.py would not install any hooks when it is loaded. That is up to site.py. This implies the core C code will import a total of three modules using its builtin system. After that, the imputil mechanism would be importing everything (site.py would .install() an Importer which then takes over the __import__ hook). Further note that the "import" Python statement could be simplified to use only the hook. However, this would require the core importer to inject some module names into the imputil module's namespace (since it couldn't use an import statement until a hook was installed). While this simplification is "neat", it complicates the run-time system (the import statement is broken until a hook is installed). Therefore, the core C code must also support importing builtins. "sys" and "imp" are needed by imputil to bootstrap. The core importer should not need to deal with dynamic-load modules. To support frozen apps, the core importer would need to support loading the three modules as frozen modules. The builtin/frozen importing would be exposed thru "imp" for use by imputil for future imports. imputil would load and use the (builtin) platform-specific module to do dynamic-load imports. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. Yes. I don't see this as a requirement, though. We wouldn't start to use these by default, would we? Or insist on zlib being present? I see this as more along the lines of "we have provided a standardized Importer to do this, *provided* you have zlib support." > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). The bootstrap that I outlined above could be done in C code. The import code would be stripped down dramatically because you'll drop package support and dynamic loading. Alternatively, you could probably do the path-scanning in Python and freeze that into the interpreter. Personally, I don't like this idea as it would not buy you much at all (it would still need to return to C for accessing a number of scanning functions and module importing funcs). > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. My outline above does not freeze anything. Everything resides in the filesystem. The C code merely needs a path-scanning loop and functions to import .py*, builtin, and frozen types of modules. If somebody nukes their imputil.py or site.py, then they return to Python 1.4 behavior where the core interpreter uses a path for importing (i.e. no packages). They lose dynamically-loaded module support. > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'm not a fan of the compositing due to it requiring a change to semantics that I believe are very useful and very clean. However, I outlined a possible, clean solution to do that (a secondary set of hooks for transforming get_code() return values). 
The requirements are otherwise reasonable to me, as I see that they can all be readily solved (i.e. they aren't burdensome). While this email may be long, I do not believe the resulting system would be complex. From the user-visible side of things, nothing would be changed. sys.path is still present and operates as before. They *do* have new functionality they can grow into, though (sys.importers). The underlying C code is simplified, and the platform-specific dynamic-load stuff can be distributed to distinct modules, as needed (e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c). > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? If the three startup files require byte-compilation, then you could have some issues (i.e. the byte-compiler must be present). Once you hit site.py, you have a "full" environment and can easily detect and import a read-eval-print loop module (i.e. why return to Python? just start things up right there). site.py can also install new optimizers as desired, a new Python-based parser or compiler, or whatever... If Python is built without a parser or compiler (I hope that's an option!), then the three startup modules would simply be frozen into the executable. Cheers, -g -- Greg Stein, http://www.lyra.org/ From bwarsaw at cnri.reston.va.us Fri Nov 19 17:30:15 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST) Subject: [Python-Dev] CVS log messages with diffs References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us> There was a suggestion to start augmenting the checkin emails to include the diffs of the checkin. This would let you keep a current snapshot of the tree without having to do a direct `cvs update'. I think I can add this without a ton of pain. It would not be optional however, and the emails would get larger (and some checkins could be very large). There's also the question of whether to generate unified or context diffs. Personally, I find context diffs easier to read; unified diffs are smaller but not by enough to really matter. So here's an informal poll. If you don't care either way, you don't need to respond. Otherwise please just respond to me and not to the list. 1. Would you like to start receiving diffs in the checkin messages? 2. If you answer `yes' to #1 above, would you prefer unified or context diffs? -Barry From bwarsaw at cnri.reston.va.us Fri Nov 19 18:04:51 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us> We had some discussion a while back about enabling thread support by default, if the underlying OS supports it obviously. I'd like to see that happen for 1.6. IIRC, this shouldn't be too hard -- just a few tweaks of the configure script (and who knows what for those minority platforms that don't use configure :). -Barry From akuchlin at mems-exchange.org Fri Nov 19 18:07:07 1999 From: akuchlin at mems-exchange.org (Andrew M. 
Kuchling) Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us> References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us> Barry A. Warsaw writes: >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. I'd like to see That reminds me... what about the free threading patches? Perhaps they should be added to the list of issues to consider for 1.6. -- A.M. Kuchling http://starship.python.net/crew/amk/ Oh, my fingers! My arms! My legs! My everything! Argh... -- The Doctor, in "Nightmare of Eden" From petrilli at amber.org Fri Nov 19 18:23:02 1999 From: petrilli at amber.org (Christopher Petrilli) Date: Fri, 19 Nov 1999 12:23:02 -0500 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500 References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us> Message-ID: <19991119122302.B23400@trump.amber.org> Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote: > Barry A. Warsaw writes: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. I'd like to see Yes pretty please! One of the biggest problems we have in the Zope world is that for some unknown reason, most of hte Linux RPMs don't have threading on in them, so people end up having to compile it anyway... while this is a silly thing, it does create problems, and means that we deal with a lot of "dumb" problems. > That reminds me... what about the free threading patches? Perhaps > they should be added to the list of issues to consider for 1.6. My recolection was that unfortunately MOST of the time, they actually slowed down things because of the number of locks involved... Guido can no doubt shed more light onto this, but... there was a reason. Chris -- | Christopher Petrilli | petrilli at amber.org From gmcm at hypernet.com Fri Nov 19 19:22:37 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 13:22:37 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> References: Your message of "Thu, 18 Nov 1999 09:19:48 EST." <1269187709-18981857@hypernet.com> Message-ID: <1269086690-25057991@hypernet.com> [Guido] > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility > layers can be provided in pure Python Good idea. Question: we have keyword import, __import__, imp and PyImport_*. Which of those (if any) define the "core API"? [rexec, freeze: yes] > - load .py/.pyc/.pyo files and shared libraries from files Shared libraries? Might that not involve some rather shady platform-specific magic? If it can be kept kosher, I'm all for it; but I'd say no if it involved, um, undocumented features. > support for packages Absolutely. I'll just comment that the concept of package.__path__ is also affected by the next point. > > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning > > - $PYTHONPATH and $PYTHONHOME should still be supported If sys.path changes meaning, should not $PYTHONPATH also? 
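As a concrete illustration of the "compatibility layers can be provided in pure Python" point, the old four-argument __import__ signature could be preserved on top of whatever the new machinery becomes. Here _new_import is only a stand-in that defers to the current importer; globals and locals are accepted but ignored:

    import sys

    def _new_import(name):
        # stand-in for the new machinery
        __import__(name)
        return sys.modules[name]

    def compat_import(name, globals=None, locals=None, fromlist=None):
        module = _new_import(name)
        if not fromlist:
            # "import a.b.c" binds the name "a", so return the top-level package
            return sys.modules[name.split('.')[0]]
        return module

    # __builtin__.__import__ = compat_import   (builtins.__import__ nowadays)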
> New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. > a > module prepared by the distutil tools should install > painlessly) I assume that this is mostly a matter of $PYTHONPATH and other path manipulation mechanisms? > - Good support for prospective authors of "all-in-one" packaging > tool > authors like Gordon McMillan's win32 installer or /F's squish. > (But I *don't* require backwards compatibility for existing > tools.) I guess you've forgotten: I'm that *really* tall guy <wink>. > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a > directory; > its contents will be searched for modules or packages I don't mind this, but it depends on whether sys.path changes meaning. > (2) a file in a directory that's on sys.path can be a zip/jar > file; > its contents will be considered as a package (note that > this is different from (1)!) But it's affected by the same considerations (eg, do we start with filesystem names and wrap them in importers, or do we just start with importer instances / specifications for importer instances). > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip > compression in jar files, so can we. I think this is a matter of what zip compression is officially blessed. I don't mind if it's none; providing / creating zipped versions for platforms that support it is nearly trivial. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should > be easy to add any or all: > > - support for a new compression scheme to the zip importer > > - support for a new archive format, e.g. tar > > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) Which begs the question of the meaning of sys.path; and if it's still filesystem names, how do you get one of these in there? > - a hook that imports from compressed .py or .pyc/.pyo files > > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) > > - a cache for file locations in directories/archives, to > improve > startup time > > - a completely different source of imported modules, e.g. for > an > embedded system or PalmOS (which has no traditional > filesystem) > > - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to > recognize .spam files and automatically translate them into .py > files, and you write a hook to support a new archive format, > then if both hooks are installed together, it should be > possible to find a .spam file in an archive and do the right > thing, without any extra action. Right? A bit of discussion: I've got 2 kinds of archives. One can contain anything & is much like a zip (and probably should be a zip). The other contains only compressed .pyc or .pyo. The latter keys contents by logical name, not filesystem name. No extensions, and when a package is imported, the code object returned is the __init__ code object, (vs returning None and letting the import mechanism come back and ask for package.__init__). When you're building an archive, you have to go thru the .py / .pyc / .pyo / is it a package / maybe compile logic anyway. 
Why not get it all over with, so that at runtime there's no choices to be made. Which means (for this kind of archive) that including somebody's .spam in your archive isn't a matter of a hook, but a matter of adding to the archive's build smarts. > - It should be possible to write hooks in C/C++ as well as Python > > - Applications embedding Python may supply their own > implementations, > default search path, etc., but don't have to if they want to > piggyback on an existing Python installation (even though the > latter is fraught with risk, it's cheaper and easier to > understand). A way of tweaking that which will become sys.path before Py_Initialize would be *most* welcome. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I > don't mind if the majority of the implementation is written in > Python. Using Python makes it easy to subclass. > > - In order to support importing from zip/jar files using > compression, > we'd at least need the zlib extension module and hence libz > itself, which may not be available everywhere. > > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to > be platform dependent). There are other possibilites here, but I have only half- formulated ideas at the moment. The critical part for embedding is to be able to *completely* control all path related logic. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal > with the fact that exceptions.py is needed during > Py_Initialize(); I want to be able to hack on the import code > written in Python without having to rebuild the executable all > the time. > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'll summarize as follows: 1) What "sys.path" means (and how it's construction can be manipulated) is critical. 2) See 1. > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I can assure you that code.py runs fine out of an archive :-). - Gordon From gstein at lyra.org Fri Nov 19 22:06:14 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> [ taking the liberty to CC: this back to python-dev ] On Fri, 19 Nov 1999, David Ascher wrote: > > > (2) a file in a directory that's on sys.path can be a zip/jar file; > > > its contents will be considered as a package (note that this is > > > different from (1)!) > > > > No problem. This will slow things down, as a stat() for *.zip and/or *.jar > > must be done, in addition to *.py, *.pyc, and *.pyo. > > Aside: it strikes me that for Python programs which import lots of files, > 'front-loading' the stat calls could make sense. When you first look at a > directory in sys.path, you read the entire directory in memory, and > successive imports do a stat on the directory to see if it's changed, and > if not use the in-memory data. Or am I completely off my rocker here? Not at all. 
I thought of this last night after my email. Since the Importer can easily retain state, it can hold a cache of the directory listings. If it doesn't find the file in its cached state, then it can reload the information from disk. If it finds it in the cache, but not on disk, then it can remove the item from its cache. The problem occurs when you path is [A, B], the file is in B, and you add something to A on-the-fly. The cache might direct the importer at B, missing your file. Of course, with the appropriate caveats/warnings, the system would work quite well. It really only breaks during development (which is one reason why I didn't accept some caching changes to imputil from MAL; but that was for the Importer in there; Python's new Importer could have a cache). I'm also not quite sure what the cost of reading a directory is, compared to issuing a bunch of stat() calls. Each directory read is an opendir/readdir(s)/closedir. Note that the DBM approach is kind of similar, but will amortize this cost over many processes. Cheers, -g -- Greg Stein, http://www.lyra.org/ From Jasbahr at origin.EA.com Fri Nov 19 21:59:11 1999 From: Jasbahr at origin.EA.com (Asbahr, Jason) Date: Fri, 19 Nov 1999 14:59:11 -0600 Subject: [Python-Dev] Another 1.6 wish Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> My first Python-Dev post. :-) >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. What's the consensus about Python microthreads -- a likely candidate for incorporation in 1.6 (or later)? Also, we have a couple minor convenience functions for Python in an MSDEV environment, an exposure of OutputDebugString for writing to the DevStudio log window and a means of tripping DevStudio C/C++ layer breakpoints from Python code (currently experimental). The msvcrt module seems like a likely candidate for these, would these be welcome additions? Thanks, Jason Asbahr Origin Systems, Inc. jasbahr at origin.ea.com From gstein at lyra.org Fri Nov 19 22:35:34 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> On Fri, 19 Nov 1999, Barry A. Warsaw wrote: > There was a suggestion to start augmenting the checkin emails to > include the diffs of the checkin. This would let you keep a current > snapshot of the tree without having to do a direct `cvs update'. I've been using diffs-in-checkin for review, rather than to keep a local snapshot updated. I guess you use the email for this (procmail truly is frightening), but I think for most people it would be for purposes of review. >...context vs unifed... > So here's an informal poll. If you don't care either way, you don't > need to respond. Otherwise please just respond to me and not to the > list. > > 1. Would you like to start receiving diffs in the checkin messages? Absolutely. > 2. If you answer `yes' to #1 above, would you prefer unified or > context diffs? Don't care. I've attached an archive of the files that I use in my CVS repository to do emailed diffs. These came from Ken Coar (an Apache guy) as an extraction from the Apache repository. Yes, they do use Perl. I'm not a Perl guy, so I probably would break things if I tried to "fix" the scripts by converting them to Python (in fact, Greg Ward helped to improve log_accum.pl for me!). 
I certainly would not be adverse to Python versions of these files, or other cleanups. I trimmed down the "avail" file, leaving a few examples. It works with cvs_acls.pl to provide per-CVS-module read/write access control. I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other small projects out of this repository. It has been working quite well. Cheers, -g -- Greg Stein, http://www.lyra.org/ -------------- next part -------------- A non-text attachment was scrubbed... Name: cvs-for-barry.tar.gz Type: application/octet-stream Size: 9668 bytes Desc: URL: <http://mail.python.org/pipermail/python-dev/attachments/19991119/45a7f916/attachment.obj> From bwarsaw at cnri.reston.va.us Fri Nov 19 22:45:14 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs References: <14389.31511.706588.20840@anthem.cnri.reston.va.us> <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes: GS> I've been using diffs-in-checkin for review, rather than to GS> keep a local snapshot updated. Interesting; I hadn't though about this use for the diffs. GS> I've attached an archive of the files that I use in my CVS GS> repository to do emailed diffs. These came from Ken Coar (an GS> Apache guy) as an extraction from the Apache repository. Yes, GS> they do use Perl. I'm not a Perl guy, so I probably would GS> break things if I tried to "fix" the scripts by converting GS> them to Python (in fact, Greg Ward helped to improve GS> log_accum.pl for me!). I certainly would not be adverse to GS> Python versions of these files, or other cleanups. Well, we all know Greg Ward's one of those subversive types, but then again it's great to have (hopefully now-loyal) defectors in our camp, just to keep us honest :) Anyway, thanks for sending the code, it'll come in handy if I get stuck. Of course, my P**l skills are so rusted I don't think even an oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I can put them to. Besides, I already have a huge kludge that gets run on each commit, and I don't think it'll be too hard to add diff generation... IF the informal vote goes that way. -Barry From gmcm at hypernet.com Fri Nov 19 22:56:20 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 16:56:20 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <1269073918-25826188@hypernet.com> [David Ascher got involuntarily forwarded] > > Aside: it strikes me that for Python programs which import lots > > of files, 'front-loading' the stat calls could make sense. > > When you first look at a directory in sys.path, you read the > > entire directory in memory, and successive imports do a stat on > > the directory to see if it's changed, and if not use the > > in-memory data. Or am I completely off my rocker here? I posted something here about dircache not too long ago. Essentially, I found it completely unreliable on NT and on Linux to stat the directory. There was some test code attached. 
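The caching scheme being discussed, reduced to a sketch: remember each directory's listing together with the mtime seen when it was read, and re-list only when the mtime changes. As the surrounding messages point out, directory mtimes are coarse (one or two seconds) and unreliable over Samba, so this can only ever be best-effort:

    import os

    class DirCache:
        def __init__(self):
            self.cache = {}            # directory -> (mtime, listing)

        def listdir(self, path):
            mtime = os.stat(path).st_mtime
            cached = self.cache.get(path)
            if cached is None or cached[0] != mtime:
                cached = (mtime, os.listdir(path))
                self.cache[path] = cached
            return cached[1]

        def exists(self, path, name):
            return name in self.listdir(path)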
- Gordon From gstein at lyra.org Fri Nov 19 23:09:36 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <19991119122302.B23400@trump.amber.org> Message-ID: <Pine.LNX.4.10.9911191359370.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Christopher Petrilli wrote: > Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote: > > Barry A. Warsaw writes: > > >We had some discussion a while back about enabling thread support by > > >default, if the underlying OS supports it obviously. I'd like to see Definitely. I think you still want a --disable-threads option, but the default really ought to include them. > Yes pretty please! One of the biggest problems we have in the Zope world > is that for some unknown reason, most of hte Linux RPMs don't have threading > on in them, so people end up having to compile it anyway... while this > is a silly thing, it does create problems, and means that we deal with > a lot of "dumb" problems. Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't actually had to build my own Python(!). Man... imagine that. After almost five years of using Linux/Python, I can actually rely on the OS getting it right! :-) > > That reminds me... what about the free threading patches? Perhaps > > they should be added to the list of issues to consider for 1.6. > > My recolection was that unfortunately MOST of the time, they actually > slowed down things because of the number of locks involved... Guido > can no doubt shed more light onto this, but... there was a reason. Yes, there were problems in the first round with locks and lock contention. The main issue is that a list must always use a lock to keep itself consistent. Always. There is no way for an application to say "hey, list object! I've got a higher-level construct here that guarantees there will be no cross-thread use of this list. Ignore the locking." Another issue that can't be avoided is using atomic increment/decrement for the object refcounts. Guido has already asked me about free threading patches for 1.6. I don't know if his intent was to include them, or simply to have them available for those who need them. Certainly, this time around they will be simpler since Guido folded in some of the support stuff (e.g. PyThreadState and per-thread exceptions). There are some other supporting changes that could definitely go into the core interpreter. The slow part comes when you start to add integrity locks to list, dict, etc. That is when the question on whether to include free threading comes up. Design-wise, there is a change or two that I would probably make. Note that shoving free-threading into the standard interpreter would get more eyeballs at the thing, and that people may have great ideas for reducing the overheads. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 19 23:11:02 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <Pine.LNX.4.10.9911191409570.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Asbahr, Jason wrote: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. > > What's the consensus about Python microthreads -- a likely candidate > for incorporation in 1.6 (or later)? microthreads? eh? 
> Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? Sure. I don't see why not. I know that I've use OutputDebugString a bazillion times from the Python layer. The breakpoint thingy... dunno, but I don't see a reason to exclude it. Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Fri Nov 19 23:11:38 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> Message-ID: <14389.51994.809130.22062@dolphin.mojam.com> Greg> The problem occurs when you path is [A, B], the file is in B, and Greg> you add something to A on-the-fly. The cache might direct the Greg> importer at B, missing your file. Typically your path will be relatively short (< 20 directories), right? Just stat the directories before consulting the cache. If any changed since the last time the cache was built, then invalidate the entire cache (or that portion of the cached information that is downstream from the first modified directory). It's still going to be cheaper than performing listdir for each directory in the path, and like you said, only require flushes during development or installation actions. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip at mojam.com Fri Nov 19 23:15:14 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <14389.52210.833368.249942@dolphin.mojam.com> Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. The modtime of the directory's stat info should only change if you add or delete entries in the directory. Were you perhaps expecting changes when other operations took place, like rewriting an existing file? Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip at mojam.com Fri Nov 19 23:34:42 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:34:42 -0600 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <199911192234.QAA24710@dolphin.mojam.com> Gordon wrote: Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. to which I replied: Skip> The modtime of the directory's stat info should only change if you Skip> add or delete entries in the directory. 
Were you perhaps Skip> expecting changes when other operations took place, like rewriting Skip> an existing file? I took a couple minutes to write a simple script to check things. It created a file, changed its mode, then unlinked it. I was a bit surprised that deleting a file didn't appear to change the directory's mod time. Then I realized that since file times are only recorded with one-second precision, you might see no change to the directory's mtime in some circumstances. Adding a sleep to the script between directory operations resolved the apparent inconsistency. Still, as Gordon stated, you probably can't count on directory modtimes to tell you when to invalidate the cache. It's consistent, just not reliable... if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mhammond at skippinet.com.au Sat Nov 20 01:04:28 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 20 Nov 1999 11:04:28 +1100 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat> > Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? These are both available in the win32api module. They dont really fit in the "msvcrt" module, as they are not part of the C runtime library, but the win32 API itself. This is really a pointer to the fact that some or all of the win32api should be moved into the core - registry access is the thing people most want, but there are plenty of other useful things that people reguarly use... Guido objects to the coding style, but hopefully that wont be a big issue. IMO, the coding style isnt "bad" - it is just more an "MS" flavour than a "Python" flavour - presumably people reading the code will have some experience with Windows, so it wont look completely foreign to them. The good thing about taking it "as-is" is that it has been fairly well bashed on over a few years, so is really quite stable. The final "coding style" issue is that there are no "doc strings" - all documentation is embedded in C comments, and extracted using a tool called "autoduck" (similar to "autodoc"). However, Im sure we can arrange something there, too. Mark. From jcw at equi4.com Sat Nov 20 01:21:43 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Sat, 20 Nov 1999 01:21:43 +0100 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com> Message-ID: <3835E997.8A4F5BC5@equi4.com> Skip Montanaro wrote: > [dir stat cache times] > I took a couple minutes to write a simple script to check things. It > created a file, changed its mode, then unlinked it. I was a bit > surprised that deleting a file didn't appear to change the directory's > mod time. Then I realized that since file times are only recorded > with one-second Or two, on Windows with older (FAT, as opposed to VFAT) file systems. > precision, you might see no change to the directory's mtime in some > circumstances. 
Adding a sleep to the script between directory > operations resolved the apparent inconsistency. Still, as Gordon > stated, you probably can't count on directory modtimes to tell you > when to invalidate the cache. It's consistent, just not reliable... > > if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, If the dir stat time is less than 2 seconds ago, flush - always. If the dir stat time says it hasn't been changed for at least 2 seconds then you can cache all entries and trust that any change is detected. In other words: take the *current* time into account, then it can work. I think. Maybe. Until you get into network drives and clock skew... -- Jean-Claude From gmcm at hypernet.com Sat Nov 20 04:43:32 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 22:43:32 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <3835E997.8A4F5BC5@equi4.com> Message-ID: <1269053086-27079185@hypernet.com> Jean-Claude wrote: > Skip Montanaro wrote: > > > [dir stat cache times] > > ... Then I realized that since > > file times are only recorded with one-second > > Or two, on Windows with older (FAT, as opposed to VFAT) file > systems. Oh lordy, it gets worse. With a time.sleep(1.0) between new files, Linux detects the change in the dir's mtime immediately. Cool. On NT, I get an average 2.0 sec delay. But sometimes it doesn't detect a delay in 100 secs (and my script quits). Then I added a stat of some file in the directory before the stat of the directory, (not the file I added). Now it acts just like Linux - no delay (on both FAT and NTFS partitions). OK... > I think. Maybe. Until you get into network drives and clock > skew... No success whatsoever in either direction across Samba. In fact the mtime of my Linux home directory as seen from NT is Jan 1, 1980. - Gordon From gstein at lyra.org Sat Nov 20 13:06:48 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST) Subject: [Python-Dev] updated imputil Message-ID: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> I've updated imputil... The main changes is that I added SysPathImporter and BuiltinImporter. I also did some restructing to help with bootstrapping the module (remove dependence on os.py). For testing a revamped Python import system, you can importing the thing and call imputil._test_revamp() to set it up. This will load normal, builtin, and frozen modules via imputil. Dynamic modules are still handled by Python, however. I ran a timing comparisons of importing all modules in /usr/lib/python1.5 (using standard and imputil-based importing). The standard mechanism can do it in about 8.8 seconds. Through imputil, it does it in about 13.0 seconds. Note that I haven't profiled/optimized any of the Importer stuff (yet). The point about dynamic modules actually discovered a basic problem that I need to resolve now. The current imputil assumes that if a particular Importer loaded the top-level module in a package, then that Importer is responsible for loading all other modules within that package. In my particular test, I tried to import "xml.parsers.pyexpat". The two package modules were handled by SysPathImporter. The pyexpat module is a dynamic load module, so it is *not* handled by the Importer -- bam. Failure. Basically, each part of "xml.parsers.pyexpat" may need to use a different Importer... 
Off to ponder, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 20 13:11:37 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST) Subject: [Python-Dev] updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911200411060.10639-100000@nebula.lyra.org> oops... forgot: http://www.lyra.org/greg/python/imputil.py -g -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Sat Nov 20 15:16:58 1999 From: skip at mojam.com (Skip Montanaro) Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269053086-27079185@hypernet.com> References: <3835E997.8A4F5BC5@equi4.com> <1269053086-27079185@hypernet.com> Message-ID: <14390.44378.83128.546732@dolphin.mojam.com> Gordon> No success whatsoever in either direction across Samba. In fact Gordon> the mtime of my Linux home directory as seen from NT is Jan 1, Gordon> 1980. Ain't life grand? :-( Ah, well, it was a nice idea... S From jim at interet.com Mon Nov 22 17:43:39 1999 From: jim at interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 11:43:39 -0500 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> Message-ID: <383972BB.C65DEB26@interet.com> Greg Stein wrote: > > I would suggest that both retain their *exact* meaning. We introduce > sys.importers -- a list of importers to check, in sequence. The first > importer on that list uses sys.path to look for and load modules. The > second importer loads builtins and frozen code (i.e. modules not on > sys.path). We should retain the current order. I think it is: first builtin, next frozen, next sys.path. I really think frozen modules should be loaded in preference to sys.path. After all, they are compiled in. > Users can insert/append new importers or alter sys.path as before.
I agree with Greg that sys.path should remain as it is. A list of importers can add the extra functionality. Users will probably want to adjust the order of the list. > > Implementation: > > --------------- > > > > - There must clearly be some code in C that can import certain > > essential modules (to solve the chicken-or-egg problem), but I don't > > mind if the majority of the implementation is written in Python. > > Using Python makes it easy to subclass. > > I posited once before that the cost of import is mostly I/O rather than > CPU, so using Python should not be an issue. MAL demonstrated that a good > design for the Importer classes is also required. Based on this, I'm a > *strong* advocate of moving as much as possible into Python (to get > Python's ease-of-coding with little relative cost). Yes, I agree. And I think the main() should be written in Python. Lots of Python should be written in Python. > The (core) C code should be able to search a path for a module and import > it. It does not require dynamic loading or packages. This will be used to > import exceptions.py, then imputil.py, then site.py. But these can be frozen in (as you mention below). I dislike depending on sys.path to load essential modules. If they are not frozen in, then we need a command line argument to specify their path, with sys.path used otherwise. Jim Ahlstrom From jim at interet.com Mon Nov 22 18:25:46 1999 From: jim at interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 12:25:46 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269144272-21594530@hypernet.com> Message-ID: <38397C9A.DF6B7112@interet.com> Gordon McMillan wrote: > [JimA] > > Think about multiple packages in multiple zip files. The zip > > files store file directories. That means we would need a > > sys.zippath to search the zip files. I don't want another > > PYTHONPATH phenomenon. > > What if sys.path looked like: > [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] Well, that changes the current meaning of sys.path. > > > > I suggest that archive files MUST be put into a known > > > > directory. > > No way. Hard code a directory? Overwrite someone else's > Python "standalone"? Write to a C: partition that is > deliberately sized to hold nothing but Windows? Make > network installations impossible? Ooops. I didn't mean a known directory you couldn't change. But I did mean a directory you shouldn't change. But you are right. The directory should be configurable. But I would still like to see a highly encouraged directory. I don't yet have a good design for this. Anyone have ideas on an official way to find library files? I think a Python library file is a Good Thing, but it is not useful if the archive can't be found. I am thinking of a busy SysAdmin with someone nagging him/her to install Python. SysAdmin doesn't want another headache. What if Python becomes popular and users want it on Unix and PC's? More work! There should be a standard way to do this that just works and is dumb-stupid-simple. This is a Python promotion issue. Yes everyone here can make sys.path work, but that is not the point. > The official Windows solution is stuff in registry about app > paths and such. Putting the dlls in the exe's directory is a > workaround which works and is more managable than the > official solution. I agree completely. 
> > > > We should also have the ability to append archive files to > > > > the executable or a shared library assuming the OS allows > > > > this That's a handy trick on Windows, but it's got nothing to do with Python. It also works on Linux. I don't know about other systems. > Flexibility. You can put Christian's favorite Einstein quote here > too. I hope we can still have ease of use with all this flexibility. As I said, we need to promote Python. Jim Ahlstrom From mal at lemburg.com Tue Nov 23 14:32:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 23 Nov 1999 14:32:42 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <383A977A.C20E6518@lemburg.com> FYI, I've uploaded a new version of the proposal which includes the encodings package, a definition of the 'raw unicode escape' encoding (available via e.g. ur""), Unicode format strings and a new method .breaklines(). The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: * Stream readers: What about .readline(), .readlines()? These could be implemented using .read() as generic functions instead of requiring their implementation by all codecs. Also see Line Breaks. * Python interface for the Unicode property database * What other special Unicode formatting characters should be enhanced to work with Unicode input? Currently only the following special semantics are defined: u"%s %s" % (u"abc", "abc") should return u"abc abc" (a short example follows below). Pretty quiet around here lately... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 38 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jcw at equi4.com Tue Nov 23 16:17:36 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Tue, 23 Nov 1999 16:17:36 +0100 Subject: [Python-Dev] New thread ideas in Perl-land Message-ID: <383AB010.DD46A1FB@equi4.com> Just got a note about a paper on a new way of dealing with threads, as presented to the Perl-Porters list. The idea is described in: http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt I have no time to dive in, comment, or even judge the relevance of this, but perhaps someone else on this list wishes to check it out. The author of this is Greg London <bartels at pixelmagic.com>. -- Jean-Claude From mhammond at skippinet.com.au Tue Nov 23 23:45:14 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 24 Nov 1999 09:45:14 +1100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 In-Reply-To: <383A977A.C20E6518@lemburg.com> Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat> > Pretty quiet around here lately... My guess is that most positions and opinions have been covered. It is now probably time for less talk, and more code! Is it time to start an implementation plan? Do we start with /F's Unicode implementation (which /G *smirk* seemed to approve of)? Who does what? When can we start to play with it? And a key point that seems to have been thrust in our faces at the start and hardly mentioned recently - does the proposal as it stands meet our sponsor's (HP) requirements? Mark.
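Going back to the formatting point in Marc-Andre's list of open issues, a minimal illustration of the semantics the 0.8 draft proposes -- nothing here is implemented yet, this is just the intended behaviour:

# Proposed: an 8-bit string argument is coerced into the Unicode result
# when the format string itself is a Unicode object.
result = u"%s %s" % (u"abc", "abc")
assert result == u"abc abc"
assert type(result) is type(u"")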
From gstein at lyra.org Wed Nov 24 01:40:44 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST) Subject: [Python-Dev] Re: updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> <enable-ramble-mode> :-) On Sat, 20 Nov 1999, Greg Stein wrote: >... > The point about dynamic modules actually discovered a basic problem that I > need to resolve now. The current imputil assumes that if a particular > Importer loaded the top-level module in a package, then that Importer is > responsible for loading all other modules within that package. In my > particular test, I tried to import "xml.parsers.pyexpat". The two package > modules were handled by SysPathImporter. The pyexpat module is a dynamic > load module, so it is *not* handled by the Importer -- bam. Failure. > > Basically, each part of "xml.parsers.pyexpat" may need to use a different > Importer... I've thought about this and decided the issue is with my particular Importer, rather than the imputil design. The PathImporter traverses a set of paths and establishes a package hierarchy based on a filesystem layout. It should be able to load dynamic modules from within that filesystem area. A couple alternatives, and why I don't believe they work as well: * A separate importer to just load dynamic libraries: this would need to replicate PathImporter's mapping of the Python module/package hierarchy onto the filesystem. There would also be a sequencing issue because one Importer's paths would be searched before the other's paths. Current Python import rules establish that a module earlier in sys.path (whether a dyn-lib or not) is loaded before one later in the path. This behavior could be broken if two Importers were used. * A design whereby other types of modules can be placed into the filesystem and multiple Importers are used to load parts of the path (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This design doesn't work well because the mapping of Python module/package to the filesystem is established by PathImporter -- trying to mix a "private" mapping design among Importers creates too much coupling. There is also an argument that the design is fundamentally incorrect :-). I would argue against that, however. I'm not sure what form an argument *against* imputil would take, so I'm not sure how to preempt it :-). But we can get an idea of various arguments by hypothesizing different scenarios and requiring that the imputil design satisfies them. The above two alternatives examined the use of a secondary Importer to load things out of the filesystem (and explained why two Importers in whatever configuration are not a good thing). Let's state for argument's sake that files of some type T must be placeable within the filesystem (i.e. according to the layout defined by PathImporter). We'll also say that PathImporter doesn't understand T, since the type was designed later or is private to some app. The way to solve this is to allow PathImporter to recognize it through some configuration of the instance (e.g. self.recognized_types). A set of hooks in the PathImporter would then understand how to map files of type T to a code or module object.
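Sketched in code, that per-instance configuration might look roughly like this -- the class name, the add_type method and the compile-based handler are invented for the example, not part of the current PathImporter:

class PathImporterSketch:
    def __init__(self):
        # map a filename suffix to a handler that produces a code object
        self.recognized_types = {
            '.py': self._load_source,
        }

    def add_type(self, suffix, handler):
        "Teach the importer about a new file type T after the fact."
        self.recognized_types[suffix] = handler

    def _load_source(self, pathname):
        return compile(open(pathname).read(), pathname, 'exec')

A utility function could then scan sys.importers for the PathImporter instance and call add_type() on it to install support for T.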
(alternatively, a generalized set of hooks at the Importer class level) Note that you could easily have a utility function that scans sys.importers for a PathImporter instance and adds the data to recognize a new type -- this would allow for simple installation of new types. Note that PathImporter inherently defines a 1:1 mapping from a module to a file. Archives (zip or jar files) cannot be recognized and handled by PathImporter. An archive defines an entirely different style of mapping between a module/package and a file in the filesystem. Of course, an Importer that uses archives can certainly look for them in sys.path. The imputil design is derived directly from the "import" statement. "Here is a module/package name, give me a module." (this is embodied in the get_code() method in Importer) The find/load design established by ihooks is very filesystem-based. In many situations, a find/load is very intertwined. If you want to take the URL case, then just examine the actual network activity -- preferably, you want a single transaction (e.g. one HTTP GET). Find/load implies two transactions. With nifty context handling between the two steps, you can get away with a single transaction. But the point is that the design requires you to get work around its inherent two-step mechanism and establish a single step. This is weird, of course, because importing is never *just* a find or a load, but always both. Well... since I've satisfied to myself that PathImporter needs to load dynamic lib modules, I'm off to code it... Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 24 02:45:29 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST) Subject: [Python-Dev] breaking out code for dynamic loading Message-ID: <Pine.LNX.4.10.9911231731000.10639-100000@nebula.lyra.org> Guido, I can't find the message, but it seems that at some point you mentioned wanting to break out importdl.c into separate files. The configure process could then select the appropriate one to use for the platform. Sounded great until I looked at importdl.c. There are a 13 variants of dynamic loading. That would imply 13 separate files/modules. I'd be happy to break these out, but are you actually interested in that many resulting modules? If so, then any suggestions for naming? (e.g. aix_dynload, win32_dynload, mac_dynload) Here are the variants: * NeXT, using FVM shlibs (USE_RLD) * NeXT, using frameworks (USE_DYLD) * dl / GNU dld (USE_DL) * SunOS, IRIX 5 shared libs (USE_SHLIB) * AIX dynamic linking (_AIX) * Win32 platform (MS_WIN32) * Win16 platform (MS_WIN16) * OS/2 dynamic linking (PYOS_OS2) * Mac CFM (USE_MAC_DYNAMIC_LOADING) * HP/UX dyn linking (hpux) * NetBSD shared libs (__NetBSD__) * FreeBSD shared libs (__FreeBSD__) * BeOS shared libs (__BEOS__) Could I suggest a new top-level directory in the Python distribution named "Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new directories for each of the above platforms and move the appropriate portion of importdl.c into there as a Python C Extension Module. (the module would still be statically linked into the interpreter!) ./configure could select the module and write a Setup.dynload, much like it does with Setup.thread. 
Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 24 03:43:50 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST) Subject: [Python-Dev] another round of imputil work completed In-Reply-To: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231837480.10639-100000@nebula.lyra.org> On Tue, 23 Nov 1999, Greg Stein wrote: >... > Well... since I've satisfied to myself that PathImporter needs to load > dynamic lib modules, I'm off to code it... All right. imputil.py now comes with code to emulate the builtin Python import mechanism. It loads all the same types of files, uses sys.path, and (pointed out by JimA) loads builtins before looking on the path. The only "feature" it doesn't support is using package.__path__ to look for submodules. I never liked that thing, so it isn't in there. (imputil *does* set the __path__ attribute, tho) Code is available at: http://www.lyra.org/greg/python/imputil.py Next step is to add a "standard" library/archive format. JimA and I have been tossing some stuff back and forth on this. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Wed Nov 24 09:34:52 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 09:34:52 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <002301bf3604$68fd8f00$0501a8c0@bobcat> Message-ID: <383BA32C.2E6F4780@lemburg.com> Mark Hammond wrote: > > > Pretty quiet around here lately... > > My guess is that most positions and opinions have been covered. It is > now probably time for less talk, and more code! Or that everybody is on holidays... like Guido. > It is time to start an implementation plan? Do we start with /F's > Unicode implementation (which /G *smirk* seemed to approve of)? Who > does what? When can we start to play with it? This depends on whether HP agrees on the current specs. If they do, there should be code by mid December, I guess. > And a key point that seems to have been thrust in our faces at the > start and hardly mentioned recently - does the proposal as it stands > meet our sponsor's (HP) requirements? Haven't heard anything from them yet (this is probably mainly due to Guido being offline). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 24 10:32:46 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 10:32:46 +0100 Subject: [Python-Dev] Import Design Message-ID: <383BB0BE.BF116A28@lemburg.com> Before hooking on to some more PathBuiltinImporters ;-), I'd like to spawn a thread leading in a different direction... There has been some discussion on what we really expect of the import mechanism to be able to do. Here's a summary of what I think we need: * compatibility with the existing import mechanism * imports from library archives (e.g. .pyl or .par-files) * a modified intra package import lookup scheme (the thingy which I call "walk-me-up-Scotty" patch -- see previous posts) And for some fancy stuff: * imports from URLs (e.g. 
these could be put on the path for automatic inclusion in the import scan or be passed explicitly to __import__) * a (file based) static lookup cache to enhance lookup performance which is enabled via a command line switch (rather than being enabled per default), so that the user can decide whether to apply this optimization or not The point I want to make is: there aren't all that many features we are really looking for, so why not incorporate these into the builtin importer and only *then* start thinking about schemes for hooks, managers, etc. ?! -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Wed Nov 24 12:40:16 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST) Subject: [Python-Dev] Unicode Proposal: Version 0.8 Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com> --- Mark Hammond <mhammond at skippinet.com.au> wrote: > > Pretty quiet around here lately... > > My guess is that most positions and opinions have > been covered. It is > now probably time for less talk, and more code! > > It is time to start an implementation plan? Do we > start with /F's > Unicode implementation (which /G *smirk* seemed to > approve of)? Who > does what? When can we start to play with it? > > And a key point that seems to have been thrust in > our faces at the > start and hardly mentioned recently - does the > proposal as it stands > meet our sponsor's (HP) requirements? > > Mark. I had a long chat with them on Friday :-) They want it done, but nobody is actively working on it now as far as I can tell, and they are very busy. The per-thread thing was a red herring - they just want to be able to do (for example) web servers handling different encodings from a central unicode database, so per-output-stream works just fine. They will be at IPC8; I'd suggest that a round of prototyping, we insist they read it and then discuss it at IPC8, and be prepared to rework things thereafter are important. Hopefully then we'll have a plan on how to tackle the much larger (but less interesting to python-dev) job of writing and verifying all the codecs and utilities. Andy Robinson ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Thousands of Stores. Millions of Products. All in one place. Yahoo! Shopping: http://shopping.yahoo.com From jim at interet.com Wed Nov 24 15:43:57 1999 From: jim at interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 09:43:57 -0500 Subject: [Python-Dev] Re: updated imputil References: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <383BF9AD.E183FB98@interet.com> Greg Stein wrote: > * A separate importer to just load dynamic libraries: this would need to > replicate PathImporter's mapping of Python module/package hierarchy onto > the filesystem. There would also be a sequencing issue because one > Importer's paths would be searched before the other's paths. Current > Python import rules establishes that a module earlier in sys.path > (whether a dyn-lib or not) is loaded before one later in the path. This > behavior could be broken if two Importers were used. I would like to argue that on Windows, import of dynamic libraries is broken. 
If a file something.pyd is imported, then sys.path is searched to find the module. If a file something.dll is imported, the same thing happens. But Windows defines its own search order for *.dll files which Python ignores. I would suggest that this is wrong for files named *.dll, but OK for files named *.pyd. A SysAdmin should be able to install and maintain *.dll as she has been trained to do. This makes maintaining Python installations simpler and more un-surprising. I have no solution to the backward compatibilty problem. But the code is only a couple lines. A LoadLibrary() call does its own path searching. Jim Ahlstrom From jim at interet.com Wed Nov 24 16:06:17 1999 From: jim at interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 10:06:17 -0500 Subject: [Python-Dev] Import Design References: <383BB0BE.BF116A28@lemburg.com> Message-ID: <383BFEE9.B4FE1F19@interet.com> "M.-A. Lemburg" wrote: > The point I want to make is: there aren't all that many features > we are really looking for, so why not incorporate these into > the builtin importer and only *then* start thinking about > schemes for hooks, managers, etc. ?! Marc has made this point before, and I think it should be considered carefully. It is a lot of work to re-create the current import logic in Python and it is almost guaranteed to be slower. So why do it? I like imputil.py because it leads to very simple Python installations. I view this as a Python promotion issue. If we have a boot mechanism plus archive files, we can have few-file Python installations with package addition being just adding another file. But at least some of this code must be in C. I volunteer to write the rest of it in C if that is what people want. But it would add two hundred more lines of code to import.c. So maybe now is the time to switch to imputil, instead of waiting for later. But I am indifferent as long as I can tell a Python user to just put an archive file libpy.pyl in his Python directory and everything will Just Work. Jim Ahlstrom From bwarsaw at python.org Tue Nov 30 21:23:40 1999 From: bwarsaw at python.org (Barry Warsaw) Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST) Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us> Hello Python Developers! Thursday January 27 2000, the final day of the 8th International Python Conference is Developers' Day, where Python hackers get together to discuss and reach agreements on the outstanding issues facing Python. This is also your once-a-year chance for face-to-face interactions with Python's creator Guido van Rossum and other experienced Python developers. To make Developers' Day a success, we need you! We're looking for a few good champions to lead topic sessions. As a champion, you will choose a topic that fires you up and write a short position paper for publication on the web prior to the conference. You'll also prepare introductory material for the topic overview session, and lead a 90 minute topic breakout group. We've had great champions and topics in previous years, and many features of today's Python had their start at past Developers' Days. This is your chance to help shape the future of Python for 1.6, 2.0 and beyond. If you are interested in becoming a topic champion, you must email me by Wednesday December 15, 1999. 
For more information, please visit the IPC8 Developers' Day web page at <http://www.python.org/workshops/2000-01/devday.html> This page has more detail on schedule, suggested topics, important dates, etc. To volunteer as a champion, or to ask other questions, you can email me at bwarsaw at python.org. -Barry From mal at lemburg.com Mon Nov 1 00:00:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 00:00:55 +0100 Subject: [Python-Dev] Misleading syntax error text References: <1270838575-13870925@hypernet.com> Message-ID: <381CCA27.59506CF6@lemburg.com> [Extracted from the psa-members list...] Gordon McMillan wrote: > > Chris Fama wrote, > > And now the rub: the exact same function definition has passed > > through byte-compilation perfectly OK many times before with no > > problems... of course, this points rather clearly to the > > preceding code, but it illustrates a failing in Python's syntax > > error messages, and IMHO a fairly serious one at that, if this is > > indeed so. > > My simple experiments refuse to compile a "del getattr(..)" at > all. Hmm, it seems to be a fairly generic error:

>>> del f(x,y)
SyntaxError: can't assign to function call

How about changing the com_assign_trailer function in Python/compile.c to:

static void
com_assign_trailer(c, n, assigning)
	struct compiling *c;
	node *n;
	int assigning;
{
	REQ(n, trailer);
	switch (TYPE(CHILD(n, 0))) {
	case LPAR: /* '(' [exprlist] ')' */
		com_error(c, PyExc_SyntaxError,
			  assigning ? "can't assign to function call"
				    : "can't delete expression");
		break;
	case DOT: /* '.' NAME */
		com_assign_attr(c, CHILD(n, 1), assigning);
		break;
	case LSQB: /* '[' subscriptlist ']' */
		com_subscriptlist(c, CHILD(n, 1), assigning);
		break;
	default:
		com_error(c, PyExc_SystemError, "unknown trailer type");
	}
}

or something along those lines... BTW, has anybody tried my import patch recently? I haven't heard any criticism since posting it and wonder what made the list fall asleep over the topic :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 61 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
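A two-line test reproduces the message in question; the comments show the current message and the one the patched com_assign_trailer would report instead (the function name f is arbitrary):

try:
    compile("del f(x, y)", "<test>", "exec")
except SyntaxError as err:
    print(err)   # currently: "can't assign to function call"
                 # with the patch: "can't delete expression"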
From jack at oratrix.nl Mon Nov 1 10:56:48 1999 From: jack at oratrix.nl (Jack Jansen) Date: Mon, 01 Nov 1999 10:56:48 +0100 Subject: [Python-Dev] Embedding Python when using different calling conventions. In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Sat, 30 Oct 1999 10:46:30 +0200 , <381AB066.B54A47E0@lemburg.com> Message-ID: <19991101095648.DC2E535BB1E@snelboot.oratrix.nl> > OTOH, we could take chance to reorganize these macros from bottom > up: when I started coding extensions I found them not very useful > mostly because I didn't have control over them meaning "export > this symbol" or "import the symbol". Especially the DL_IMPORT > macro is strange because it seems to handle both import *and* > export depending on whether Python is compiled or not. This would be very nice. The DL_IMPORT/DL_EXPORT stuff is really weird unless you're working with it all the time. We were trying to build a plugin DLL for PythonWin and first you spend hours finding out that you have to set DL_IMPORT (and how to set it), and then you spend another few hours before you realize that you can't simply copy the DL_IMPORT and DL_EXPORT from, say, timemodule.c because timemodule.c is going to be in the Python core (and hence can use DL_IMPORT for its init() routine declaration) while your module is going to be a plugin so it can't. I would opt for a scheme where the define shows where the symbol is expected to live (DL_CORE and DL_THISMODULE would be needed at least, but probably one or two more for .h files). -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From jack at oratrix.nl Mon Nov 1 11:12:37 1999 From: jack at oratrix.nl (Jack Jansen) Date: Mon, 01 Nov 1999 11:12:37 +0100 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: Message by "Mark Hammond" <mhammond@skippinet.com.au> , Mon, 1 Nov 1999 12:51:56 +1100 , <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl> I think I agree with Mark's post, although I do see a little more light (the relative imports dicussion resulted in working code, for instance). The benevolent lieutenant idea may work, _if_ the lieutenants can be found. I myself will quickly join Mark in wishing the new python-dev well and abandoning ship (half a :-). If that doesn't work maybe we should try at the very least to create a "memory". If you bring up a subject for discussion and you don't have working code that's fine the first time. But if anyone brings it up a second time they're supposed to have code. That way at least we won't be rehashing old discussions (as happend on the python-list every time, with subjects like GC or optimizations). And maybe we should limit ourselves in our replies: don't speak up too much in discussions if you're not going to write code. I know that I'm pretty good at answering with my brilliant insights to everything myself:-). It could well be that refining and refining the design (as in the getopt discussion) results in such a mess of opinions that no-one has the guts to write the code anymore. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Mon Nov 1 12:09:21 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 12:09:21 +0100 Subject: [Python-Dev] dircache.py References: <1270737688-19939033@hypernet.com> Message-ID: <381D74E0.1AE3DA6A@lemburg.com> Gordon McMillan wrote: > > Pursuant to my volunteering to implement Guido's plan to > combine cmp.py, cmpcache.py, dircmp.py and dircache.py > into filecmp.py, I did some investigating of dircache.py. > > I find it completely unreliable. On my NT box, the mtime of the > directory is updated (on average) 2 secs after a file is added, > but within 10 tries, there's always one in which it takes more > than 100 secs (and my test script quits). My Linux box hardly > ever detects a change within 100 secs. > > I've tried a number of ways of testing this ("this" being > checking for a change in the mtime of the directory), the latest > of which is below. Even if dircache can be made to work > reliably and surprise-free on some platforms, I doubt it can be > done cross-platform. So I'd recommend that it just get dropped. > > Comments? Note that you'll have to flush and close the tmp file to actually have it written to the file system. That's why you are not seeing any new mtimes on Linux. Still, I'd suggest declaring it obsolete. Filesystem access is usually cached by the underlying OS anyway, so adding another layer of caching on top of it seems not worthwhile (plus, the OS knows better when and what to cache). Another argument against using stat() time entries for caching purposes is the resolution of 1 second. It makes the dircache.py unreliable per se for fast changing directories. The problem is most probably even worse for NFS and on Samba mounted WinXX filesystems the mtime trick doesn't work at all (stat() returns the creation time for atime, mtime and ctime). 
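For what it's worth, here is a quick sketch of the kind of test script Skip and Gordon have been running -- exact results will of course depend on platform and filesystem, and the helper functions used (tempfile.mkdtemp, st_mtime) are just to keep the sketch short:

import os, time, tempfile

d = tempfile.mkdtemp()
before = os.stat(d).st_mtime

name = os.path.join(d, "probe.txt")
f = open(name, "w")
f.write("x")
f.flush()
f.close()            # without the flush/close the data may never hit the disk
time.sleep(2)        # one-second mtime resolution: give the clock a chance to tick

os.unlink(name)
after = os.stat(d).st_mtime
print("directory mtime changed:", after != before)
os.rmdir(d)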
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward at cnri.reston.va.us Mon Nov 1 14:28:51 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Mon, 1 Nov 1999 08:28:51 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100 References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <19991101082851.A16952@cnri.reston.va.us> On 01 November 1999, Mark Hammond said: > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. Perhaps this is an indication of stability rather than stagnation. Of course we can't have *total* stability or Python 1.6 will never appear, but... > * Portable "spawn" module for core? > No result. ...I started this little thread to see if there was any interest, and to find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list of strings as command-line arguments" makes any sense at all on the Mac without actually having to go learn about the Mac. The result: if 'spawn()' is added to the core, it should probably be 'os.spawn()', but it's not really clear if this is necessary or useful to many people; and, no, it doesn't make sense on the Mac. That answered my questions, so I don't really see the thread as a failure. I might still turn the distutils.spawn module into an appendage of the os module, but there doesn't seem to be a compelling reason to do so. Not every thread has to result in working code. In other words, negative results are results too. Greg From skip at mojam.com Mon Nov 1 17:58:41 1999 From: skip at mojam.com (Skip Montanaro) Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST) Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat> References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <14365.50881.778143.590205@dolphin.mojam.com> Mark> * Catching "return" and "return expr" at compile time Mark> Seemed to be blessed - yay! Dont believe I have seen a check-in Mark> yet. I did post a patch to compile.c here and to the announce list. I think the temporal distance between the furor in the main list and when it appeared "in print" may have been a problem. Also, as the author of that code I surmised that compile.c was the wrong place for it. I would have preferred to see it in some Python code somewhere, but there's no obvious place to put it. Finally, there is as yet no convention about how to handle warnings. (Maybe some sort of PyLint needs to be "blessed" and made part of the distribution.) Perhaps python-dev would be good to generate SIGs, sort of like a hurricane spinning off tornadoes. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From guido at CNRI.Reston.VA.US Mon Nov 1 19:41:32 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 01 Nov 1999 13:41:32 -0500 Subject: [Python-Dev] Misleading syntax error text In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100." 
<381CCA27.59506CF6@lemburg.com> References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us> > How about chainging the com_assign_trailer function in Python/compile.c > to: Please don't use the python-dev list for issues like this. The place to go is the python-bugs database (http://www.python.org/search/search_bugs.html) or you could just send me a patch (please use a context diff and include the standard disclaimer language). --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Mon Nov 1 20:06:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 01 Nov 1999 20:06:39 +0100 Subject: [Python-Dev] Misleading syntax error text References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us> Message-ID: <381DE4BF.951B03F0@lemburg.com> Guido van Rossum wrote: > > > How about chainging the com_assign_trailer function in Python/compile.c > > to: > > Please don't use the python-dev list for issues like this. The place > to go is the python-bugs database > (http://www.python.org/search/search_bugs.html) or you could just send > me a patch (please use a context diff and include the standard disclaimer > language). This wasn't really a bug report... I was actually looking for some feedback prior to sending a real (context) patch. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 60 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jim at interet.com Tue Nov 2 16:43:56 1999 From: jim at interet.com (James C. Ahlstrom) Date: Tue, 02 Nov 1999 10:43:56 -0500 Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee? References: <002301bf240b$ae61fa00$0501a8c0@bobcat> Message-ID: <381F06BC.CC2CBFBD@interet.com> Mark Hammond wrote: > > I have for some time been wondering about the usefulness of this > mailing list. It seems to have produced staggeringly few results > since inception. I appreciate the points you made, but I think this list is still a valuable place to air design issues. I don't want to see too many Python core changes anyway. Just my 2.E-2 worth. Jim Ahlstrom From Vladimir.Marangozov at inrialpes.fr Wed Nov 3 23:34:44 1999 From: Vladimir.Marangozov at inrialpes.fr (Vladimir Marangozov) Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT) Subject: [Python-Dev] paper available Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip Since there may be legal problems with LNCS, I will disable the link shortly (so those of you who have not received a copy and are interested in reading it, please grab it quickly) If prof. Saltzer agrees (and if he can, legally) put it on his web page, I guess that the paper will show up at http://mit.edu/saltzer/ Jeremy, could you please check this with prof. Saltzer? 
(This version might need some corrections due to the OCR process, despite that I've made a significant effort to clean it up) -- Vladimir MARANGOZOV | Vladimir.Marangozov at inrialpes.fr http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252 From guido at CNRI.Reston.VA.US Thu Nov 4 21:58:53 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 04 Nov 1999 15:58:53 -0500 Subject: [Python-Dev] wish list Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us> I got the wish list below. Anyone care to comment on how close we are on fulfilling some or all of this? --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Thu, 04 Nov 1999 20:26:54 +0700 From: "Claudio Ramón" <rmn70 at hotmail.com> To: guido at python.org Hello, I'm a python user (excuse my english, I'm spanish and...). I think it is a very complete language and I use it to solve statistics, physics, mathematics, chemistry and biology problems. I'm not an experienced programmer, only a scientist with problems to solve. The motive of this letter is to explain some needs that I have in my use of Python and that I think about for the next versions... * GNU CC for Win32 compatibility (compilation of the python interpreter and the "Freeze" utility). I think MingWin32 (Mumit Khan) is a good alternative, avoiding the use of the cygwin dll. * Add low level programming capabilities for system access and speed of code fragments, avoiding the use of C/C++ or Java code. Python, I think, must be a complete programming language in the "programming for everybody" philosophy. * Incorporate WxWindows (wxpython) and/or Gtk+ (a win32 port now exists) GUI in the standard distribution. For example, wxPython permits an html browser. It is very important for document presentations. And WxWindows and Gtk+ are faster than tk. * Incorporate a database system in the standard library distribution. If possible with relational and documental capabilities and with import facilities for DBASE, Paradox, MSAccess files. * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (if possible with XML as the internal file format). And if possible with a Microsoft Word import/export facility. For example, the AbiWord project can be an alternative but it lacks a programming language. If we can make Python the programming language for the AbiWord project... Thanks. Ramón Molina. rmn70 at hotmail.com ______________________________________________________ Get Your Private, Free Email at http://www.hotmail.com ------- End of Forwarded Message From skip at mojam.com Thu Nov 4 22:06:53 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62829.389307.377095@dolphin.mojam.com> * Incorporate a database system in the standard library distribution. If possible with relational and documental capabilities and with import facilities for DBASE, Paradox, MSAccess files. I know Digital Creations has a dbase module knocking around there somewhere. I hacked on it for them a couple years ago. You might see if JimF can scrounge it up and donate it to the cause. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake at acm.org Thu Nov 4 22:08:26 1999 From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST) Subject: [Python-Dev] wish list In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us> References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us> Guido van Rossum writes: > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? Claudio Ram?n <rmn70 at hotmail.com> wrote: > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. And GTK+ looks better, too. ;-) None the less, I don't think GTK+ is as solid or mature as Tk. There are still a lot of oddities, and several warnings/errors get messages printed on stderr/stdout (don't know which) rather than raising exceptions. (This is a failing of GTK+, not PyGTK.) There isn't an equivalent of the Tk text widget, which is a real shame. There are people working on something better, but it's not a trivial project and I don't have any idea how its going. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Doesn't sound like part of a core library really, though I could see combining the Win32 extensions with the core package to produce a single installable. That should at least provide access to MSAccess, and possible the others, via ODBC. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I think this would be great to have. But I wouldn't put the editor/browser in the core. I would stick something like the XML-SIG's package in, though, once that's better polished. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From jim at interet.com Fri Nov 5 01:09:40 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 04 Nov 1999 19:09:40 -0500 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <38222044.46CB297E@interet.com> Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > on fulfilling some or all of this? > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I don't know what this means. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. Python, I think, must be a > complete programming language in the "programming for every body" philosofy. I don't know what this means in practical terms either. I use the C interface for this. > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. As a Windows user, I don't feel comfortable publishing GUI code based on these tools. Maybe they have progressed and I should look at them again. 
But I doubt the Python world is going to standardize on a single GUI anyway. Does anyone out there publish Windows Python code with a Windows Python GUI? If so, what GUI toolkit do you use? Jim Ahlstrom From rushing at nightmare.com Fri Nov 5 08:22:22 1999 From: rushing at nightmare.com (Sam Rushing) Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <668469884@toto.iv> Message-ID: <14370.34222.884193.260990@seattle.nightmare.com> James C. Ahlstrom writes: > Guido van Rossum wrote: > > I got the wish list below. Anyone care to comment on how close we are > > on fulfilling some or all of this? > > > * GNU CC for Win32 compatibility (compilation of python interpreter and > > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > > eviting the cygwin dll user. > > I don't know what this means. mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying to be unix. It links against crtdll, so for example it can generate small executables that run on any win32 platform. Also, an alternative to plunking down money ever year to keep up with MSVC++ I used to use mingw32 a lot, and it's even possible to set up egcs to cross-compile to it. At one point using egcs on linux I was able to build a stripped-down python.exe for win32... http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/ -Sam From jim at interet.com Fri Nov 5 15:04:59 1999 From: jim at interet.com (James C. Ahlstrom) Date: Fri, 05 Nov 1999 09:04:59 -0500 Subject: [Python-Dev] wish list References: <14370.34222.884193.260990@seattle.nightmare.com> Message-ID: <3822E40B.99BA7CA0@interet.com> Sam Rushing wrote: > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > to be unix. It links against crtdll, so for example it can generate OK, thanks. But I don't believe this is something that Python should pursue. Binaries are available for Windows and Visual C++ is widely available and has a professional debugger (etc.). Jim Ahlstrom From skip at mojam.com Fri Nov 5 18:17:58 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST) Subject: [Python-Dev] paper available In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr> References: <199911032234.XAA26442@pukapuka.inrialpes.fr> Message-ID: <14371.4422.96832.498067@dolphin.mojam.com> Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip I downloaded it and took a very quick peek at it, but it's applicability to Python wasn't immediately obvious to me. Did you download it in response to some other thread I missed somewhere? Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From gstein at lyra.org Fri Nov 5 23:19:49 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST) Subject: [Python-Dev] wish list In-Reply-To: <3822E40B.99BA7CA0@interet.com> Message-ID: <Pine.LNX.4.10.9911051418330.32496-100000@nebula.lyra.org> On Fri, 5 Nov 1999, James C. Ahlstrom wrote: > Sam Rushing wrote: > > mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying > > to be unix. It links against crtdll, so for example it can generate > > OK, thanks. But I don't believe this is something that > Python should pursue. Binaries are available for Windows > and Visual C++ is widely available and has a professional > debugger (etc.). 
If somebody is willing to submit patches, then I don't see a problem with it. There are quite a few people who are unable/unwilling to purchase VC++. People may also need to build their own Python rather than using the prebuilt binaries. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sun Nov 7 14:24:24 1999 From: gstein at lyra.org (Greg Stein) Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST) Subject: [Python-Dev] updated modules Message-ID: <Pine.LNX.4.10.9911070518020.32496-100000@nebula.lyra.org> Hi all... I've updated some of the modules at http://www.lyra.org/greg/python/. Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and a new imputil.py. The latter will be updated again RSN with some patches from Jim Ahlstrom. Besides some tweaks/fixes/etc, I've also clarified the ownership and licensing of the things. httplib and davlib are (C) Guido, licensed under the Python license (well... anything he chooses :-). qp_xml and imputil are still Public Domain. I also added some comments into the headers to note where they come from (I've had a few people remark that they ran across the module but had no idea who wrote it or where to get updated versions :-), and I inserted a CVS Id to track the versions (yes, I put them into CVS just now). Note: as soon as I figure out the paperwork or whatever, I'll also be skipping the whole "wetsign.txt" thingy and just transfer everything to Guido. He remarked a while ago that he will finally own some code in the Python distribution(!) despite not writing it :-) I might encourage others to consider the same... Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 8 10:33:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 08 Nov 1999 10:33:30 +0100 Subject: [Python-Dev] wish list References: <199911042058.PAA15437@eric.cnri.reston.va.us> Message-ID: <382698EA.4DBA5E4B@lemburg.com> Guido van Rossum wrote: > > * GNU CC for Win32 compatibility (compilation of python interpreter and > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative > eviting the cygwin dll user. I think this would be a good alternative for all those not having MS VC for one reason or another. Since Mingw32 is free this might be an appropriate solution for e.g. schools which don't want to spend lots of money for VC licenses. > * Add low level programming capabilities for system access and speed of code > fragments eviting the C-C++ or Java code use. Python, I think, must be a > complete programming language in the "programming for every body" philosofy. Don't know what he meant here... > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI > in the standard distribution. For example, Wxpython permit an html browser. > It is very importan for document presentations. And Wxwindows and Gtk+ are > faster than tk. GUIs tend to be fast moving targets, better leave them out of the main distribution. > * Incorporate a database system in the standard library distribution. To be > possible with relational and documental capabilites and with import facility > of DBASE, Paradox, MSAccess files. Database interfaces are usually way to complicated and largish for the standard dist. IMHO, they should always be packaged separately. Note that simple interfaces such as a standard CSV file import/export module would be neat extensions to the dist. > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to > be possible with XML how internal file format). 
And to be possible with > Microsoft Word import export facility. For example, AbiWord project can be > an alternative but if lacks programming language. If we can make python the > programming language for AbiWord project... I'm getting the feeling that Ramon is looking for a complete visual programming environment here. XML support in the standard dist (faster than xmllib.py) would be nice. Before that we'd need solid builtin Unicode support though... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 53 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Tue Nov 9 14:57:46 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST) Subject: [Python-Dev] Internationalisation Case Study Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com> Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard. I thought the most helpful thing would be to describe the real-world business problems I have been tackling so people can understand what one might want from an encoding toolkit. In this (long) post I have included: 1. who I am and what I want to do 2. useful sources of info 3. a real world i18n project 4. what I'd like to see in an encoding toolkit Grab a coffee - this is a long one. 1. Who I am -------------- Firstly, credentials. I'm a Python programmer by night, and when I can involve it in my work which happens perhaps 20% of the time. More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference tio a dictionary. Since then my Japanese has atrophied badly, but it is good enough for IT purposes. For the last year and a half I have been internationalizing a lot of systems - more on this below. My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation. An encoding library is sorely needed for this. 2. Sources of Knowledge ------------------------------ We should really go for world class advice on this. Some people who could really contribute to this discussion are: - Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe. - Jeffrey Friedl, author of "Mastering Regular Expressions", and a long time Japan resident and expert on things Japanese - Maybe some of the Ruby community? I'll list up books URLs etc. for anyone who needs them on request. 3. A Real World Project ---------------------------- 18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo) , who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms. Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential. I've been there for a while now, rotating through many different IT projects. HugeCo runs its non-US business out of the UK. The core deal-processing business runs on IBM AS400s. 
These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC. Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions. This means 'mirroring' data between the two systems on a regular basis. IBM has always included encoding information on the AS400 and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.) To make things work for Japan, everyone realised that a double-byte representation would be needed. Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it. Here's a potted language lesson. (Apologies to people who really know this field -- I am not going to be fully pedantic or this would take forever). Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence. The first attempt to display something was to make a single -byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page. So you could spell out the sounds of Japanese words using 'half width katakana'. The basic 'character set' is Japan Industrial Standard 0208 ("JIS"). This was defined in 1978, the first official Asian character set to be defined by a government. This can be thought of as a printed chart showing the characters - it does not define their storage on a computer. It defined a logical 94 x 94 grid, and each character has an index in this grid. The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails. Each Japanese character had a double-byte value. It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite. In 1978 Microsoft quickly came up with Shift-JIS, a smarter encoding. This basically said "Look at the next byte. If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half". Extended Unix Code (EUC) does similar tricks. Both have the property that there are no control characters, and ASCII is still ASCII. There are a few other encodings too. Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set). This involves shift-in and shift-out sequences (0x16 and 0x17, cannot remember which way round), so you can mix single and double bytes in a field. And we used AS400s for our core processing. So, back to the problem. We had a FoxPro system using ShiftJIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with. The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs. The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked. The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project. Not only did we need a conversion filter, we had the problem that the character sets were of different sizes. 
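(As an aside, the "look at the next byte" rule for Shift-JIS described above is simple enough to sketch in a few lines of Python. This is only an illustration: it works on an ordinary byte string, and the ranges are the ones usually quoted for Shift-JIS lead bytes and half-width katakana, so check them against the standard before relying on them.)

    def classify_sjis(s):
        """Split a Shift-JIS byte string into ('ascii'|'hankaku'|'kanji', text) runs."""
        runs = []
        i = 0
        while i < len(s):
            c = ord(s[i])
            if c < 0x80:                                  # plain ASCII
                runs.append(('ascii', s[i]))
                i = i + 1
            elif 0xA1 <= c <= 0xDF:                       # half-width katakana
                runs.append(('hankaku', s[i]))
                i = i + 1
            elif 0x81 <= c <= 0x9F or 0xE0 <= c <= 0xFC:  # lead byte of a double-byte pair
                if i + 1 >= len(s):
                    raise ValueError('truncated double-byte character')
                runs.append(('kanji', s[i:i+2]))
                i = i + 2
            else:
                raise ValueError('byte 0x%02X is not valid Shift-JIS' % c)
        return runs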
So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip. (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?) We figured out how to get the AS400 do to the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding. It was straightforward in theory, "fun" in practice. I then wrote a python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them. It could also detect corruption and warn us when it occurred. (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library). It was slow, but it got us our gateway in both directions, and it warned us of bad input. 360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends. The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't. A big issue was that field lengths varied. An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half. This resulted in a string that was illegal DBCS, and errors in the database. To guard against this, you need really picky input validation. You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too. The next stage was to bring in our Sybase databases. Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case sensitive - more (unrelated) fun when you have 2000 tables. Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems. Basically, a UTF8 character is between one and three bytes, there are no nulls or control characters, and the ASCII characters are still the same ASCII characters. UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic. We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese. The company bought a library called Unilib to do conversions, and started rewriting the data mirror software. This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings. We wanted to test it, and used the Python routines to put together a regression test. As expected, it was mostly right but had some differences, which we were at least able to document. We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months). We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin , so we were able to do the low-level string conversion in compiled code and the high-level control in Python. 
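(The "bit twiddling" for UTF8 mentioned above really is small and entirely algorithmic. Here is a rough Python sketch of the encoding direction for the one-to-three-byte cases being described, i.e. 16-bit code points only, returning the result as an ordinary byte string.)

    def ucs2_to_utf8(codepoint):
        """Pack one 16-bit Unicode code point into 1, 2 or 3 UTF-8 bytes."""
        if codepoint < 0x80:                      # ASCII stays ASCII
            return chr(codepoint)
        elif codepoint < 0x800:                   # two bytes: 110xxxxx 10xxxxxx
            return chr(0xC0 | (codepoint >> 6)) + \
                   chr(0x80 | (codepoint & 0x3F))
        elif codepoint <= 0xFFFF:                 # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | (codepoint >> 12)) + \
                   chr(0x80 | ((codepoint >> 6) & 0x3F)) + \
                   chr(0x80 | (codepoint & 0x3F))
        else:
            raise ValueError('outside the 16-bit range this sketch handles')

For example, ucs2_to_utf8(0x3042) (HIRAGANA LETTER A) gives the three bytes '\xe3\x81\x82'; the ASCII range comes back unchanged, which is what makes UTF8 safe to store in conventional systems.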
A FoxPro batch job wrote out delimited text in shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, ftp'ed them to an in directory on the Unix box ready for daily import. At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields). Another issue was half-width katakana. These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard. They look ugly and are discouraged; but when you are entering a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per-character to one (just hit F7 in Windows) to save space. Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents. The final technical step was our reports package. This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out Postscript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that. It's a monster to develop in, but when done it really works - million page jobs no problem. We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later. The first stage was that the AS400 would send down fixed width data files in EBCDIC and DBCS. We ran these through a C++ conversion utility, again using Unilib. We had to filter out and warn about corrupt fields, which the conversion utility would reject. Surviving records then went into the reports program. It then turned out that the reports program only supported some of the Japanese alphabets. Specifically, it had a built-in font-switching system whereby when it encountered ASCII text, it would flip to the most recent single byte font, and when it found a byte above 127, it would flip to a double byte font. This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones. This was wrong for Japanese, and made the half-width katakana unprintable. I found out that I could control fonts if I printed one character at a time with a special escape sequence, so I wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted. So a special subroutine is used for every name or address field. This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string. After all of this, we had a working system and knew quite a bit about encodings. Then the curve ball arrived: User Defined Characters! It is not true to say that there are exactly 6879 characters in Japanese, any more than one can count the number of languages on the Indian sub-continent or the types of cheese in France. There are historical variations and they evolve. Some people's names got missed out, and others like to write a kanji in an unusual way. Others arrived from China where they have more complex variants of the same characters. Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them.
My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it but not otherwise (the usual workaround is to spell their name phonetically in katakana). But our marketing people put their foot down. A key factor is that Microsoft has 'extended the standard' a few times. First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard. This is well understood and most encoding toolkits know that 'Code Page 932' is Shift-JIS plus a few extra characters. Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters. They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts; so users are getting used to these being available. These are not in a standard. Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region. The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server. This was also fun as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine. Effectively, each company has its own custom encoding a bit bigger than the standard. Clearly, none of these extensions would convert automatically to the other platforms. Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them. We decided that space for 128 User-Defined Characters would be allowed. We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked! Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back. So the conversion problem needed less code than we thought. This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway). We did, however, need some really impressive validation. When you input a name or address on any of the platforms, the system should say (a) is it valid for my encoding? (b) will it fit in the available field space in the other platforms? (c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.? Finally, we got back to the display problems. Our chosen range had a particular first byte. We built a miniature font with the characters we needed starting in the lower half of the code page. I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'. This worked beautifully - not only could we print everything, we were using type 1 embedded fonts for the user defined characters, so we could distill it and also capture it for our internal document imaging systems. So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms.
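(The three checks (a)-(c) above translate almost directly into code. Here is a minimal sketch; every helper named in it - is_valid_sjis, host_length_of, user_defined_chars_of - is a hypothetical stand-in for whatever conversion library is actually in use, Unilib in the project described above.)

    def validate_field(value, max_host_len, known_udc):
        """Return a list of problems with one name/address field."""
        problems = []
        if not is_valid_sjis(value):                 # check (a): well formed?
            problems.append('not valid Shift-JIS')
            return problems
        if host_length_of(value) > max_host_len:     # check (b): fits on the host once shift bytes are added?
            problems.append('too long for the host field')
        for ch in user_defined_chars_of(value):      # check (c): only the agreed UDCs?
            if ch not in known_udc:
                problems.append('unknown user-defined character: %s' % repr(ch))
        return problems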
I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web. The first stage was to prove it all worked. With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML. There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft Kanji and our user defined characters won't work on the web. You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative. Certain customers will need to be warned, or asked to suggest a standard version of a character if they want to see their name on the web. I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune. Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'. The language strings are all kept in UTF8 in XML files, so the same file can hold many languages. This and the database are the real-world reasons why you want to store stuff in UTF8. There are very few tools to let you view UTF8, but luckily there is a free Word Processor that lets you type Japanese and save it in any encoding; so we can cut and paste between Shift-JIS and UTF8 as needed. And that's it. No climactic endings and a lot of real world mess, just like life in IT. But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with. See my other mail for actual suggestions - Andy Robinson ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Tue Nov 9 14:58:39 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"): Built-in types: --------------- "Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to Java's Byte Array - in other words an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already. To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back. Easy Conversions ---------------------- This is modelled on Java which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered if conversion happens in a global function, a constructor method or whatever.
MyUniString = ToUnicode('hello') # assumes ASCII MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS') #specified The converse applies when converting back. The encoding designators should agree with Java. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly: 1. replace offending characters with a question mark 2. try to recover intelligently (possible in some cases) 3. raise an exception A 'Unicode' designator is needed which performs a dummy conversion. File Opening: --------------- It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally endoded addresses into a Unicode string. e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file. Not sure of a good API to do this. We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open. Doing the Conversions ---------------------------- All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb. But there are compact forms which can reduce these to a few kb, and also make the data intelligible. It is obviously desirable to store stuff compactly if we can unpack it fast. Typed Strings ---------------- When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around something ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like... >>>ts1 = TypedString('hello', 'cp932ms') # specify encoding, it remembers it >>>ts2 = TypedString('goodbye','cp5035') >>>ts1 + ts2 #or any of a host of other encoding options EncodingError >>>ts3 = TypedString(ts1, 'cp5035') #converts it implicitly going via Unicode >>>ts4 = ts1.cast('ShiftJIS') #the developer knows that in this case the string is compatible. Going Deeper ---------------- The project I describe involved many more issues than just a straight conversion. 
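(Before going deeper, the typed-string behaviour just described - refuse to mix encodings, allow an explicit cast that always goes via Unicode - fits in a very small class. This is only a sketch: to_unicode() and from_unicode() are hypothetical hooks standing in for whatever conversion machinery the toolkit ends up providing, and no validation is shown.)

    class EncodingError(Exception):
        pass

    class TypedString:
        """A string that remembers which encoding its bytes are in."""

        def __init__(self, data, encoding):
            self.data = data                # raw bytes, assumed already valid
            self.encoding = encoding

        def __add__(self, other):
            # refuse to mix encodings silently
            if other.encoding != self.encoding:
                raise EncodingError('cannot mix %s and %s'
                                    % (self.encoding, other.encoding))
            return TypedString(self.data + other.data, self.encoding)

        def cast(self, new_encoding):
            # explicit 'typecast' only, always going via Unicode
            u = to_unicode(self.data, self.encoding)
            return TypedString(from_unicode(u, new_encoding), new_encoding)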
I envisage an encodings package or module which power users could get at directly. We have be able to answer the questions: 'is string X a valid instance of encoding Y?' 'is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can. 'can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed' ? This is a really useful utility. More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data driven. So we could have an Endoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding. So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and 'cp932ms'. GUI Issues ------------- The new Pythonwin breaks somewhat on Japanese - editor windows are fine but console output is show as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility. Verifiability and printing ----------------------------- We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste. If it could save a page as HTML (just an encoding tag and data between <PRE> tags, then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings. All the encoding stuff is ultimately a bit meaningless without a way to display a character. I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding independent, and come with a whole bunch of mapping tables. So I can ask for the same Japanese font in any of about ten encodings - font name is a combination of face name and encoding. The font itself does the remapping. They make available downloadable font packs for Acrobat 4.0 for most languages now; these are good places to raid for building encoding databases. It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output to regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. 
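(Most of the questions in this section - 'is string X valid in encoding Y?', 'is one encoding a superset of another?' - suggest a single interface behind which both the single-byte and multi-byte, data-driven implementations could sit. A rough sketch of what such an interface might look like, with illustrative method names only, not an agreed API:)

    class Encoding:
        """Abstract interface; concrete encodings would be data driven."""
        name = 'undefined'

        def encode(self, unicode_string):        # Unicode -> local bytes
            raise NotImplementedError

        def decode(self, local_string):          # local bytes -> Unicode
            raise NotImplementedError

        def valid(self, local_string):
            """Is the string well formed in this encoding?"""
            try:
                self.decode(local_string)
                return 1
            except ValueError:
                return 0

        def characters(self):
            """Dictionary whose keys are the Unicode ordinals covered."""
            raise NotImplementedError

        def superset_of(self, other):
            """Can every character of 'other' be represented in this encoding?"""
            mine = self.characters()
            for c in other.characters().keys():
                if not mine.has_key(c):
                    return 0
            return 1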
But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing and thus help demonstrate its own correctness (or make errors stick out like a sore thumb). Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project. But I believe we could define the standard for years to come. Furthermore, it would go a long way to making Python the corporate choice for data cleaning and transformation - territory I think we should own. Regards, Andy Robinson Robinson Analytics Ltd. ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From guido at CNRI.Reston.VA.US Tue Nov 9 17:46:41 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 11:46:41 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST." <19991109135839.25864.rocketmail@web607.mail.yahoo.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us> Andy, Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms. Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.) One specific question: in you discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate? --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 9 18:21:03 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us> Guido van Rossum writes: >I think his proposal will go a long way towards your toolkit. I hope >to hear soon from anybody who disagrees with Marc-Andre's proposal, >because without opposition this is going to be Python 1.6's offering >for i18n... The proposal seems reasonable to me. >(Together with a new Unicode regex engine by /F.) This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. 
Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11" From guido at CNRI.Reston.VA.US Tue Nov 9 18:26:38 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 09 Nov 1999 12:26:38 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST." <14376.22527.323888.677816@amarok.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us> [AMK] > The proposal seems reasonable to me. Thanks. I really hope that this time we can move forward united... > >(Together with a new Unicode regex engine by /F.) > > This is good news! Would it be a from-scratch regex implementation, > or would it be an adaptation of an existing engine? Would it involve > modifications to the existing re module, or a completely new unicodere > module? (If, unlike re.py, it has POSIX longest-match semantics, that > would pretty much settle the question.) It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 9 18:40:07 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Guido van Rossum writes: >It's from scratch, and I believe it's got Perl style, not POSIX style >semantics -- per Tim Peters' recommendations. Do we need to open the >discussion again? No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet. >It involves a redone re module (supporting Unicode as well as 8-bit), >but its API could be unchanged. /F does the parsing and compilation >in Python, only the matching engine is in C -- not sure how that >impacts performance, but I imagine with aggressive caching it would be >okay. Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. 
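(For what it is worth, the 'aggressive caching' mentioned above need not be anything more elaborate than a module-level dictionary keyed on the pattern, so the Python-level parser/compiler only runs once per distinct pattern. A sketch - compile_pattern() is a hypothetical name for the slow Python-side compiler:)

    _cache = {}
    _MAXCACHE = 100          # arbitrary bound so the cache cannot grow forever

    def cached_compile(pattern, flags=0):
        key = (pattern, flags)
        if _cache.has_key(key):
            return _cache[key]
        compiled = compile_pattern(pattern, flags)   # the expensive part
        if len(_cache) >= _MAXCACHE:
            _cache.clear()                           # crude, but keeps it bounded
        _cache[key] = compiled
        return compiled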
(This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91 From ping at lfw.org Tue Nov 9 19:08:05 1999 From: ping at lfw.org (Ka-Ping Yee) Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911091004240.7102-100000@localhost> On Tue, 9 Nov 1999, Andrew M. Kuchling wrote: > Guido van Rossum writes: > >It's from scratch, and I believe it's got Perl style, not POSIX style > >semantics -- per Tim Peters' recommendations. Do we need to open the > >discussion again? > > No, no; I'm actually happier with Perl-style, because it's far better > documented and familiar to people. Worse *is* better, after all. I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood. > Doing the compilation in Python is a good idea, and will make it > possible to implement alternative syntaxes. Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng From bwarsaw at cnri.reston.va.us Tue Nov 9 19:15:04 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry From akuchlin at mems-exchange.org Tue Nov 9 20:36:28 1999 From: akuchlin at mems-exchange.org (Andrew M. 
Kuchling) Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST) Subject: [Python-Dev] I18N Toolkit In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us> Barry A. Warsaw writes: (in relation to support for Unicode regexes) >We need to be careful not to worsen performance for 8bit strings. I >think we're already on the edge of acceptability w.r.t. P*** and >hopefully we can /improve/ performance here. I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_ From mhammond at skippinet.com.au Tue Nov 9 23:27:45 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 10 Nov 1999 09:27:45 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> > I think his proposal will go a long way towards your toolkit. I hope > to hear soon from anybody who disagrees with Marc-Andre's proposal, No disagreement as such, but a small hole: From tim_one at email.msn.com Wed Nov 10 06:57:14 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 00:57:14 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us> Message-ID: <000001bf2b40$70183840$d82d153f@tim> [Guido, on "a new Unicode regex engine by /F"] > It's from scratch, and I believe it's got Perl style, not POSIX style > semantics -- per Tim Peters' recommendations. Do we need to open the > discussion again? No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim From tim_one at email.msn.com Wed Nov 10 07:25:07 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 10 Nov 1999 01:25:07 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > (under pressure from HP who want Python i18n badly and are willing to > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt I can't make time for a close review now. 
Just one thing that hit my eye early: Python should provide a built-in constructor for Unicode strings which is available through __builtins__: u = unicode(<encoded Python string>[,<encoding name>= <default encoding>]) u = u'<utf-8 encoded Python string>' Two points on the Unicode literals (u'abc'): UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence: octet: hexdigit hexdigit unicodecode: octet octet unicode_escape: "\\u" unicodecode Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals. The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open. BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential. international-in-spite-of-himself-ly y'rs - tim From fredrik at pythonware.com Wed Nov 10 09:08:06 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:08:06 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > http://starship.skyport.net/~lemburg/unicode-proposal.txt Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F> From captainrobbo at yahoo.com Wed Nov 10 09:09:26 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. 
First, a little nit > u = u'<utf-8 encoded Python string>' I don't like using funny prime characters - why not an explicit function like "utf8()" On to the important stuff:> > unicodec.register(<encname>,<encoder>,<decoder> > [,<stream_encoder>, <stream_decoder>]) > This registers the codecs under the given encoding > name in the module global dictionary > unicodec.codecs. Stream codecs are optional: > the unicodec module will provide appropriate > wrappers around <encoder> and > <decoder> if not given. I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this. (1) there are quite a lot of things you might want to do with an encoding object, and we could extend the interface in future easily. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another. (2) especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object (3) for some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, convert double-byte english to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge. (4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. One bit of code might be able to share the work of several encodings, by setting a few flags. Certainly true of Japanese. (5) encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object. (6) See my comments on a state machine for doing the encodings. If this is done well, we might two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven. (6) Easy to grow - encodings can be prototyped and proven in Python, ported to C if needed or when ready. In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way I'll have a go at an interface for that, and try show how it would have simplified specific problems I have faced. We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right. Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. 
------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Wed Nov 10 09:14:21 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:14:21 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Tim Peters wrote: > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...) > This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs > directly. So, as discussed earlier, we should follow Java's lead > and also introduce a \u escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. good idea. and by some reason, patches for this is included in the unicode distribution (see the attached str2utf.c). > The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. I vote for 'outlaw'. </F> /* A small code snippet that translates \uxxxx syntax to UTF-8 text. To be cut and pasted into Python/compile.c */ /* Written by Fredrik Lundh, January 1999. */ /* Documentation (for the language reference): \uxxxx -- Unicode character with hexadecimal value xxxx. The character is stored using UTF-8 encoding, which means that this sequence can result in up to three encoded characters. Note that the 'u' must be followed by four hexadecimal digits. If fewer digits are given, the sequence is left in the resulting string exactly as given. If more digits are given, only the first four are translated to Unicode, and the remaining digits are left in the resulting string. */ #define Py_CHARMASK(ch) ch void convert(const char *s, char *p) { while (*s) { if (*s != '\\') { *p++ = *s++; continue; } s++; switch (*s++) { /* -------------------------------------------------------------------- */ /* copy this section to the appropriate place in compile.c... 
*/ case 'u': /* \uxxxx => UTF-8 encoded unicode character */ if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) && isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) { /* fetch hexadecimal character value */ unsigned int n, ch = 0; for (n = 0; n < 4; n++) { int c = Py_CHARMASK(*s); s++; ch = (ch << 4) & ~0xF; if (isdigit(c)) ch += c - '0'; else if (islower(c)) ch += 10 + c - 'a'; else ch += 10 + c - 'A'; } /* store as UTF-8 */ if (ch < 0x80) *p++ = (char) ch; else { if (ch < 0x800) { *p++ = 0xc0 | (ch >> 6); *p++ = 0x80 | (ch & 0x3f); } else { *p++ = 0xe0 | (ch >> 12); *p++ = 0x80 | ((ch >> 6) & 0x3f); *p++ = 0x80 | (ch & 0x3f); } } break; } else goto bogus; /* -------------------------------------------------------------------- */ default: bogus: *p++ = '\\'; *p++ = s[-1]; break; } } *p++ = '\0'; } main() { int i; unsigned char buffer[100]; convert("Link\\u00f6ping", buffer); for (i = 0; buffer[i]; i++) if (buffer[i] < 0x20 || buffer[i] >= 0x80) printf("\\%03o", buffer[i]); else printf("%c", buffer[i]); } From gstein at lyra.org Thu Nov 11 10:18:52 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST) Subject: [Python-Dev] Re: Internal Format In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911110116050.638-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Fredrik Lundh wrote: > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 10 09:24:16 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 09:24:16 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > One specific question: in you discussion of typed strings, I'm not > sure why you couldn't convert everything to Unicode and be done with > it. I have a feeling that the answer is somewhere in your case study > -- maybe you can elaborate? Marc-Andre writes: Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object. 
keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environ- ment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F> From mal at lemburg.com Wed Nov 10 10:04:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:04:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat> Message-ID: <38293527.3CF5C7B0@lemburg.com> Mark Hammond wrote: > > > I think his proposal will go a long way towards your toolkit. I > hope > > to hear soon from anybody who disagrees with Marc-Andre's proposal, > > No disagreement as such, but a small hole: > > >From the proposal: > > Internal Argument Parsing: > -------------------------- > ... > 's': For Unicode objects: auto convert them to the <default encoding> > and return a pointer to the object's <defencbuf> buffer. > > -- > Excellent - if someone passes a Unicode object, it can be > auto-converted to a string. This will allow "open()" to accept > Unicode strings. Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though. > However, there doesnt appear to be a reverse. Eg, if my extension > module interfaces to a library that uses Unicode natively, how can I > get a Unicode object when the user passes a string? If I had to > explicitely check for a string, then check for a Unicode on failure it > would get messy pretty quickly... Is it not possible to have "U" also > do a conversion? "U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. 
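(To illustrate the cost referred to above: with a variable length format the n-th character no longer sits at a fixed offset, so indexing means a scan. A rough sketch, purely for illustration -- the helper is invented, not proposal code, and assumes little endian storage of 16-bit units:)

def utf16_offset(buf, n):
    # byte offset of code point n in a UTF-16 encoded string
    offset = 0
    for i in range(n):
        unit = ord(buf[offset]) | (ord(buf[offset + 1]) << 8)
        if 0xD800 <= unit <= 0xDBFF:   # high surrogate: two units per character
            offset = offset + 4
        else:
            offset = offset + 2
    return offset

# with a fixed width format (UCS4) the same answer is simply 4 * n, no scan
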
Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 10:49:01 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:49:01 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000201bf2b44$55b8ad00$d82d153f@tim> Message-ID: <38293F8D.F60AE605@lemburg.com> Tim Peters wrote: > > > Marc-Andre Lemburg has a proposal for work that I'm asking him to do > > (under pressure from HP who want Python i18n badly and are willing to > > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt > > I can't make time for a close review now. Just one thing that hit my eye > early: > > Python should provide a built-in constructor for Unicode strings > which is available through __builtins__: > > u = unicode(<encoded Python string>[,<encoding name>= > <default encoding>]) > > u = u'<utf-8 encoded Python string>' > > Two points on the Unicode literals (u'abc'): > > UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by > hand -- it breaks apart and rearranges bytes at the bit level, and > everything other than 7-bit ASCII requires solid strings of "high-bit" > characters. This is painful for people to enter manually on both counts -- > and no common reference gives the UTF-8 encoding of glyphs directly. So, as > discussed earlier, we should follow Java's lead and also introduce a \u > escape sequence: > > octet: hexdigit hexdigit > unicodecode: octet octet > unicode_escape: "\\u" unicodecode > > Inside a u'' string, I guess this should expand to the UTF-8 encoding of the > Unicode character at the unicodecode code position. For consistency, then, > it should probably expand the same way inside "regular strings" too. Unlike > Java does, I'd rather not give it a meaning outside string literals. It would be more conform to use the Unicode ordinal (instead of interpreting the number as UTF8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file or the Unicode book for that matter. > The other point is a nit: The vast bulk of UTF-8 encodings encode > characters in UCS-4 space outside of Unicode. In good Pythonic fashion, > those must either be explicitly outlawed, or explicitly defined. I vote for > outlawed, in the sense of detected error that raises an exception. That > leaves our future options open. See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW. > BTW, is ord(unicode_char) defined? And as what? And does ord have an > inverse in the Unicode world? Both seem essential. Good points. How about uniord(u[:1]) --> Unicode ordinal number (32-bit) unichr(i) --> Unicode object for character i (provided it is 32-bit); ValueError otherwise They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study). 
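(Assuming the two functions end up with the proposed signatures -- neither exists yet -- usage would look like this, with 0x20AC being the EURO SIGN:)

u = unichr(0x20AC)           # Unicode object for the EURO SIGN
assert uniord(u) == 0x20AC   # the two functions are inverses of each other
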
I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Wed Nov 10 11:50:05 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:50:05 +0100 Subject: regexp performance (Re: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Andrew M. Kuchling <akuchlin at mems-exchange.org> wrote: > (Speeding up PCRE -- that's another question. I'm often tempted to > rewrite pcre_compile to generate an easier-to-analyse parse tree, > instead of its current complicated-but-memory-parsimonious compiler, > but I'm very reluctant to introduce a fork like that.) any special pattern constructs that are in need of per- formance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferrably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F> From gstein at lyra.org Thu Nov 11 11:46:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293527.3CF5C7B0@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: >... > Well almost... it depends on the current value of <default encoding>. Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed. Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either... I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly. >... > Another problem is that Unicode types differ between platforms > (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit > wchar_t). Depending on the internal format of Unicode objects > this could mean calling different conversion APIs. Exactly the reason to avoid wchar_t. > BTW, I'm still not too sure about the underlying internal format. > The problem here is that Unicode started out as 2-byte fixed length > representation (UCS2) but then shifted towards a 4-byte fixed length > reprensetation known as UCS4. Since having 4 bytes per character > is hard sell to customers, UTF16 was created to stuff the UCS4 > code points (this is how character entities are called in Unicode) > into 2 bytes... with a variable length encoding. History is basically irrelevant. What is the situation today? 
What is in use, and what are people planning for right now? >... > The downside of using UTF16: it is a variable length format, > so iterations over it will be slower than for UCS4. Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable length format, when people seem to be doing fine with UCS-2? Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own. One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people can work with the values entirely in Python, PyUnicodeType-be-damned. Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; that the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice-versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there. > Simply sticking to UCS2 is probably out of the question, > since Unicode 3.0 requires UCS4 and we are targetting > Unicode 3.0. Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 10 11:52:28 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 11:52:28 +0100 Subject: [Python-Dev] I18N Toolkit References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> (a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin at mems-exchange.org> wrote: > I don't think that will be a problem, given that the Unicode engine > would be a separate C implementation. A bit of 'if type(strg) == > UnicodeType' in re.py isn't going to cost very much speed. a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious: ordinary pattern, ordinary string unicode pattern, unicode string but what about these? ordinary pattern, unicode string unicode pattern, ordinary string "coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F> From gstein at lyra.org Thu Nov 11 11:50:56 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38293F8D.F60AE605@lemburg.com> Message-ID: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > inverse in the Unicode world? Both seem essential. > > Good points. > > How about > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > ValueError otherwise Why new functions? Why not extend the definition of ord() and chr()? 
In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/ From jcw at equi4.com Wed Nov 10 12:14:17 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Wed, 10 Nov 1999 12:14:17 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38295389.397DDE5E@equi4.com> Greg Stein wrote: [MAL:] > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > is doing (as I recall). Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whetever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG, was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude From gstein at lyra.org Thu Nov 11 12:17:56 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295389.397DDE5E@equi4.com> Message-ID: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> On Wed, 10 Nov 1999, Jean-Claude Wippler wrote: > Greg Stein wrote: > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? > > I couldn't find a discussion in the last months of the string SIG, was > this decided upon and frozen long ago? Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Wed Nov 10 10:55:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 10:55:42 +0100 Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <3829411E.FD32F8CC@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > One specific question: in you discussion of typed strings, I'm not > > sure why you couldn't convert everything to Unicode and be done with > > it. I have a feeling that the answer is somewhere in your case study > > -- maybe you can elaborate? > > Marc-Andre writes: > > Unicode objects should have a pointer to a cached (read-only) char > buffer <defencbuf> holding the object's value using the current > <default encoding>. This is needed for performance and internal > parsing (see below) reasons. The buffer is filled when the first > conversion request to the <default encoding> is issued on the object. 
> > keeping track of an external encoding is better left > for the application programmers -- I'm pretty sure that > different application builders will want to handle this > in radically different ways, depending on their environ- > ment, underlying user interface toolkit, etc. It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways. > besides, this is how Tcl would have done it. Python's > not Tcl, and I think you need *very* good arguments > for moving in that direction. > > </F> > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 12:42:00 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 12:42:00 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> Message-ID: <38295A08.D3928401@lemburg.com> Andy Robinson wrote: > > In general, I like this proposal a lot, but I think it > only covers half the story. How we actually build the > encoder/decoder for each encoding is a very big issue. > Thoughts on this below. > > First, a little nit > > u = u'<utf-8 encoded Python string>' > I don't like using funny prime characters - why not an > explicit function like "utf8()" u = unicode('...I am UTF8...','utf-8') will do just that. I've moved to Tim's proposal with the \uXXXX encoding for u'', BTW. > On to the important stuff:> > > unicodec.register(<encname>,<encoder>,<decoder> > > [,<stream_encoder>, <stream_decoder>]) > > > This registers the codecs under the given encoding > > name in the module global dictionary > > unicodec.codecs. Stream codecs are optional: > > the unicodec module will provide appropriate > > wrappers around <encoder> and > > <decoder> if not given. > > I would MUCH prefer a single 'Encoding' class or type > to wrap up these things, rather than up to four > disconnected objects/functions. Essentially it would > be an interface standard and would offer methods to do > the four things above. > > There are several reasons for this. > > ... > > In summary, firm up the concept of an Encoding object > and give it room to grow - that's the key to > real-world usefulness. If people feel the same way > I'll have a go at an interface for that, and try show > how it would have simplified specific problems I have > faced. Ok, you have a point there. Here's a proposal (note that this only defines an interface, not a class structure): Codec Interface Definition: --------------------------- The following base class should be defined in the module unicodec. class Codec: def encode(self,u): """ Return the Unicode object u encoded as Python string. """ ... def decode(self,s): """ Return an equivalent Unicode object for the encoded Python string s. """ ... def dump(self,u,stream,slice=None): """ Writes the Unicode object's contents encoded to the stream. stream must be a file-like object open for writing binary data. If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... 
the base class should provide a default implementation of this method using self.encode ... def load(self,stream,length=None): """ Reads an encoded string (up to <length> bytes) from the stream and returns an equivalent Unicode object. stream must be a file-like object open for reading binary data. If length is given, only length bytes are read. Note that this can cause the decoding algorithm to fail due to truncations in the encoding. """ ... the base class should provide a default implementation of this method using self.encode ... Codecs should raise an UnicodeError in case the conversion is not possible. It is not required by the unicodec.register() API to provide a subclass of this base class, only the 4 given methods must be present. This allows writing Codecs as extensions types. XXX Still to be discussed: ? support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ ) ? support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings. ? support for numbers, digits, whitespace, etc. ? support (or no support) for private code point areas > We also need to think about where encoding info will > live. You cannot avoid mapping tables, although you > can hide them inside code modules or pickled objects > if you want. Should there be a standard > "..\Python\Enc" directory? Mapping tables should be incorporated into the codec modules preferably as static C data. That way multiple processes can share the same data. > And we're going to need some kind of testing and > certification procedure when adding new encodings. > This stuff has to be right. I will have to rely on your cooperation for the test data. Roundtrip testing is easy to implement, but I will also have to verify the output against prechecked data which is probably only creatable using visual tools to which I don't have access (e.g. a Japanese Windows installation). > Guido asked about TypedString. This can probably be > done on top of the built-in stuff - it is just a > convenience which would clarify intent, reduce lines > of code and prevent people shooting themselves in the > foot when juggling a lot of strings in different > (non-Unicode) encodings. I can do a Python module to > implement that on top of whatever is built. Ok. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 11:03:36 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 11:03:36 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <382942F8.1921158E@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > http://starship.skyport.net/~lemburg/unicode-proposal.txt > > Marc-Andre writes: > > The internal format for Unicode objects should either use a Python > specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte > little endian byte order) or a compiler provided wchar_t format (if > available). 
Using the wchar_t format will ease embedding of Python in > other Unicode aware applications, but will also make internal format > dumps platform dependent. > > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other > roads lie code bloat and speed penalties... > > (besides, this is exactly how it's already done in > unicode.c and what 'sre' prefers...) Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Wed Nov 10 13:32:16 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:32:16 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> > What I don't like is using wchar_t if available (and then addressing > it as if it were defined as unsigned integer). IMO, it's better > to define a Python Unicode representation which then gets converted > to whatever wchar_t represents on the target machine. you should read the unicode.h file a bit more carefully: ... /* Unicode declarations. Tweak these to match your platform */ /* set this flag if the platform has "wchar.h", "wctype.h" and the wchar_t type is a 16-bit unsigned type */ #define HAVE_USABLE_WCHAR_H #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) (this uses wchar_t, and also iswspace and friends) ... #else /* Use if you have a standard ANSI compiler, without wchar_t support. If a short is not 16 bits on your platform, you have to fix the typedef below, or the module initialization code will complain. */ (this maps iswspace to isspace, for 8-bit characters). #endif ... the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own uni- code database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F> From fredrik at pythonware.com Wed Nov 10 13:39:56 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 10 Nov 1999 13:39:56 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > Have you ever noticed how Python modules, packages, tools, etc, never > define an import hook? hey, didn't MAL use one in one of his mx kits? ;-) > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > something else, then do that explicitly. exactly. 
modes are evil. python is not perl. etc. > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F> From mal at lemburg.com Wed Nov 10 13:44:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:44:39 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110248270.18059-100000@nebula.lyra.org> Message-ID: <382968B7.ABFFD4C0@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > > Tim Peters wrote: > > > BTW, is ord(unicode_char) defined? And as what? And does ord have an > > > inverse in the Unicode world? Both seem essential. > > > > Good points. > > > > How about > > > > uniord(u[:1]) --> Unicode ordinal number (32-bit) > > > > unichr(i) --> Unicode object for character i (provided it is 32-bit); > > ValueError otherwise > > Why new functions? Why not extend the definition of ord() and chr()? > > In terms of backwards compatibility, the only issue could possibly be that > people relied on chr(x) to throw an error when x>=256. They certainly > couldn't pass a Unicode object to ord(), so that function can safely be > extended to accept a Unicode object and return a larger integer. Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 14:08:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:08:30 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> Message-ID: <38296E4E.914C0ED7@lemburg.com> Greg Stein wrote: > > On Wed, 10 Nov 1999, M.-A. Lemburg wrote: > >... > > Well almost... it depends on the current value of <default encoding>. > > Default encodings are kind of nasty when they can be altered. The same > problem occurred with import hooks. Only one can be present at a time. > This implies that modules, packages, subsystems, whatever, cannot set a > default encoding because something else might depend on it having a > different value. In the end, nobody uses the default encoding because it > is unreliable, so you end up with extra implementation/semantics that > aren't used/needed. I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time. > [...] > > > BTW, I'm still not too sure about the underlying internal format. 
> > The problem here is that Unicode started out as 2-byte fixed length > > representation (UCS2) but then shifted towards a 4-byte fixed length > > reprensetation known as UCS4. Since having 4 bytes per character > > is hard sell to customers, UTF16 was created to stuff the UCS4 > > code points (this is how character entities are called in Unicode) > > into 2 bytes... with a variable length encoding. > > History is basically irrelevant. What is the situation today? What is in > use, and what are people planning for right now? > > >... > > The downside of using UTF16: it is a variable length format, > > so iterations over it will be slower than for UCS4. > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl is > doing (as I recall). > > Why go with a variable length format, when people seem to be doing fine > with UCS-2? The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasable and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ): """ Q: How about using UCS-4 interfaces in my APIs? Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments. Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels. Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc. Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string. That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """ > Like I said in the other mail note: two large platforms out there are > UCS-2 based. They seem to be doing quite well with that approach. > > If people truly need UCS-4, then they can work with that on their own. One > of the major reasons for putting Unicode into Python is to > increase/simplify its ability to speak to the underlying platform. Hey! > Guess what? That generally means UCS2. All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. 
for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4). > If we didn't need to speak to the OS with these Unicode values, then > people can work with the values entirely in Python, > PyUnicodeType-be-damned. > > Are we digging a hole for ourselves? Maybe. But there are two other big > platforms that have the same hole to dig out of *IF* it ever comes to > that. I posit that it won't be necessary; that the people needing UCS-4 > can do so entirely in Python. > > Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and > vice-versa. But: it only does it from String to String -- you can't use > Unicode objects anywhere in there. See above. > > Simply sticking to UCS2 is probably out of the question, > > since Unicode 3.0 requires UCS4 and we are targetting > > Unicode 3.0. > > Oh? Who says? >From the FAQ: """ Q: What is UTF-16? Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16. """ Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligable, e.g. we could define the internal format to be UTF-16 and raise exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as political compromise ;-). But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 13:36:44 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 13:36:44 +0100 Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com> Message-ID: <382966DC.F33E340E@lemburg.com> Fredrik Lundh wrote: > > > What I don't like is using wchar_t if available (and then addressing > > it as if it were defined as unsigned integer). IMO, it's better > > to define a Python Unicode representation which then gets converted > > to whatever wchar_t represents on the target machine. > > you should read the unicode.h file a bit more carefully: > > ... > > /* Unicode declarations. Tweak these to match your platform */ > > /* set this flag if the platform has "wchar.h", "wctype.h" and the > wchar_t type is a 16-bit unsigned type */ > #define HAVE_USABLE_WCHAR_H > > #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H) > > (this uses wchar_t, and also iswspace and friends) > > ... > > #else > > /* Use if you have a standard ANSI compiler, without wchar_t support. > If a short is not 16 bits on your platform, you have to fix the > typedef below, or the module initialization code will complain. */ > > (this maps iswspace to isspace, for 8-bit characters). > > #endif > > ... 
> > the plan was to use the second solution (using "configure" > to figure out what integer type to use), and its own uni- > code database table for the is/to primitives Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16. > (iirc, the unicode.txt file discussed this, but that one > seems to be missing from the zip archive). It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 10 14:13:10 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 10 Nov 1999 14:13:10 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <38295389.397DDE5E@equi4.com> Message-ID: <38296F66.5DF9263E@lemburg.com> Jean-Claude Wippler wrote: > > Greg Stein wrote: > [MAL:] > > > The downside of using UTF16: it is a variable length format, > > > so iterations over it will be slower than for UCS4. > > > > Bzzt. May as well go with UTF-8 as the internal format, much like Perl > > is doing (as I recall). > > Ehm, pardon me for asking - what is the brief rationale for selecting > UCS2/4, or whetever it ends up being, over UTF8? UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Wed Nov 10 15:56:16 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST) Subject: [Python-Dev] Re: regexp performance In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <14376.22527.323888.677816@amarok.cnri.reston.va.us> <199911091726.MAA21754@eric.cnri.reston.va.us> <14376.23671.250752.637144@amarok.cnri.reston.va.us> <14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us> <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com> Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us> [Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes: >any special pattern constructs that are in need of per- >formance improvements? (compared to Perl, that is). In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. 
This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string contained 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE. >or maybe anyone has an extensive performance test >suite for perlish regular expressions? (preferrably based >on how real people use regular expressions, not only on >things that are known to be slow if not optimized) Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4" From jack at oratrix.nl Wed Nov 10 16:04:58 1999 From: jack at oratrix.nl (Jack Jansen) Date: Wed, 10 Nov 1999 16:04:58 +0100 Subject: [Python-Dev] I18N Toolkit In-Reply-To: Message by "Fredrik Lundh" <fredrik@pythonware.com> , Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl> > a slightly hairer design issue is what combinations > of pattern and string the new 're' will handle. > > the first two are obvious: > > ordinary pattern, ordinary string > unicode pattern, unicode string > > but what about these? > > ordinary pattern, unicode string > unicode pattern, ordinary string I think the logical thing to do would be to "promote" the ordinary pattern or string to unicode, in a similar way to what happens if you combine ints and floats in a single expression. The result may be a bit surprising if your pattern is in ascii and you've never been aware of unicode and are given such a string from somewhere else, but then if you're only aware of integer arithmetic and are suddenly presented with a couple of floats you'll also be pretty surprised at the result. At least it's easily explained. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From fdrake at acm.org Wed Nov 10 16:22:17 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST) Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us> Fredrik Lundh writes: > having been there and done that, I strongly suggest > a third option: a 16-bit unsigned integer, in platform > specific byte order (PY_UNICODE_T). along all other I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. 
Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From skip at mojam.com Wed Nov 10 16:54:30 1999 From: skip at mojam.com (Skip Montanaro) Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> Message-ID: <14377.38198.793496.870273@dolphin.mojam.com> Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From fdrake at acm.org Wed Nov 10 16:58:30 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <38295A08.D3928401@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > def encode(self,u): > > """ Return the Unicode object u encoded as Python string. This should accept an optional slice parameter, and use it in the same way as .dump(). > def dump(self,u,stream,slice=None): ... > def load(self,stream,length=None): Why not have something like .wrapFile(f) that returns a file-like object with all the file methods implemented, and doing to "right thing" regarding encoding/decoding? That way, the new file-like object can be used directly with code that works with files and doesn't care whether it uses 8-bit or unicode strings. > Codecs should raise an UnicodeError in case the conversion is > not possible. I think that should be ValueError, or UnicodeError should be a subclass of ValueError. (Can the -X interpreter option be removed yet?) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From bwarsaw at cnri.reston.va.us Wed Nov 10 17:41:29 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com> <14377.38198.793496.870273@dolphin.mojam.com> Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us> >>>>> "SM" == Skip Montanaro <skip at mojam.com> writes: SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry From mal at lemburg.com Wed Nov 10 14:31:47 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Wed, 10 Nov 1999 14:31:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <Pine.LNX.4.10.9911110236000.18059-100000@nebula.lyra.org> <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com> Message-ID: <382973C3.DCA77051@lemburg.com> Fredrik Lundh wrote: > > Greg Stein <gstein at lyra.org> wrote: > > Have you ever noticed how Python modules, packages, tools, etc, never > > define an import hook? > > hey, didn't MAL use one in one of his mx kits? ;-) Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter. > > I say axe it and say "UTF-8" is the fixed, default encoding. If you want > > something else, then do that explicitly. > > exactly. > > modes are evil. python is not perl. etc. But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals). > > Are we digging a hole for ourselves? Maybe. But there are two other big > > platforms that have the same hole to dig out of *IF* it ever comes to > > that. I posit that it won't be necessary; that the people needing UCS-4 > > can do so entirely in Python. > > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Wed Nov 10 22:36:04 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 11 Nov 1999 08:36:04 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Marc writes: > > modes are evil. python is not perl. etc. > > But a requirement by the customer... they want to be able to > set the locale > on a per thread basis. Not exactly my preference (I think all locale > settings should be passed as parameters, not via globals). Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants. For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized. I agree that having a default encoding that can be changed is a bad idea. It may make 3 line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do? Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable. Mark. 
> > > > Are we digging a hole for ourselves? Maybe. But there are > two other big > > > platforms that have the same hole to dig out of *IF* it > ever comes to > > > that. I posit that it won't be necessary; that the people > needing UCS-4 > > > can do so entirely in Python. > > > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? > > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). > > -- > Marc-Andre Lemburg > ______________________________________________________________________ > Y2000: 51 days left > Business: http://www.lemburg.com/ > Python Pages: http://www.lemburg.com/python/ > > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev > From gstein at lyra.org Fri Nov 12 00:14:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> On Thu, 11 Nov 1999, Mark Hammond wrote: > Marc writes: > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. Ha! I was getting ready to say exactly the same thing. Are building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-) >... > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? 
Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-) > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. *bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/ From da at ski.org Thu Nov 11 00:21:54 1999 From: da at ski.org (David Ascher) Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.LNX.4.10.9911111502360.18059-100000@nebula.lyra.org> Message-ID: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> On Thu, 11 Nov 1999, Greg Stein wrote: > Ha! I was getting ready to say exactly the same thing. Are building Python > for a particular customer, or are we building it to Do The Right Thing? > > I've been getting increasingly annoyed at "well, HP says this" or "HP > wants that." I'm ecstatic that they are a Consortium member and are > helping to fund the development of Python. However, if that means we are > selling Python's soul to corporate wishes rather than programming and > design ideals... well, it reduces my enthusiasm :-) What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david From gstein at lyra.org Fri Nov 12 00:31:43 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <Pine.WNT.4.04.9911101519110.244-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911111531200.18059-100000@nebula.lyra.org> Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote: > On Thu, 11 Nov 1999, Greg Stein wrote: > > > Ha! I was getting ready to say exactly the same thing. Are building Python > > for a particular customer, or are we building it to Do The Right Thing? > > > > I've been getting increasingly annoyed at "well, HP says this" or "HP > > wants that." I'm ecstatic that they are a Consortium member and are > > helping to fund the development of Python. However, if that means we are > > selling Python's soul to corporate wishes rather than programming and > > design ideals... well, it reduces my enthusiasm :-) > > What about just explaining the rationale for the default-less point of > view to whoever is in charge of this at HP and see why they came up with > their rationale in the first place? They might have a good reason, or > they might be willing to change said requirement. 
> > --david > -- Greg Stein, http://www.lyra.org/ From tim_one at email.msn.com Thu Nov 11 07:25:27 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:25:27 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com> Message-ID: <000201bf2c0d$8b866160$262d153f@tim> [/F, dripping with code] > ... > Note that the 'u' must be followed by four hexadecimal digits. If > fewer digits are given, the sequence is left in the resulting string > exactly as given. Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards] > \uxxxx -- Unicode character with hexadecimal value xxxx. The > character is stored using UTF-8 encoding, which means that this > sequence can result in up to three encoded characters. The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). From tim_one at email.msn.com Thu Nov 11 07:49:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:49:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911110315330.18059-100000@nebula.lyra.org> Message-ID: <000501bf2c10$df4679e0$262d153f@tim> [ Greg Stein] > ... > Things will be a lot faster if we have a fixed-size character. Variable > length formats like UTF-8 are a lot harder to slice, search, etc. The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. 
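A minimal sketch of that property, in present-day Python syntax (the sample strings are invented for illustration): an ordinary byte-oriented find() applied directly to UTF-8 data can only match on a real character boundary.

    text = "Grüße aus München".encode("utf-8")   # UTF-8 encoded bytes
    pattern = "München".encode("utf-8")
    pos = text.find(pattern)                     # plain 8-bit search, no decoding
    prefix = text[:pos].decode("utf-8")          # prefix decodes cleanly, so the
                                                 # hit starts on a character boundary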
UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>. > (IMO) a big reason for this new type is for interaction with the > underlying OS/platform. I don't know of any platforms right now that > really use UTF-8 as their Unicode string representation (meaning we'd > have to convert back/forth from our UTF-8 representation to talk to the > OS). No argument here. From tim_one at email.msn.com Thu Nov 11 07:56:35 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 01:56:35 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382968B7.ABFFD4C0@lemburg.com> Message-ID: <000601bf2c11$e4b07920$262d153f@tim> [MAL, on Unicode chr() and ord() > ... > Because unichr() will always have to return Unicode objects. You don't > want chr(i) to return Unicode for i>255 and strings for i<256. Indeed I do not! > OTOH, ord() could probably be extended to also work on Unicode objects. I think should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string. From tim_one at email.msn.com Thu Nov 11 08:03:34 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:03:34 -0500 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us> Message-ID: <000701bf2c12$de8bca80$262d153f@tim> [Andrew M. Kuchling] > ... > Friedl's book describes several optimizations which aren't implemented > in PCRE. The problem is that PCRE never builds a parse tree, and > parse trees are easy to analyse recursively. Instead, PCRE's > functions actually look at the compiled byte codes (for example, look > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > functions hard to write, and rearranging the code near-impossible. This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim From tim_one at email.msn.com Thu Nov 11 08:27:52 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:27:52 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382973C3.DCA77051@lemburg.com> Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim> [/F] > last time I checked, there were no characters (even in the > ISO standard) outside the 16-bit range. has that changed? [MAL] > No, but people are already thinking about it and there is > a defined range in the >16-bit area for private encodings > (F0000..FFFFD and 100000..10FFFD). Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. 
At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim From captainrobbo at yahoo.com Thu Nov 11 08:29:05 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com> > 2. Are there plans for an internationalization > session at IPC8? Perhaps a > few key players could be locked into a room for a > couple days, to emerge > bloodied, but with an implementation in-hand... Excellent idea. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From tim_one at email.msn.com Thu Nov 11 08:29:50 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 02:29:50 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <000901bf2c16$8a107420$262d153f@tim> [Mark Hammond] > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > ... > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. Well said, Mark! Me too. It's like HP is suffering from Windows envy <wink>. From captainrobbo at yahoo.com Thu Nov 11 08:30:53 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST) Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit) Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com> --- "Barry A. Warsaw" <bwarsaw at cnri.reston.va.us> wrote: > > I'm starting to think about devday topics. Sounds > like an I18n > session would be very useful. Champions? > I'm willing to explain what the fuss is about to bemused onlookers and give some examples of problems it should be able to solve - plenty of good slides and screen shots. I'll stay well away from the C implementation issues. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Thu Nov 11 08:33:25 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST) Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> > > What about just explaining the rationale for the > default-less point of > view to whoever is in charge of this at HP and see > why they came up with > their rationale in the first place? They might have > a good reason, or > they might be willing to change said requirement. > > --david For that matter (I came into this a bit late), is there a statement somewhere of what HP actually want to do? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Thu Nov 11 08:44:50 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> > I say axe it and say "UTF-8" is the fixed, default > encoding. If you want > something else, then do that explicitly. > Let me tell you why you would want to have an encoding which can be set: (1) sday I am on a Japanese Windows box, I have a string called 'address' and I do 'print address'. If I see utf8, I see garbage. If I see Shift-JIS, I see the correct Japanese address. At this point in time, utf8 is an interchange format but 99% of the world's data is in various native encodings. Analogous problems occur on input. (2) I'm using htmlgen, which 'prints' objects to standard output. My web site is supposed to be encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, etc.) Yes, browsers CAN detect and display UTF8 but you just don't find UTF8 sites in the real world - and most users just don't know about the encoding menu, and will get pissed off if they have to reach for it. Ditto for streaming output in some protocol. Java solves this (and we could too by hacking stdout) using Writer classes which are created as wrappers around an output stream and can take an encoding, but you lose the flexibility to 'just print'. I think being able to change encoding would be useful. What I do not want is to auto-detect it from the operating system when Python boots - that would be a portability nightmare. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Thu Nov 11 09:06:04 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Thu, 11 Nov 1999 09:06:04 +0100 Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance References: <000701bf2c12$de8bca80$262d153f@tim> Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com> Tim Peters <tim_one at email.msn.com> wrote: > > The problem is that PCRE never builds a parse tree, and > > parse trees are easy to analyse recursively. 
Instead, PCRE's > > functions actually look at the compiled byte codes (for example, look > > at find_firstchar or is_anchored in pypcre.c), but this makes analysis > > functions hard to write, and rearranging the code near-impossible. > > This is wonderfully & ironically Pythonic. That is, the Python compiler > itself goes straight to byte code, and the optimization that's done works at > the latter low level. yeah, but by some reason, people (including GvR) expect a regular expression machinery to be more optimized than the language interpreter ;-) </F> From tim_one at email.msn.com Thu Nov 11 09:01:58 1999 From: tim_one at email.msn.com (Tim Peters) Date: Thu, 11 Nov 1999 03:01:58 -0500 Subject: [Python-Dev] default encodings (was: Internationalization Toolkit) In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com> Message-ID: <000c01bf2c1b$0734c060$262d153f@tim> [Andy Robinson] > For that matter (I came into this a bit late), is > there a statement somewhere of what HP actually want > to do? On this list, the best explanation we got was from Guido: they want "internationalization", and "Perl-compatible Unicode regexps". I'm not sure they even know the two aren't identical <0.9 wink>. code-without-requirements-is-like-sex-without-consequences-ly y'rs - tim From guido at CNRI.Reston.VA.US Thu Nov 11 13:03:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 11 Nov 1999 07:03:51 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST." <19991111074450.20451.rocketmail@web606.mail.yahoo.com> References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us> > Let me tell you why you would want to have an encoding > which can be set: > > (1) sday I am on a Japanese Windows box, I have a > string called 'address' and I do 'print address'. If > I see utf8, I see garbage. If I see Shift-JIS, I see > the correct Japanese address. At this point in time, > utf8 is an interchange format but 99% of the world's > data is in various native encodings. > > Analogous problems occur on input. > > (2) I'm using htmlgen, which 'prints' objects to > standard output. My web site is supposed to be > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > etc.) Yes, browsers CAN detect and display UTF8 but > you just don't find UTF8 sites in the real world - and > most users just don't know about the encoding menu, > and will get pissed off if they have to reach for it. > > Ditto for streaming output in some protocol. > > Java solves this (and we could too by hacking stdout) > using Writer classes which are created as wrappers > around an output stream and can take an encoding, but > you lose the flexibility to 'just print'. > > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. You almost convinced me there, but I think this can still be done without changing the default encoding: simply reopen stdout with a different encoding. This is how Java does it. I/O streams with an encoding specified at open() are a very powerful feature. You can hide this in your $PYTHONSTARTUP. Fran?ois Pinard might not like it though... BTW, someone asked what HP asked for: I can't reveal what exactly they asked for, basically because they don't seem to agree amongst themselves. 
The only firm statements I have is that they want i18n and that they want it fast (before the end of the year). The desire from Perl-compatible regexps comes from me, and the only reason is compatibility with re.py. (HP did ask for regexps, but they don't know the difference between POSIX and Perl if it poked them in the eye.) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Thu Nov 11 13:20:39 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit (fwd) Message-ID: <Pine.LNX.4.10.9911110419400.27203-100000@nebula.lyra.org> Andy originally sent this just to me... I replied in kind, but saw that he sent another copy to python-dev. Sending my reply there... ---------- Forwarded message ---------- Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST) From: Greg Stein <gstein at lyra.org> To: andy at robanal.demon.co.uk Subject: Re: [Python-Dev] Internationalization Toolkit [ note: you sent direct to me; replying in kind in case that was your intent ] On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote: >... > Let me tell you why you would want to have an encoding > which can be set: >...snip: two examples of how "print" fails... Neither of those examples are solid reasons for having a default encoding that can be changed. Both can easily be altered at the Python level by using an encoding function before printing. You're asking for convenience, *not* providing a reason. > Java solves this (and we could too) using Writer > classes which are created as wrappers around an output > stream and can take an encoding, but you lose the > flexibility to just print. Not flexibility: convenience. You can certainly do: print encode(u,'Shift-JIS') > I think being able to change encoding would be useful. > What I do not want is to auto-detect it from the > operating system when Python boots - that would be a > portability nightmare. Useful, but not a requirement. Keep the interpreter simple, understandable, and predictable. A module that changes the default over to 'utf-8' because it is interacting with a network object is going to screw up your app if you're relying on an encoding of 'shift-jis' to be present. Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Thu Nov 11 13:49:10 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com> > You almost convinced me there, but I think this can > still be done > without changing the default encoding: simply reopen > stdout with a > different encoding. This is how Java does it. I/O > streams with an > encoding specified at open() are a very powerful > feature. You can > hide this in your $PYTHONSTARTUP. Good point, I'm happy with this. Make sure we specify it in the docs as the right way to do it. In an IDE, we'd have an Options screen somewhere for the output encoding. What the Java code I have seen does is to open a raw file and construct wrappers (InputStreamReader, OutputStreamWriter) around it to do an encoding conversion. This kind of obfuscates what is going on - Python just needs the extra argument. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Thu Nov 11 13:42:51 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 13:42:51 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> Message-ID: <382AB9CB.634A9782@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > def encode(self,u): > > > > """ Return the Unicode object u encoded as Python string. > > This should accept an optional slice parameter, and use it in the > same way as .dump(). Ok. > > def dump(self,u,stream,slice=None): > ... > > def load(self,stream,length=None): > > Why not have something like .wrapFile(f) that returns a file-like > object with all the file methods implemented, and doing to "right > thing" regarding encoding/decoding? That way, the new file-like > object can be used directly with code that works with files and > doesn't care whether it uses 8-bit or unicode strings. See File Output of the latest version: File/Stream Output: ------------------- Since file.write(object) and most other stream writers use the 's#' argument parsing marker, the buffer interface implementation determines the encoding to use (see Buffer Interface). For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. > > Codecs should raise an UnicodeError in case the conversion is > > not possible. > > I think that should be ValueError, or UnicodeError should be a > subclass of ValueError. Ok. > (Can the -X interpreter option be removed yet?) Doesn't Python convert class exceptions to strings when -X is used ? I would guess that many scripts already rely on the class based mechanism (much of my stuff does for sure), so by the time 1.6 is out, I think -X should be considered an option to run pre 1.5 code rather than using it for performance reasons. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 14:01:40 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 14:01:40 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <005701bf2bc3$980f4d60$0501a8c0@bobcat> Message-ID: <382ABE34.5D27C701@lemburg.com> Mark Hammond wrote: > > Marc writes: > > > > modes are evil. python is not perl. etc. > > > > But a requirement by the customer... they want to be able to > > set the locale > > on a per thread basis. Not exactly my preference (I think all locale > > settings should be passed as parameters, not via globals). > > Sure - that is what this customer wants, but we need to be clear about > the "best thing" for Python generally versus what this particular > client wants. > > For example, if we went with UTF-8 as the only default encoding, then > HP may be forced to use a helper function to perform the conversion, > rather than the built-in functions. 
This helper function can use TLS > (in Python) to store the encoding. At least it is localized. > > I agree that having a default encoding that can be changed is a bad > idea. It may make 3 line scripts that need to print something easier > to work with, but at the cost of reliability in large systems. Kinda > like the existing "locale" support, which is thread specific, and is > well known to cause these sorts of problems. The end result is that > in your app, you find _someone_ has changed the default encoding, and > some code no longer works. So the solution is to change the default > encoding back, so _your_ code works again. You just know that whoever > it was that changed the default encoding in the first place is now > going to break - but what else can you do? > > Having a fixed, default encoding may make life slightly more difficult > when you want to work primarily in a different encoding, but at least > your system is predictable and reliable. I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it and if you do you are on your own when something breaks. Think of it as a HP specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Thu Nov 11 16:02:32 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > For explicit handling of Unicode using files, the unicodec module > could provide stream wrappers which provide transparent > encoding/decoding for any open stream (file-like object): Sounds good to me! I guess I just missed, there's been so much going on lately. > XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. Actually, I'd call it unicodec.open(). I asked: > (Can the -X interpreter option be removed yet?) You commented: > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. Gosh, I never thought of it as a performance issue! What I'd like to do is avoid code like this: try: class UnicodeError(ValueError): # well, something would probably go here... pass except TypeError: class UnicodeError: # something slightly different for this one... pass Trying to use class exceptions can be really tedious, and often I'd like to pick up the stuff from Exception. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Thu Nov 11 15:21:50 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Thu, 11 Nov 1999 15:21:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf2c0d$8b866160$262d153f@tim> Message-ID: <382AD0FE.B604876A@lemburg.com> Tim Peters wrote: > > [/F, dripping with code] > > ... > > Note that the 'u' must be followed by four hexadecimal digits. If > > fewer digits are given, the sequence is left in the resulting string > > exactly as given. > > Yuck -- don't let probable error pass without comment. "must be" == "must > be"! I second that. > [moving backwards] > > \uxxxx -- Unicode character with hexadecimal value xxxx. The > > character is stored using UTF-8 encoding, which means that this > > sequence can result in up to three encoded characters. > > The code is fine, but I've gotten confused about what the intent is now. > Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 > literals, but now he's got Unicode-escaped literals instead -- and you favor > an internal 2-byte-per-char Unicode storage format. In that combination of > worlds, is there any use in the *language* (as opposed to in a runtime > module) for \uxxxx -> UTF-8 conversion? No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates). > And MAL, if you're listening, I'm not clear on what a Unicode-escaped > literal means. When you had UTF-8 literals, the meaning of something like > > u"a\340\341" > > was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals > were just a way of specifying a byte stream. As a Unicode-escaped string, I > assume the "a" maps to the Unicode "a", but what of the rest? Are the octal > escapes to be taken as two separate Latin-1 characters (in their role as a > Unicode subset), or as an especially clumsy way to specify a single 16-bit > Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x > escapes. Good points. The conversion goes as follows: ? for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as Unicode ordinal ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead > One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? > There probably should be; and while Guido will hate this, a ur string should > probably *not* leave \uxxxx escapes untouched. Nasties like this are why > Java defines \uxxxx expansion as occurring in a preprocessing step. Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-) > BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or > isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...). Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard: u = unicode('...string with UTF-8 encoded characters...','utf-8') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:23:45 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Thu, 11 Nov 1999 15:23:45 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000601bf2c11$e4b07920$262d153f@tim> Message-ID: <382AD171.D22A1D6E@lemburg.com> Tim Peters wrote: > > [MAL, on Unicode chr() and ord() > > ... > > Because unichr() will always have to return Unicode objects. You don't > > want chr(i) to return Unicode for i>255 and strings for i<256. > > Indeed I do not! > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > I think should be -- it's a good & natural use of polymorphism; introducing > a new function *here* would be as odd as introducing a unilen() function to > get the length of a Unicode string. Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:36:41 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:36:41 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000901bf2c16$8a107420$262d153f@tim> Message-ID: <382AD479.5261B43B@lemburg.com> Tim Peters wrote: > > [Mark Hammond] > > Sure - that is what this customer wants, but we need to be clear about > > the "best thing" for Python generally versus what this particular > > client wants. > > ... > > Having a fixed, default encoding may make life slightly more difficult > > when you want to work primarily in a different encoding, but at least > > your system is predictable and reliable. > > Well said, Mark! Me too. It's like HP is suffering from Windows envy > <wink>. See my other post on the subject... Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings repesented in Unicode. I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write file.write(u.encode('utf-8')) instead of file.write(u) # let's hope this goes out as UTF-8... Using the <default encoding> as site dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:26:59 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:26:59 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000801bf2c16$43f9a4c0$262d153f@tim> Message-ID: <382AD233.BE6DE888@lemburg.com> Tim Peters wrote: > > [/F] > > last time I checked, there were no characters (even in the > > ISO standard) outside the 16-bit range. has that changed? > > [MAL] > > No, but people are already thinking about it and there is > > a defined range in the >16-bit area for private encodings > > (F0000..FFFFD and 100000..10FFFD). 
> > Over the decades I've developed a rule of thumb that has never wound up > stuck in my ass <wink>: If I engineer code that I expect to be in use for N > years, I make damn sure that every internal limit is at least 10x larger > than the largest I can conceive of a user making reasonable use of at the > end of those N years. The invariable result is that the N years pass, and > fewer than half of the users have bumped into the limit <0.5 wink>. > > At the risk of offending everyone, I'll suggest that, qualitatively > speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just > replaced "256 characters?! We'll *never* run out of those!" with 64K. But > when Asian languages consume them 7K at a pop, 64K isn't even in my 10x > comfort range for some individual languages. In just a few months, Unicode > 3 will already have used up > 56K of the 64K slots. > > As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x > zone, for about a decade. If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M character at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 15:47:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 15:47:49 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us> Message-ID: <382AD715.66DBA125@lemburg.com> Guido van Rossum wrote: > > > Let me tell you why you would want to have an encoding > > which can be set: > > > > (1) sday I am on a Japanese Windows box, I have a > > string called 'address' and I do 'print address'. If > > I see utf8, I see garbage. If I see Shift-JIS, I see > > the correct Japanese address. At this point in time, > > utf8 is an interchange format but 99% of the world's > > data is in various native encodings. > > > > Analogous problems occur on input. > > > > (2) I'm using htmlgen, which 'prints' objects to > > standard output. My web site is supposed to be > > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan, > > etc.) Yes, browsers CAN detect and display UTF8 but > > you just don't find UTF8 sites in the real world - and > > most users just don't know about the encoding menu, > > and will get pissed off if they have to reach for it. > > > > Ditto for streaming output in some protocol. > > > > Java solves this (and we could too by hacking stdout) > > using Writer classes which are created as wrappers > > around an output stream and can take an encoding, but > > you lose the flexibility to 'just print'. > > > > I think being able to change encoding would be useful. > > What I do not want is to auto-detect it from the > > operating system when Python boots - that would be a > > portability nightmare. > > You almost convinced me there, but I think this can still be done > without changing the default encoding: simply reopen stdout with a > different encoding. This is how Java does it. I/O streams with an > encoding specified at open() are a very powerful feature. You can > hide this in your $PYTHONSTARTUP. 
True and it probably covers all cases where setting the default encoding to something other than UTF-8 makes sense. I guess you've convinced me there ;-) The current proposal has wrappers around stream for this purpose: For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. The above can be done using: import sys,unicodec sys.stdin = unicodec.stream(sys.stdin,'jis') sys.stdout = unicodec.stream(sys.stdout,'jis') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack at oratrix.nl Thu Nov 11 16:58:39 1999 From: jack at oratrix.nl (Jack Jansen) Date: Thu, 11 Nov 1999 16:58:39 +0100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com> Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> > > [MAL, on Unicode chr() and ord() > > > ... > > > Because unichr() will always have to return Unicode objects. You don't > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > OTOH, ord() could probably be extended to also work on Unicode objects. > Fine. So I'll drop the uniord() API and extend ord() instead. Hmm, then wouldn't it be more logical to drop unichr() too, but add an optional parameter to chr() to specify what sort of a string you want? The type-object of a unicode string comes to mind... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From bwarsaw at cnri.reston.va.us Thu Nov 11 17:04:29 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com> <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us> <382AB9CB.634A9782@lemburg.com> Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us> >>>>> "M" == M <mal at lemburg.com> writes: M> Doesn't Python convert class exceptions to strings when -X is M> used ? I would guess that many scripts already rely on the M> class based mechanism (much of my stuff does for sure), so by M> the time 1.6 is out, I think -X should be considered an option M> to run pre 1.5 code rather than using it for performance M> reasons. This is a little off-topic so I'll be brief. When using -X Python never even creates the class exceptions, so it isn't really a conversion. It just uses string exceptions and tries to craft tuples for what would be the superclasses in the class-based exception hierarchy. Yes, class-based exceptions are a bit of a performance hit when you are catching exceptions in Python (because they need to be instantiated), but they're just so darn *useful*. I wouldn't mind seeing the -X option go away for 1.6. 
-Barry From captainrobbo at yahoo.com Thu Nov 11 17:08:15 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> > See my other post on the subject... > > Note that if we make UTF-8 the standard encoding, > nearly all > special Latin-1 characters will produce UTF-8 errors > on input > and unreadable garbage on output. That will probably > be unacceptable > in Europe. To remedy this, one would *always* have > to use > u.encode('latin-1') to get readable output for > Latin-1 strings > repesented in Unicode. You beat me to it - a colleague and I were just discussing this verbally. Specifically we Brits will get annoyed as soon as we read in a text file with pound (sterling) signs. We concluded that the only reasonable default (if you have one at all) is pure ASCII. At least that way I will get a clear and intelligible warning when I load in such a file, and will remember to specify ISO-Latin-1. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Thu Nov 11 16:59:21 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 16:59:21 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <382AE7D9.147D58CB@lemburg.com> I wonder how we could add %-formatting to Unicode strings without duplicating the PyString_Format() logic. First, do we need Unicode object %-formatting at all ? Second, here is an emulation using strings and <default encoding> that should give an idea of one could work with the different encodings: s = '%s %i abc???' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string via Unicode s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) Note that .encode() defaults to the current setting of <default encoding>. Provided u maps to Latin-1, an alternative would be: u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 18:04:37 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:04:37 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl> Message-ID: <382AF725.FC66C9B6@lemburg.com> Jack Jansen wrote: > > > > [MAL, on Unicode chr() and ord() > > > > ... > > > > Because unichr() will always have to return Unicode objects. You don't > > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > > > OTOH, ord() could probably be extended to also work on Unicode objects. > > > Fine. So I'll drop the uniord() API and extend ord() instead. > > Hmm, then wouldn't it be more logical to drop unichr() too, but add an > optional parameter to chr() to specify what sort of a string you want? The > type-object of a unicode string comes to mind... Like: import types uc = chr(12,types.UnicodeType) ... looks overly complicated, IMHO. 
uc = unichr(12) and u = unicode('abc') look pretty intuitive to me. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 11 18:31:34 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 11 Nov 1999 18:31:34 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com> Message-ID: <382AFD76.A0D3FEC4@lemburg.com> Andy Robinson wrote: > > > See my other post on the subject... > > > > Note that if we make UTF-8 the standard encoding, > > nearly all > > special Latin-1 characters will produce UTF-8 errors > > on input > > and unreadable garbage on output. That will probably > > be unacceptable > > in Europe. To remedy this, one would *always* have > > to use > > u.encode('latin-1') to get readable output for > > Latin-1 strings > > repesented in Unicode. > > You beat me to it - a colleague and I were just > discussing this verbally. Specifically we Brits will > get annoyed as soon as we read in a text file with > pound (sterling) signs. > > We concluded that the only reasonable default (if you > have one at all) is pure ASCII. At least that way I > will get a clear and intelligible warning when I load > in such a file, and will remember to specify > ISO-Latin-1. Well, Guido's post made me rethink the approach... 1. Setting <default encoding> to any non UTF encoding will result in data lossage due to the encoding limits imposed by the other formats -- this is dangerous and will result in errors (some of which may not even be noticed due to the interpreter ignoring them) in case your strings use non encodable characters. 2. You basically only want to set <default encoding> to anything other than UTF-8 for stream input and output. This can be done using the unicodec stream wrapper without too much inconvenience. (We'll have to extend the wrapper a little, though, because it currently only accept Unicode objects for writing and always return Unicode object when reading.) 3. We should leave the issue open until some code is there to be tested...
I have a feeling that there will be quite a few strange effects when APIs expecting strings are fed with Unicode objects returning UTF-8. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Fri Nov 12 02:10:09 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 12:10:09 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382ABE34.5D27C701@lemburg.com> Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat> > Mark Hammond wrote: > > Having a fixed, default encoding may make life slightly > more difficult > > when you want to work primarily in a different encoding, > but at least > > your system is predictable and reliable. > > I think the discussion on this is getting a little too hot. Really - I see it as moving to a rational consensus that doesnt support the proposal in this regard. I see no heat in it at all. Im sorry if you saw my post or any of the followups as "emotional", but I certainly not getting passionate about this. I dont see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because a we are trying to get it right doesnt mean we are getting heated. > The point > is simply that the option of changing the per-thread default encoding > is there. You are not required to use it and if you do you are on > your own when something breaks. Hrm - Im having serious trouble following your logic here. If make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I wont necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP: > Think of it as a HP specific feature... perhaps I should wrap the code > in #ifdefs and leave it undocumented. That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely dissapointed if the fact that HP are throwing money for this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this. Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference. Maybe Im missing something. Can you explain why it is good enough everyone else to be required to assume there is no default encoding, but HP get their thread specific global? Are their requirements greater than anyone elses? Is everyone else not as important? What would you, as a consultant, recommend to people who arent HP, but have a similar requirement? It would seem obvious to me that HPs requirement can be met in "pure Python", thereby keeping this out of the core all together... Mark. 
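The "pure Python" route could look roughly like the sketch below -- a per-thread default encoding kept in an ordinary helper module, entirely outside the interpreter core, along the lines Greg suggested (a module-global dictionary keyed by thread id). The module and function names are invented for illustration, and present-day Python syntax is used:

    # threadencoding.py -- hypothetical helper, not part of any proposal here
    import threading

    _fallback = "utf-8"          # process-wide fallback
    _per_thread = {}             # maps thread id -> encoding name

    def set_default(encoding):
        # affects only the calling thread
        _per_thread[threading.get_ident()] = encoding

    def get_default():
        return _per_thread.get(threading.get_ident(), _fallback)

    def encode(u):
        # encode a string using the calling thread's default encoding
        return u.encode(get_default())

A thread that wants Shift-JIS output calls set_default('shift_jis') once and then writes encode(u) to its stream; no other thread, and nothing in the core, is affected.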
From gmcm at hypernet.com Fri Nov 12 03:01:23 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 11 Nov 1999 21:01:23 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> References: <382ABE34.5D27C701@lemburg.com> Message-ID: <1269750417-7621469@hypernet.com> [per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon From tim_one at email.msn.com Fri Nov 12 06:04:44 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:04:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AB9CB.634A9782@lemburg.com> Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim> [MAL] >>> Codecs should raise an UnicodeError in case the conversion is >>> not possible. [Fred L. Drake, Jr.] >> I think that should be ValueError, or UnicodeError should be a >> subclass of ValueError. >> (Can the -X interpreter option be removed yet?) [MAL] > Doesn't Python convert class exceptions to strings when -X is > used ? I would guess that many scripts already rely on the class > based mechanism (much of my stuff does for sure), so by the time > 1.6 is out, I think -X should be considered an option to run > pre 1.5 code rather than using it for performance reasons. -X is a red herring. That is, do what seems best without regard for -X. I already added one subclass exception to the CVS tree (UnboundLocalError as a subclass of NameError), and in doing that had to figure out how to make it do the right thing under -X too. It's a bit clumsy to arrange, but not a problem. From tim_one at email.msn.com Fri Nov 12 06:18:09 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:18:09 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382AD0FE.B604876A@lemburg.com> Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> [MAL] > ... > The conversion goes as follows: > ? for single characters (and this includes all \XXX sequences > except \uXXXX), take the ordinal and interpret it as Unicode > ordinal for \uXXXX sequences, insert the Unicode character > with ordinal 0xXXXX instead Perfect! [about "raw" Unicode strings] > ... > Not sure whether we really need to make this even more complicated... > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > filenames won't hurt much in the context of those \uXXXX monsters :-) Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventurally become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. 
So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever. > ... > BTW, if you want to type in UTF-8 strings and have them converted > to Unicode, you can use the standard: > > u = unicode('...string with UTF-8 encoded characters...','utf-8') That's what I figured, and thanks for the confirmation. From tim_one at email.msn.com Fri Nov 12 06:42:32 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 00:42:32 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD233.BE6DE888@lemburg.com> Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> [MAL] > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > signal failure of this assertion at Unicode object construction time > via an exception. That way we are within the standard, can use > reasonably fast code for Unicode manipulation and add those extra 1M > character at a later stage. I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better. From tim_one at email.msn.com Fri Nov 12 07:02:01 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:02:01 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382AD479.5261B43B@lemburg.com> Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim> [MAL] > Note that if we make UTF-8 the standard encoding, nearly all > special Latin-1 characters will produce UTF-8 errors on input > and unreadable garbage on output. That will probably be unacceptable > in Europe. To remedy this, one would *always* have to use > u.encode('latin-1') to get readable output for Latin-1 strings > repesented in Unicode. I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. > I'd rather see this happen the other way around: *always* explicitly > state the encoding you want in case you rely on it, e.g. write > > file.write(u.encode('utf-8')) > > instead of > > file.write(u) # let's hope this goes out as UTF-8... By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1 > Using the <default encoding> as site dependent setting is useful > for convenience in those cases where the output format should be > readable rather than parseable. Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. 
The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either or good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core. From tim_one at email.msn.com Fri Nov 12 07:17:08 1999 From: tim_one at email.msn.com (Tim Peters) Date: Fri, 12 Nov 1999 01:17:08 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim> [Mark Hammond] > ... > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. I can resolve this easily, but only with input from Guido. Guido, did HP's check clear yet? If so, we can ignore them <wink>. From captainrobbo at yahoo.com Fri Nov 12 09:15:19 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> --- Gordon McMillan <gmcm at hypernet.com> wrote: > [per-thread defaults] > > C'mon guys, hasn't anyone ever played consultant > before? The > idea is obviously brain-dead. OTOH, they asked for > it > specifically, meaning they have some assumptions > about how > they think they're going to use it. If you give them > what they > ask for, you'll only have to fix it when they > realize there are > other ways of doing things that don't work with > per-thread > defaults. So, you find out why they think it's a > good thing; you > make it easy for them to code this way (without > actually using > per-thread defaults) and you don't make a fuss about > it. More > than likely, they won't either. > I wrote directly to ask them exactly this last night. Let's forget the per-thread thing until we get an answer. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Fri Nov 12 10:27:29 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:27:29 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim> Message-ID: <382BDD81.458D3125@lemburg.com> Tim Peters wrote: > > [MAL] > > ... > > The conversion goes as follows: > > ? for single characters (and this includes all \XXX sequences > > except \uXXXX), take the ordinal and interpret it as Unicode > > ordinal for \uXXXX sequences, insert the Unicode character > > with ordinal 0xXXXX instead > > Perfect! Thanks :-) > [about "raw" Unicode strings] > > ... > > Not sure whether we really need to make this even more complicated... > > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or > > filenames won't hurt much in the context of those \uXXXX monsters :-) > > Alas, this won't stand over the long term. 
Eventually people will write > Python using nothing but Unicode strings -- "regular strings" will > eventurally become a backward compatibility headache <0.7 wink>. IOW, > Unicode regexps and Unicode docstrings and Unicode formatting ops ... > nothing will escape. Nor should it. > > I don't think it all needs to be done at once, though -- existing languages > usually take years to graft in gimmicks to cover all the fine points. So, > happy to let raw Unicode strings pass for now, as a relatively minor point, > but without agreeing it can be ignored forever. Agreed... note that you could also write your own codec for just this reason and then use: u = unicode('....\u1234...\...\...','raw-unicode-escaped') Put that into a function called 'ur' and you have: u = ur('...\u4545...\...\...') which is not that far away from ur'...' w/r to cosmetics. > > ... > > BTW, if you want to type in UTF-8 strings and have them converted > > to Unicode, you can use the standard: > > > > u = unicode('...string with UTF-8 encoded characters...','utf-8') > > That's what I figured, and thanks for the confirmation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:00:47 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:00:47 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com> Message-ID: <382BD73E.E6729C79@lemburg.com> Andy Robinson wrote: > > --- Gordon McMillan <gmcm at hypernet.com> wrote: > > [per-thread defaults] > > > > C'mon guys, hasn't anyone ever played consultant > > before? The > > idea is obviously brain-dead. OTOH, they asked for > > it > > specifically, meaning they have some assumptions > > about how > > they think they're going to use it. If you give them > > what they > > ask for, you'll only have to fix it when they > > realize there are > > other ways of doing things that don't work with > > per-thread > > defaults. So, you find out why they think it's a > > good thing; you > > make it easy for them to code this way (without > > actually using > > per-thread defaults) and you don't make a fuss about > > it. More > > than likely, they won't either. > > > > I wrote directly to ask them exactly this last night. > Let's forget the per-thread thing until we get an > answer. That's the way to go, Andy. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:44:14 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:44:14 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <007a01bf2caa$aabdef60$0501a8c0@bobcat> Message-ID: <382BE16E.D17C80E1@lemburg.com> Mark Hammond wrote: > > > Mark Hammond wrote: > > > Having a fixed, default encoding may make life slightly > > more difficult > > > when you want to work primarily in a different encoding, > > but at least > > > your system is predictable and reliable. > > > > I think the discussion on this is getting a little too hot. > > Really - I see it as moving to a rational consensus that doesnt > support the proposal in this regard. I see no heat in it at all. Im > sorry if you saw my post or any of the followups as "emotional", but I > certainly not getting passionate about this. 
I dont see any of this > as affecting me personally. I believe that I can replace my Unicode > implementation with this either way we go. Just because a we are > trying to get it right doesnt mean we are getting heated. Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one. > > The point > > is simply that the option of changing the per-thread default > encoding > > is there. You are not required to use it and if you do you are on > > your own when something breaks. > > Hrm - Im having serious trouble following your logic here. If make > _any_ assumptions about a default encoding, I am in danger of > breaking. I may not choose to change the default, but as soon as > _anyone_ does, unrelated code may break. > > I agree that I will be "on my own", but I wont necessarily have been > the one that changed it :-( Sure there are some very subtile dangers in setting the default to anything other than the default ;-) For some this risk may be worthwhile taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them. > The only answer I can see is, as you suggest, to ignore the fact that > there is _any_ default. Always specify the encoding. But obviously > this is not good enough for HP: > > > Think of it as a HP specific feature... perhaps I should wrap the > code > > in #ifdefs and leave it undocumented. > > That would work - just ensure that no standard Python has those > #ifdefs turned on :-) I would be sorely dissapointed if the fact that > HP are throwing money for this means they get every whim implemented > in the core language. Imagine the outcry if it were instead MS' > money, and you were attempting to put an MS spin on all this. > > Are you writing a module for HP, or writing a module for Python that > HP are assisting by providing some funding? Clear difference. IMO, > it must also be seen that there is a clear difference. > > Maybe Im missing something. Can you explain why it is good enough > everyone else to be required to assume there is no default encoding, > but HP get their thread specific global? Are their requirements > greater than anyone elses? Is everyone else not as important? What > would you, as a consultant, recommend to people who arent HP, but have > a similar requirement? It would seem obvious to me that HPs > requirement can be met in "pure Python", thereby keeping this out of > the core all together... Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:04:31 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Fri, 12 Nov 1999 10:04:31 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <001001bf2cd3$6fa57820$fd2d153f@tim> Message-ID: <382BD81F.B2BC896A@lemburg.com> Tim Peters wrote: > > [MAL] > > Note that if we make UTF-8 the standard encoding, nearly all > > special Latin-1 characters will produce UTF-8 errors on input > > and unreadable garbage on output. That will probably be unacceptable > > in Europe. To remedy this, one would *always* have to use > > u.encode('latin-1') to get readable output for Latin-1 strings > > repesented in Unicode. > > I think it's time for the Europeans to pronounce on what's acceptable in > Europe. To the limited extent that I can pretend I'm Eurpoean, I'm happy > with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea. Agreed. > > I'd rather see this happen the other way around: *always* explicitly > > state the encoding you want in case you rely on it, e.g. write > > > > file.write(u.encode('utf-8')) > > > > instead of > > > > file.write(u) # let's hope this goes out as UTF-8... > > By the same argument, those pesky Europeans who are relying on Latin-1 > should write > > file.write(u.encode('latin-1')) > > instead of > > file.write(u) # let's hope this goes out as Latin-1 Right. > > Using the <default encoding> as site dependent setting is useful > > for convenience in those cases where the output format should be > > readable rather than parseable. > > Well, "convenience" is always the argument advanced in favor of modes. > Conflicts and nasty intermittent bugs are always the result. The latter > will happen under Guido's idea too, as various careless modules rebind stdin > & stdout to their own ideas of what "the proper" encoding should be. But at > least the blame doesn't fall on the core language then <0.3 wink>. > > Since there doesn't appear to be anything (either or good or bad) you can do > (or avoid) by using Guido's scheme instead of magical core thread state, > there's no *need* for the latter. That is, it can be done with a user-level > API without involving the core. Dito :-) I have nothing against telling people to take care about the problem in user space (meaning: not done by the core interpreter) and I'm pretty sure that HP will agree on this too, provided we give them the proper user space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 10:16:57 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 10:16:57 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <382BDB09.55583F28@lemburg.com> Tim Peters wrote: > > [MAL] > > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and > > signal failure of this assertion at Unicode object construction time > > via an exception. That way we are within the standard, can use > > reasonably fast code for Unicode manipulation and add those extra 1M > > character at a later stage. > > I think this is reasonable. > > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness, that deserves a closer look (it's an ingenious > encoding scheme that works correctly with a surprising number of existing > 8-bit string routines as-is). 
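(A rough sketch of the point just quoted -- the program below is illustrative only, not code from the thread: byte-oriented C string routines keep working on UTF-8 data as-is, because no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII byte.)

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* UTF-8 bytes for "Linköping, Sweden"; the o-umlaut is \303\266 */
        const char *s = "Link\303\266ping, Sweden";

        printf("%s\n", strstr(s, "Sweden"));    /* finding an ASCII substring works */
        printf("%d\n", (int)strcspn(s, ","));   /* scanning for an ASCII delimiter works */
        printf("%.4s\n", s);                    /* taking a prefix at a known byte offset works */

        /* What does not carry over is anything that equates bytes with
           characters: strlen() reports 18 bytes for 17 characters, and
           byte-by-byte toupper() would leave the o-umlaut untouched. */
        printf("%d\n", (int)strlen(s));
        return 0;
    }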
Indexing UTF-8 strings is greatly speeded by > adding a simple finger (i.e., store along with the string an index+offset > pair identifying the most recent position indexed to -- since string > indexing is overwhelmingly sequential, this makes most indexing > constant-time; and UTF-8 can be scanned either forward or backward from a > random internal point because "the first byte" of each encoding is > recognizable as such). Here are some arguments for using the proposed UTF-16 strategy instead:
- all characters have the same length; indexing is fast
- conversion APIs to platform dependent wchar_t implementations are fast because they can either simply copy the content or extend the 2 bytes to 4 bytes
- UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages
- from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."
Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer. > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 11:20:16 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > <IMO> > Since this is the first time a Python Consortium member is > pushing development, I think we can learn a lot here. For one, > it should be clear that money doesn't buy everything, OTOH, > we cannot put the whole thing at risk just because > of some minor disagreement that cannot be solved between the > parties. The standard solution for the latter should be a > customized Python interpreter. > </IMO> hehe... funny you mention this. Go read the Consortium docs. Last time that I read them, there are no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 11:24:56 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BE16E.D17C80E1@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120221010.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A.
Lemburg wrote: > Sure there are some very subtile dangers in setting the default > to anything other than the default ;-) For some this risk may > be worthwhile taking, for others not. In fact, in large projects > I would never take such a risk... I'm sure we can get this > message across to them. It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 11:30:04 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> Message-ID: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Fri Nov 12 11:30:28 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 11:30:28 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120214521.27203-100000@nebula.lyra.org> Message-ID: <382BEC44.A2541C7E@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > <IMO> > > Since this is the first time a Python Consortium member is > > pushing development, I think we can learn a lot here. For one, > > it should be clear that money doesn't buy everything, OTOH, > > we cannot put the whole thing at risk just because > > of some minor disagreement that cannot be solved between the > > parties. The standard solution for the latter should be a > > customized Python interpreter. > > </IMO> > > hehe... funny you mention this. Go read the Consortium docs. Last time > that I read them, there are no "parties" to reach consensus. *Every* > technical decision regarding the Python language falls to the Technical > Director (Guido, of course). I looked. I found nothing that can override > the T.D.'s decisions and no way to force a particular decision. > > Guido is still the Benevolent Dictator :-) Sure, but have you considered the option of a member simply bailing out ? 
HP could always stop funding Unicode integration. That wouldn't help us either... > Cheers, > -g > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > title does have the word Benevolent in it, so things are cool... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 11:39:45 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST) Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) In-Reply-To: <382BEC44.A2541C7E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: >... > Sure, but have you considered the option of a member simply bailing > out ? HP could always stop funding Unicode integration. That wouldn't > help us either... I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen. >... > > p.s. yes, there is always the caveat that "sure, Guido has final say" but > > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's > > title does have the word Benevolent in it, so things are cool... Cheers, -g -- Greg Stein, http://www.lyra.org/ From Mike.Da.Silva at uk.fid-intl.com Fri Nov 12 12:00:49 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:00:49 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Most of the ASCII string functions do indeed work for UTF-8. I have made extensive use of this feature when writing translation logic to harmonize ASCII text (an SQL statement) with substitution parameters that must be converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is a superset of ASCII, this all works fine. Some of the character classification functions etc can be flaky when used with UTF8 characters outside the ASCII range, but simple string operations work fine. As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an internal string representation are: 1. UTF-8 allows all characters to be displayed (in some form or other) on the users machine, with or without native fonts installed. Naturally anything outside the ASCII range will be garbage, but it is an immense debugging aid when working with character encodings to be able to touch and feel something recognizable. Trying to decode a block of raw UTF-16 is a pain. 2. UTF-8 works with most existing string manipulation libraries quite happily. It is also portable (a char is always 8 bits, regardless of platform; wchar_t varies between 16 and 32 bits depending on the underlying operating system (although unsigned short does seems to work across platforms, in my experience). 3. UTF-16 has some advantages in providing fixed width characters and, (ignoring surrogate pairs etc) a modeless encoding space. This is an advantage for fast string operations, especially on CPU's that have efficient operations for handling 16bit data. 4. UTF-16 would directly support a tightly coupled character properties engine, which would enable Unicode compliant case folding and character decomposition to be performed without an intermediate UTF-8 <----> UTF-16 translation step. 
5. UTF-16 requires string operations that do not make assumptions about nulls - this means re-implementing most of the C runtime functions to work with unsigned shorts. Regards, Mike da Silva -----Original Message----- From: Greg Stein [SMTP:gstein at lyra.org] Sent: 12 November 1999 10:30 To: Tim Peters Cc: python-dev at python.org Subject: RE: [Python-Dev] Internationalization Toolkit On Fri, 12 Nov 1999, Tim Peters wrote: >... > Using UTF-8 internally is also reasonable, and if it's being rejected on the > grounds of supposed slowness No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat... >... > I expect either would work well. It's at least curious that Perl and Tcl > both went with UTF-8 -- does anyone think they know *why*? I don't. The > people here saying UCS-2 is the obviously better choice are all from the > Microsoft camp <wink>. It's not obvious to me, but then neither do I claim > that UTF-8 is obviously better. Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/ _______________________________________________ Python-Dev maillist - Python-Dev at python.org http://www.python.org/mailman/listinfo/python-dev From fredrik at pythonware.com Fri Nov 12 12:23:24 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:24 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> > Besides, the Unicode object will have a buffer containing the > <default encoding> representation of the object, which, if all goes > well, will always hold the UTF-8 value. <rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...) > RE engines etc. can then directly work with this buffer. sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu). > > I expect either would work well. It's at least curious that Perl and Tcl > > both went with UTF-8 -- does anyone think they know *why*? I don't. The > > people here saying UCS-2 is the obviously better choice are all from the > > Microsoft camp <wink>. (hey, I'm not a microsofter. 
but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F> From fredrik at pythonware.com Fri Nov 12 12:23:38 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:23:38 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com> > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ (and you can replace "unsigned short" with "whatever's suitable on this platform") </F> From fredrik at pythonware.com Fri Nov 12 12:36:03 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 12:36:03 +0100 Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit) References: <Pine.LNX.4.10.9911120238230.27203-100000@nebula.lyra.org> Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com> > Guido is a fair and reasonable Dictator... he wouldn't let that > happen. ...but where is he when we need him? ;-) </F> From Mike.Da.Silva at uk.fid-intl.com Fri Nov 12 12:43:21 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Fri, 12 Nov 1999 11:43:21 -0000 Subject: [Python-Dev] Internationalization Toolkit Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Fredrik Lundh wrote: > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. footnote: the mad scientist has been there and done that: http://www.pythonware.com/madscientist/ <http://www.pythonware.com/madscientist/> (and you can replace "unsigned short" with "whatever's suitable on this platform") Surely using a different type on different platforms means that we throw away the concept of a platform independent Unicode string? I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. Does this mean that to transfer a file between a Windows box and Solaris, an implicit conversion has to be done to go from 16 bits to 32 bits (and vice versa)? What about byte ordering issues? Or do you mean whatever 16 bit data type is available on the platform, with a standard (platform independent) byte ordering maintained? Mike da S From fredrik at pythonware.com Fri Nov 12 13:16:24 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 13:16:24 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Mike wrote: > Surely using a different type on different platforms means that we throw > away the concept of a platform independent Unicode string? > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. so? the interchange format doesn't have to be the same as the internal format, does it? > Does this mean that to transfer a file between a Windows box and Solaris, an > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > versa)? What about byte ordering issues? 
no problem at all: unicode has special byte order marks for this purpose (and utf-8 doesn't care, of course). > Or do you mean whatever 16 bit data type is available on the platform, with > a standard (platform independent) byte ordering maintained? well, my preference is a 16-bit data type in the plat- form's native byte order (exactly how it's done in the unicode module -- for the moment, it can use the platform's wchar_t, but only if it happens to be a 16-bit unsigned type). gives you good performance, compact storage, and cleanest possible code. ... anyway, I think it would help the discussion a little bit if people looked at (and played with) the existing code base. at least that'll change arguments like "but then we have to implement that" to "but then we have to maintain that code" ;-) </F> From captainrobbo at yahoo.com Fri Nov 12 13:13:03 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit Message-ID: <19991112121303.27452.rocketmail@ web605.yahoomail.com> --- "Da Silva, Mike" <Mike.Da.Silva at uk.fid-intl.com> wrote: > As I see it, the relative pros and cons of UTF-8 > versus UTF-16 for use as an > internal string representation are: > [snip] > Regards, > Mike da Silva > Note that by going with UTF16, we get both. We will certainly have a codec for utf8, just as we will for ISO-Latin-1, Shift-JIS or whatever. And a perfectly ordinary Python string is a great place to hold UTF8; you can look at it and use most of the ordinary string algorithms on it. I presume no one is actually advocating dropping ordinary Python strings, or the ability to do rawdata = open('myfile.txt', 'rb').read() without any transformations? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mhammond at skippinet.com.au Fri Nov 12 13:27:19 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 12 Nov 1999 23:27:19 +1100 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat> /F writes > anyway, I think it would help the discussion a little bit > if people looked at (and played with) the existing code > base. at least that'll change arguments like "but then > we have to implement that" to "but then we have to > maintain that code" ;-) I second that. It is good enough for me (although my requirements arent stringent) - its been used on CE, so would slot directly into the win32 stuff. It is pretty much the consensus of the string-sig of last year, but as code! The only "problem" with it is the code that hasnt been written yet, specifically: * Encoders as streams, and a concrete proposal for them. * Decent PyArg_ParseTuple support and Py_BuildValue support. * The ord(), chr() stuff, and other stuff around the edges no doubt. Couldnt we start with Fredriks implementation, and see how the rest turns out? Even if we do choose to change the underlying Unicode implementation to use a different native encoding, the interface to the PyUnicode_Type would remain pretty similar. The advantage is that we have something now to start working with for the rest of the support we need. Mark. 
From mal at lemburg.com Fri Nov 12 13:38:44 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 13:38:44 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.4 Message-ID: <382C0A54.E6E8328D@lemburg.com> I've uploaded a new version of the proposal which incorporates a lot of what has been discussed on the list. Thanks to everybody who helped so far. Note that I have extended the list of references for those who want to join in, but are in need of more background information. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open:
- support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ )
- support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one, capital letters starting a word can be different than ones occurring in the middle, there are locale dependent deviations from the standard mappings.
- support for numbers, digits, whitespace, etc.
- support (or no support) for private code point areas
- should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>:
  s = '%s %i abc???'   # a Latin-1 encoded string
  t = (u,3)
  # Convert Latin-1 s to a <default encoding> string
  s1 = unicode(s,'latin-1').encode()
  # The '%s' will now add u in <default encoding>
  s2 = s1 % t
  # Finally, convert the <default encoding> encoded string to Unicode
  u1 = unicode(s2)
- specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 14:11:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 14:11:26 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> Message-ID: <382C11FE.D7D9F916@lemburg.com> Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > <default encoding> representation of the object, which, if all goes > > well, will always hold the UTF-8 value. > > <rant> > > over my dead body, that one... Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 14:01:28 1999 From: mal at lemburg.com (M.-A.
Lemburg) Date: Fri, 12 Nov 1999 14:01:28 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> Message-ID: <382C0FA8.ACB6CCD6@lemburg.com> Fredrik Lundh wrote: > > Mike wrote: > > Surely using a different type on different platforms means that we throw > > away the concept of a platform independent Unicode string? > > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits. > > so? the interchange format doesn't have to be > the same as the internal format, does it? The interchange format (marshal + pickle) is defined as UTF-8, so there's no problem with endianness or missing bits w/r to shipping Unicode data from one platform to another. > > Does this mean that to transfer a file between a Windows box and Solaris, an > > implicit conversion has to be done to go from 16 bits to 32 bits (and vice > > versa)? What about byte ordering issues? > > no problem at all: unicode has special byte order > marks for this purpose (and utf-8 doesn't care, of > course). Access to this mark will go into sys: sys.bom. > > Or do you mean whatever 16 bit data type is available on the platform, with > > a standard (platform independent) byte ordering maintained? > > well, my preference is a 16-bit data type in the plat- > form's native byte order (exactly how it's done in the > unicode module -- for the moment, it can use the > platform's wchar_t, but only if it happens to be a > 16-bit unsigned type). gives you good performance, > compact storage, and cleanest possible code. The 0.4 proposal fixes this to 16-bit unsigned short using UTF-16 encoding with checks for surrogates. This covers all defined standard Unicode character points, is fast, etc. pp... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 12:15:15 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 12:15:15 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <382BF6C3.D79840EC@lemburg.com> "Da Silva, Mike" wrote: > > Most of the ASCII string functions do indeed work for UTF-8. I have made > extensive use of this feature when writing translation logic to harmonize > ASCII text (an SQL statement) with substitution parameters that must be > converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is > a superset of ASCII, this all works fine. > > Some of the character classification functions etc can be flaky when used > with UTF8 characters outside the ASCII range, but simple string operations > work fine. That's why there's the <defencbuf> buffer which holds the UTF-8 encoded value... > As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an > internal string representation are: > > 1. UTF-8 allows all characters to be displayed (in some form or other) > on the users machine, with or without native fonts installed. Naturally > anything outside the ASCII range will be garbage, but it is an immense > debugging aid when working with character encodings to be able to touch and > feel something recognizable. Trying to decode a block of raw UTF-16 is a > pain. True. > 2. UTF-8 works with most existing string manipulation libraries quite > happily. 
It is also portable (a char is always 8 bits, regardless of > platform; wchar_t varies between 16 and 32 bits depending on the underlying > operating system (although unsigned short does seems to work across > platforms, in my experience). You mean with the compiler applying the needed 16->32 bit extension ? > 3. UTF-16 has some advantages in providing fixed width characters and, > (ignoring surrogate pairs etc) a modeless encoding space. This is an > advantage for fast string operations, especially on CPU's that have > efficient operations for handling 16bit data. Right and this is major argument for using 16 bit encodings without state internally. > 4. UTF-16 would directly support a tightly coupled character properties > engine, which would enable Unicode compliant case folding and character > decomposition to be performed without an intermediate UTF-8 <----> UTF-16 > translation step. Could you elaborate on this one ? It is one of the open issues in the proposal. > 5. UTF-16 requires string operations that do not make assumptions about > nulls - this means re-implementing most of the C runtime functions to work > with unsigned shorts. AFAIK, the RE engines in Python are 8-bit clean... BTW, wouldn't it be possible to take pcre and have it use Py_Unicode instead of char ? [Of course, there would have to be some extensions for character classes etc.] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Fri Nov 12 14:43:12 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 14:43:12 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... why? I don't understand why "s" and "s#" has to deal with encoding issues at all... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. hmm. maybe that's exactly what I want... </F> From fdrake at acm.org Fri Nov 12 15:34:56 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. 
Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there it's an "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Fri Nov 12 15:57:15 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Access to this mark will go into sys: sys.bom. Can the name in sys be a little more descriptive? sys.byte_order_mark would be reasonable. I think that a support module (possibly unicodec) should provide constants for all four byte order marks as strings (2- & 4-byte, little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, etc. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fredrik at pythonware.com Fri Nov 12 16:00:45 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 12 Nov 1999 16:00:45 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com> Fred L. Drake, Jr. <fdrake at acm.org> wrote: > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings.

from unicode import *

def getname():
    # hidden in some database engine, or so...
    return unicode("Linköping", "iso-8859-1")

...

name = getname()

# emulate automatic conversion to utf-8
name = str(name)

# print it in uppercase, in the usual way
import string
print string.upper(name)

## LINK??PING

I don't know, but I think that I think that it perhaps should raise an exception instead... </F> From mal at lemburg.com Fri Nov 12 16:17:43 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:17:43 +0100 Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com> Message-ID: <382C2F97.8E7D7A4D@lemburg.com> Fredrik Lundh wrote: > > > > > Besides, the Unicode object will have a buffer containing the > > > > <default encoding> representation of the object, which, if all goes > > > > well, will always hold the UTF-8 value. > > > > > > <rant> > > > > > > over my dead body, that one... > > > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > why? I don't understand why "s" and "s#" has > to deal with encoding issues at all... > > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > hmm. maybe that's exactly what I want... If we don't add that support, lot's of existing APIs won't accept Unicode object instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 16:50:33 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:50:33 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> Message-ID: <382C3749.198EEBC6@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Access to this mark will go into sys: sys.bom. > > Can the name in sys be a little more descriptive? > sys.byte_order_mark would be reasonable. The abbreviation BOM is quite common w/r to Unicode. > I think that a support module (possibly unicodec) should provide > constants for all four byte order marks as strings (2- & 4-byte, > little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, > etc. Good idea... sys.bom should return the byte order mark (BOM) for the format used internally. The unicodec module should provide symbols for all possible values of this variable: BOM_BE: '\376\377' (corresponds to Unicode 0x0000FEFF in UTF-16 == ZERO WIDTH NO-BREAK SPACE) BOM_LE: '\377\376' (corresponds to Unicode 0x0000FFFE in UTF-16 == illegal Unicode character) BOM4_BE: '\000\000\377\376' (corresponds to Unicode 0x0000FEFF in UCS-4) BOM4_LE: '\376\377\000\000' (corresponds to Unicode 0x0000FFFE in UCS-4) Note that Unicode sees big endian byte order as being "correct". The swapped order is taken to be an indicator for a "wrong" format, hence the illegal character definition. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 12 16:24:33 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 16:24:33 +0100 Subject: [Python-Dev] just say no... 
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> Message-ID: <382C3131.A8965CA5@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Such a buffer is needed to implement "s" and "s#" argument > > parsing. It's a simple requirement to support those two > > parsing markers -- there's not much to argue about, really... > > unless, of course, you want to give up Unicode object support > > for all APIs using these parsers. > > Perhaps I missed the agreement that these should always receive > UTF-8 from Unicode strings. Was this agreed upon, or has it simply > not been argued over in favor of other topics? It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing script Unicode aware. > If this has indeed been agreed upon... at least it can be computed > on demand rather than at initialization! This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s"). > Perhaps there should be two > pointers: one to the UTF-8 buffer and one to a PyObject; if the > PyObject is there it's a "old-style" string that's actually providing > the buffer. This may or may not be a good idea; there's a lot of > memory expense for long Unicode strings converted from UTF-8 that > aren't ever converted back to UTF-8 or accessed using "s" or "s#". > Ok, I've talked myself out of that. ;-) Note that Unicode object are completely different beast ;-) String object are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Fri Nov 12 17:22:24 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The abbreviation BOM is quite common w/r to Unicode. Yes: "w/r to Unicode". In sys, it's out of context and should receive a more descriptive name. I think using BOM in unicodec is good. > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) I'd also add BOM to be the same as sys.byte_order_mark. Perhaps even instead of sys.byte_order_mark (just to localize the areas of code that are affected). > Note that Unicode sees big endian byte order as being "correct". The A lot of us do. ;-) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Fri Nov 12 17:28:37 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST) Subject: [Python-Dev] just say no... 
In-Reply-To: <382C3131.A8965CA5@lemburg.com> References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > It's been in the proposal since version 0.1. The idea is to > provide a decent way of making existing script Unicode aware. Ok, so I haven't read closely enough. > This is what I intended to implement. The <defencbuf> buffer > will be filled upon the first request to the UTF-8 encoding. > "s" and "s#" are examples of such requests. The buffer will > remain intact until the object is destroyed (since other code > could store the pointer received via e.g. "s"). Right. > Note that Unicode object are completely different beast ;-) > String object are not touched in any way by the proposal. I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider: s = fp.read() u = unicode(s, 'utf-8') u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From jack at oratrix.nl Fri Nov 12 17:33:46 1999 From: jack at oratrix.nl (Jack Jansen) Date: Fri, 12 Nov 1999 17:33:46 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com> Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl> The problem with "s" and "s#" is that they're already semantically overloaded, and will become more so with support for multiple charsets. Some modules use "s#" when they mean "give me a pointer to an area of memory and its length". Writing to binary files is an example of this. Some modules use it to mean "give me a pointer to a string". Writing to a text file is (probably) an example of this. Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This is the case if we're going to actually look at the contents (think of string.upper() and such). I think that the only real solution is to define what "s" means, come up with new getarg-formats for the other two use cases and convert all modules to use the new standard. It'll still cause grief to extension modules that aren't part of the core, but at least the problem will go away after a while. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Fri Nov 12 19:36:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:36:55 +0100 Subject: [Python-Dev] just say no... References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us> <382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <382C5E47.21FB4DD@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > It's been in the proposal since version 0.1. 
The idea is to > > provide a decent way of making existing script Unicode aware. > > Ok, so I haven't read closely enough. > > > This is what I intended to implement. The <defencbuf> buffer > > will be filled upon the first request to the UTF-8 encoding. > > "s" and "s#" are examples of such requests. The buffer will > > remain intact until the object is destroyed (since other code > > could store the pointer received via e.g. "s"). > > Right. > > > Note that Unicode object are completely different beast ;-) > > String object are not touched in any way by the proposal. > > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Fri Nov 12 23:19:15 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121417530.27203-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Mark Hammond wrote: > Couldnt we start with Fredriks implementation, and see how the rest > turns out? Even if we do choose to change the underlying Unicode > implementation to use a different native encoding, the interface to > the PyUnicode_Type would remain pretty similar. The advantage is that > we have something now to start working with for the rest of the > support we need. I agree with "start with" here, and will go one step further (which Mark may have implied) -- *check in* Fredrik's code. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 12 23:59:03 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C11FE.D7D9F916@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: > > > Besides, the Unicode object will have a buffer containing the > > > <default encoding> representation of the object, which, if all goes > > > well, will always hold the UTF-8 value. > > > > <rant> > > > > over my dead body, that one... > > Such a buffer is needed to implement "s" and "s#" argument > parsing. It's a simple requirement to support those two > parsing markers -- there's not much to argue about, really... > unless, of course, you want to give up Unicode object support > for all APIs using these parsers. Bull! You can easily support "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxilliary buffer. 
You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:05:11 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl> Message-ID: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> This was done last year!! We have "s#" meaning "give me some bytes." We have "t#" meaning "give me some 8-bit characters." The Python distribution has been completely updated to use the appropriate format in each call. The was done *specifically* to support the introduction of a Unicode type. The intent was that "s#" returns the *raw* bytes of the Unicode string -- NOT a UTF-8 encoding! As a separate argument, MAL can argue that "t#" should create an internal, associated buffer to hold a UTF-8 encoding and then return that. But the "s#" should return the raw bytes! [ and I'll argue against the response to "t#" anyhow... ] -g On Fri, 12 Nov 1999, Jack Jansen wrote: > The problem with "s" and "s#" is that they're already semantically > overloaded, and will become more so with support for multiple charsets. > > Some modules use "s#" when they mean "give me a pointer to an area of memory > and its length". Writing to binary files is an example of this. > > Some modules use it to mean "give me a pointer to a string". Writing to a text > file is (probably) an example of this. > > Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This > is the case if we're going to actually look at the contents (think of > string.upper() and such). > > I think that the only real solution is to define what "s" means, come up with > new getarg-formats for the other two use cases and convert all modules to use > the new standard. It'll still cause grief to extension modules that aren't > part of the core, but at least the problem will go away after a while. > -- > Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ > Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ > www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm > > > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev > -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:09:13 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com> Message-ID: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > Fredrik Lundh wrote: >... > > why? I don't understand why "s" and "s#" has > > to deal with encoding issues at all... > > > > > unless, of course, you want to give up Unicode object support > > > for all APIs using these parsers. > > > > hmm. maybe that's exactly what I want... > > If we don't add that support, lot's of existing APIs won't > accept Unicode object instead of strings. While it could be > argued that automatic conversion to UTF-8 is not transparent > enough for the user, the other solution of using str(u) > everywhere would probably make writing Unicode-aware code a > rather clumsy task and introduce other pitfalls, since str(obj) > calls PyObject_Str() which also works on integers, floats, > etc. No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. 
They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 13 00:26:08 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. True. > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. I agree and believe that we can avoid putting it into sys altogether. > > BOM_BE: '\376\377' > > (corresponds to Unicode 0x0000FEFF in UTF-16 > > == ZERO WIDTH NO-BREAK SPACE) Are you sure about that interpretation? I thought the BOM characters (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. > I'd also add BOM to be the same as sys.byte_order_mark. Perhaps > even instead of sys.byte_order_mark (just to localize the areas of > code that are affected). ### unicodec.py ### import struct BOM = struct.pack('h', 0x0000FEFF) BOM_BE = '\376\377' ... If somebody needs the BOM, then they should go to unicodec.py (or some other module). I do not believe we need to put that stuff into the sys module. It is just too easy to create the value in Python. Cheers, -g p.s. to be pedantic, the pack() format could be '@h' -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Sat Nov 13 00:41:16 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 10:41:16 +1100 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat> [Greg writes] > As a separate argument, MAL can argue that "t#" should create > an internal, > associated buffer to hold a UTF-8 encoding and then return > that. But the > "s#" should return the raw bytes! > [ and I'll argue against the response to "t#" anyhow... ] Hmm. Climbing over these dead bodies could get a bit smelly :-) Im inclined to agree that holding 2 internal buffers for the unicode object is not ideal. However, I _am_ concerned with getting decent PyArg_ParseTuple and Py_BuildValue support, and if the cost is an extra buffer I will survive. So lets look for solutions that dont require it, rather than holding it up as evil when no other solution is obvious. My requirements appear to me to be very simple (for an anglophile): Lets say I have a platform Unicode value - eg, I got a Unicode value from some external library (say COM :-) Lets assume for now that the Unicode string is fully representable as ASCII - say a file or directory name that COM gave me. I simply want to be able to pass this Unicode object to "open()", and have it work. This assumes that open() will not become "native unicode", simply as the underlying C support is not unicode aware - it needs to be converted to a "char *" (ie, will use the "t#" format) The second side of the equation is when I expose a Python function that talks Unicode - eg, I need to _pass_ a platform Unicode value to an external library. 
The Python programmer should be able to pass a Unicode object (no problem), or a PyString object. In code terms: Prob1: name = SomeComObject.GetFileName() # A Unicode object f = open(name) Prob2: SomeComObject.SetFileName("foo.txt") IMO it is important that we have a good strategy for dealing with this for extensions. MAL addresses one direction, but not the other. Maybe if we toss around general solutions for this the implementation will fall out. MALs idea of the additional buffer starts to address this, but isnt the whole story. Any ideas on this? From gstein at lyra.org Sat Nov 13 01:49:34 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST) Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Mark Hammond wrote: >... > Im inclined to agree that holding 2 internal buffers for the unicode > object is not ideal. However, I _am_ concerned with getting decent > PyArg_ParseTuple and Py_BuildValue support, and if the cost is an > extra buffer I will survive. So lets look for solutions that dont > require it, rather than holding it up as evil when no other solution > is obvious. I believe Py_BuildValue is pretty straight-forward. Simply state that it is allowed to perform conversions and place the resulting object into the resulting tuple. (with appropriate refcounting) In other words: tuple = Py_BuildValue("U", stringOb); The stringOb will be converted to a Unicode object. The new Unicode object will go into the tuple (with the tuple holding the only reference!). The stringOb will NOT acquire any additional references. [ "U" format may be wrong; it is here for example purposes ] Okay... now the PyArg_ParseTuple() is the *real* kicker. >... > Prob1: > name = SomeComObject.GetFileName() # A Unicode object > f = open(name) > Prob2: > SomeComObject.SetFileName("foo.txt") Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a string-like object which can be passed to the OS as an 8-bit string. In Prob2, you want a string-like object which can be passed to the OS as a Unicode string. I see three options for PyArg_ParseTuple: 1) allow it to return NEW objects which must be DECREF'd. [ current policy only loans out references ] This option could be difficult in the presence of errors during the parse. For example, the current idiom is: if (!PyArg_ParseTuple(args, "...")) return NULL; If an object was produced, but then a later argument cause a failure, then who is responsible for freeing the object? 2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new objects when an error occurred. This basically answers the last question in option (1) -- ParseTuple is responsible. 3) Return loaned-out-references to objects which have been tested for convertability. Helper functions perform the conversion and the caller will then free the reference. [ this is the model used in PyWin32 ] Code in PyWin32 typically looks like: if (!PyArg_ParseTuple(args, "O", &ob)) return NULL; if ((unicodeOb = GiveMeUnicode(ob)) == NULL) return NULL; ... Py_DECREF(unicodeOb); [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ] In a "real" situation, the ParseTuple format would be "U" and the object would be type-tested for PyStringType or PyUnicodeType. Note that GiveMeUnicode() would also do a type-test, but it can't produce a *specific* error like ParseTuple (e.g. 
"string/unicode object expected" vs "parameter 3 must be a string/unicode object") Are there more options? Anybody? All three of these avoid the secondary buffer. The last is cleanest w.r.t. to keeping the existing "loaned references" behavior, but can get a bit wordy when you need to convert a bunch of string arguments. Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it would need to keep a "free list" in case an error occurred. Option (1) adds DECREF logic to callers to ensure they clean up. The add'l logic isn't much more than the other two options (the only change is adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..." condition). Note that the caller would probably need to initialize each object to NULL before calling ParseTuple. Personally, I prefer (3) as it makes it very clear that a new object has been created and must be DECREF'd at some point. Also note that GiveMeUnicode() could also accept a second argument for the type of decoding to do (or NULL meaning "UTF-8"). Oh: note there are equivalents of all options for going from unicode-to-string; the above is all about string-to-unicode. However, the tricky part of unicode-to-string is determining whether backwards compatibility will be a requirement. i.e. does existing code that uses the "t" format suddenly achieve the capability to accept a Unicode object? This obviously causes problems in all three options: since a new reference must be created to handle the situation, then who DECREF's it? The old code certainly doesn't. [ <IMO> I'm with Fredrik in saying "no, old code *doesn't* suddenly get the ability to accept a Unicode object." The Python code must use str() to do the encoding manually (until the old code is upgraded to one of the above three options). </IMO> ] I think that's it for me. In the several years I've been thinking on this problem, I haven't come up with anything but the above three. There may be a whole new paradigm for argument parsing, but I haven't tried to think on that one (and just fit in around ParseTuple). Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Fri Nov 12 19:49:52 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 12 Nov 1999 19:49:52 +0100 Subject: [Python-Dev] Internationalization Toolkit References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> Message-ID: <382C6150.53BDC803@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > The abbreviation BOM is quite common w/r to Unicode. > > Yes: "w/r to Unicode". In sys, it's out of context and should > receive a more descriptive name. I think using BOM in unicodec is > good. Guido proposed to add it to sys. I originally had it defined in unicodec. Perhaps a sys.endian would be more appropriate for sys with values 'little' and 'big' or '<' and '>' to be conform to the struct module. unicodec could then define unicodec.bom depending on the setting in sys. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Sat Nov 13 10:37:35 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 10:37:35 +0100 Subject: [Python-Dev] just say no... 
References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> Message-ID: <382D315F.A7ADEC42@lemburg.com> Greg Stein wrote: > > On Fri, 12 Nov 1999, M.-A. Lemburg wrote: > > Fredrik Lundh wrote: > >... > > > why? I don't understand why "s" and "s#" has > > > to deal with encoding issues at all... > > > > > > > unless, of course, you want to give up Unicode object support > > > > for all APIs using these parsers. > > > > > > hmm. maybe that's exactly what I want... > > > > If we don't add that support, lot's of existing APIs won't > > accept Unicode object instead of strings. While it could be > > argued that automatic conversion to UTF-8 is not transparent > > enough for the user, the other solution of using str(u) > > everywhere would probably make writing Unicode-aware code a > > rather clumsy task and introduce other pitfalls, since str(obj) > > calls PyObject_Str() which also works on integers, floats, > > etc. > > No no no... > > "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are > supposed to return the raw bytes. [I've waited quite some time for you to chime in on this one ;-)] Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer: First, we have a general design question here: should old code become Unicode compatible or not. As I recall the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come with a resonable approach for integrating classical string based APIs with the new type. Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose. However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using a 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text. So if we are to do the integration, we'll have to choose argument parser markers that allow for multi byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is argueable. Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would loose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes. Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is. Defining "s#" to return UTF-8 data does not only make "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16). If you are still with, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). 
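To make the lifetime point concrete, here is a small self-contained toy in plain C -- a stand-in for the <defencbuf> idea, with all names invented and an ASCII-only "encoding" instead of real UTF-8, so this is only a sketch of the mechanism, not the proposal's actual code. The 8-bit buffer is built on the first request and then lives exactly as long as the owning object, so a pointer handed out via "s"/"s#" stays valid until the object is deallocated:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        unsigned short *data;   /* stand-in for the internal 16-bit form */
        int length;             /* number of 16-bit code units */
        char *defenc;           /* lazily built 8-bit buffer, or NULL */
    } ToyUnicode;

    static const char *toy_default_encoding(ToyUnicode *u)
    {
        int i;
        if (u->defenc == NULL) {                /* first request: build it */
            u->defenc = malloc(u->length + 1);  /* toy ASCII-only encoding */
            if (u->defenc == NULL)
                return NULL;
            for (i = 0; i < u->length; i++)
                u->defenc[i] = (char)(u->data[i] & 0x7F);
            u->defenc[u->length] = '\0';
        }
        return u->defenc;          /* stays valid until toy_dealloc() */
    }

    static void toy_dealloc(ToyUnicode *u)
    {
        free(u->defenc);           /* the buffer dies with the object */
        u->defenc = NULL;
    }

    int main(void)
    {
        unsigned short raw[5];
        ToyUnicode u;
        raw[0] = 'h'; raw[1] = 'e'; raw[2] = 'l'; raw[3] = 'l'; raw[4] = 'o';
        u.data = raw; u.length = 5; u.defenc = NULL;
        printf("%s\n", toy_default_encoding(&u));
        toy_dealloc(&u);
        return 0;
    }

The real buffer would of course hold the <default encoding> (UTF-8) form rather than this ASCII stand-in.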
The other options of adding new magic parser markers or switching to more generic one all have one downside: you need to change existing code which is in conflict with the idea we started out with. So, again, the question is: do we want this magical intergration or not ? Note that this is a design question, not one of memory consumption... -- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though: PyArg_ParseTuple(args,"s#",&utf8,&len); If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach: PyArg_ParseTuple(args,"O",&obj); unicode = PyUnicode_FromObject(obj); ... Py_DECREF(unicode); Here PyUnicode_FromObject() will return a new reference if obj is an Unicode object or create a new Unicode object by interpreting str(obj) as UTF-8 encoded string. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Sat Nov 13 13:12:41 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 07:12:41 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST." <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911121456370.2535-100000@nebula.lyra.org> Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > to deal with :-) I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Sat Nov 13 13:23:54 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911130423400.2535-100000@nebula.lyra.org> On Sat, 13 Nov 1999, Guido van Rossum wrote: > > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies > > to deal with :-) > > I haven't made up my mind yet (due to a very successful > Python-promoting visit to SD'99 east, I'm about 100 msgs behind in > this thread alone) but let me warn you that I can deal with the > carnage, if necessary. :-) Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Sat Nov 13 13:52:18 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 13 Nov 1999 23:52:18 +1100 Subject: [Python-Dev] argument parsing (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911121615170.2535-100000@nebula.lyra.org> Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat> [Lamenting about PyArg_ParseTuple and managing memory buffers for String/Unicode conversions.] So what is really wrong with Marc's proposal about the extra pointer on the Unicode object? And to double the carnage, who not add the equivilent native Unicode buffer to the PyString object? 
These would only ever be filled when requested by the conversion routines. They have no other effect than their memory is managed by the object itself; simply a convenience to avoid having extension modules manage the conversion buffers. The only overheads appear to be: * The conversion buffers may be slightly (or much :-) longer-lived - ie, they are not freed until the object itself is freed. * String object slightly bigger, and slightly slower to destroy. It appears to solve the problems, and the cost doesnt seem too high... Mark. From guido at CNRI.Reston.VA.US Sat Nov 13 14:06:26 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 08:06:26 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100." <382D315F.A7ADEC42@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us> I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL] > Let me summarize a bit on the general ideas behind "s", "s#" > and the extra buffer: I think you left out t#. > First, we have a general design question here: should old code > become Unicode compatible or not. As I recall the original idea > about Unicode integration was to follow Perl's idea to have > scripts become Unicode aware by simply adding a 'use utf8;'. I've never heard of this idea before -- or am I taking it too literal? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide. > If this is still the case, then we'll have to come with a > resonable approach for integrating classical string based > APIs with the new type. > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > the Latin-1 folks) which has some very nice features (see > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > this encoding seems best fit for the purpose. Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week, hopefully when I see the details, it'll be clear that don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.) > However, one should not forget that UTF-8 is in fact a > variable length encoding of Unicode characters, that is up to > 3 bytes form a *single* character. This is obviously not compatible > with definitions that explicitly state data to be using a > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > work like it does in Latin-1 text. Sure, but where in current Python are there such requirements? > So if we are to do the integration, we'll have to choose > argument parser markers that allow for multi byte characters. > "t#" does not fall into this category, "s#" certainly does, > "s" is argueable. I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). 
Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#. Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes. Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation. Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best): - utf-8 - latin-1 - ascii - shift-jis - lower byte of unicode ordinal - some user- or os-specified multibyte encoding As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8. If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.) > Also note that we have to watch out for embedded NULL bytes. > UTF-16 has NULL bytes for every character from the Latin-1 > domain. If "s" were to give back a pointer to the internal > buffer which is encoded in UTF-16, you would loose data. > UTF-8 doesn't have this problem, since only NULL bytes > map to (single) NULL bytes. This is a red herring given my explanation above. > Now Greg would chime in with the buffer interface and > argue that it should make the underlying internal > format accessible. This is a bad idea, IMHO, since you > shouldn't really have to know what the internal data format > is. This is for C code. Quite likely it *does* know what the internal data format is! > Defining "s#" to return UTF-8 data does not only > make "s" and "s#" return the same data format (which should > always be the case, IMO), That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely. > but also hides the internal > format from the user and gives him a reliable cross-platform > data representation of Unicode data (note that UTF-8 doesn't > have the byte order problems of UTF-16). > > If you are still with, let's look at what "s" and "s#" (and t#, which is more relevant here) > do: they return pointers into data areas which have to > be kept alive until the corresponding object dies. 
> > The only way to support this feature is by allocating > a buffer for just this purpose (on the fly and only if > needed to prevent excessive memory load). The other > options of adding new magic parser markers or switching > to more generic one all have one downside: you need to > change existing code which is in conflict with the idea > we started out with. Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding. > So, again, the question is: do we want this magical > intergration or not ? Note that this is a design question, > not one of memory consumption... Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( ) > -- > > Ok, the above covered Unicode -> String conversion. Mark > mentioned that he wanted the other way around to also > work in the same fashion, ie. automatic String -> Unicode > conversion. > > This could also be done in the same way by > interpreting the string as UTF-8 encoded Unicode... but we > have the same problem: where to put the data without > generating new intermediate objects. Since only newly > written code will use this feature there is a way to do > this though: > > PyArg_ParseTuple(args,"s#",&utf8,&len); No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue. > If your C API understands UTF-8 there's nothing more to do, > if not, take Greg's option 3 approach: > > PyArg_ParseTuple(args,"O",&obj); > unicode = PyUnicode_FromObject(obj); > ... > Py_DECREF(unicode); > > Here PyUnicode_FromObject() will return a new > reference if obj is an Unicode object or create a new > Unicode object by interpreting str(obj) as UTF-8 encoded string. This might work. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Sat Nov 13 14:06:35 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sat, 13 Nov 1999 14:06:35 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.5 References: <382C0A54.E6E8328D@lemburg.com> Message-ID: <382D625B.DC14DBDE@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates proposals for line breaks, case mapping, character properties and private code points support. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? should Unicode objects support %-formatting ? One possibility would be to emulate this via strings and <default encoding>: s = '%s %i abc???' # a Latin-1 encoded string t = (u,3) # Convert Latin-1 s to a <default encoding> string s1 = unicode(s,'latin-1').encode() # The '%s' will now add u in <default encoding> s2 = s1 % t # Finally, convert the <default encoding> encoded string to Unicode u1 = unicode(s2) ? 
specifying file wrappers: Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack at oratrix.nl Sat Nov 13 17:40:34 1999 From: jack at oratrix.nl (Jack Jansen) Date: Sat, 13 Nov 1999 17:40:34 +0100 Subject: [Python-Dev] just say no... In-Reply-To: Message by Greg Stein <gstein@lyra.org> , Fri, 12 Nov 1999 15:05:11 -0800 (PST) , <Pine.LNX.4.10.9911121501460.2535-100000@nebula.lyra.org> Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Recently, Greg Stein <gstein at lyra.org> said: > This was done last year!! We have "s#" meaning "give me some bytes." We > have "t#" meaning "give me some 8-bit characters." The Python distribution > has been completely updated to use the appropriate format in each call. Oops... I remember the discussion but I wasn't aware that somone had actually _implemented_ this:-). Part of my misunderstanding was also caused by the fact that I inspected what I thought would be the prime candidate for t#: file.write() to a non-binary file, and it doesn't use the new format. I also noted a few inconsistencies at first glance, by the way: most modules seem to use s# for things like filenames and other data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an exception and it uses t# for uuencoded strings... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From guido at CNRI.Reston.VA.US Sat Nov 13 20:20:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Sat, 13 Nov 1999 14:20:51 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100." <19991113164039.9B697EA11A@oratrix.oratrix.nl> References: <19991113164039.9B697EA11A@oratrix.oratrix.nl> Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us> > I remember the discussion but I wasn't aware that somone had actually > _implemented_ this:-). Part of my misunderstanding was also caused by > the fact that I inspected what I thought would be the prime candidate > for t#: file.write() to a non-binary file, and it doesn't use the new > format. I guess that's because file.write() doesn't distinguish between text and binary files. Maybe it should: the current implementation together with my proposed semantics for Unicode strings would mean that printing a unicode string (to stdout) would dump the internal encoding to the file. I guess it should do so only when the file is opened in binary mode; for files opened in text mode it should use an encoding (opening a file can specify an encoding; can we change the encoding of an existing file?). > I also noted a few inconsistencies at first glance, by the way: most > modules seem to use s# for things like filenames and other > data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an > exception and it uses t# for uuencoded strings... Actually, binascii seems to do it right: s# for binary data, t# for text (uuencoded, hqx, base64). That is, the b2a variants use s# while the a2b variants use t#. 
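For illustration, a minimal extension module in the 1.5/1.6-era C API showing that convention -- "s#" where raw binary bytes are wanted, "t#" where 8-bit text is wanted. The module and function names are invented and the bodies just report the length; this is not binascii's actual source:

    #include "Python.h"

    static PyObject *
    example_b2a(PyObject *self, PyObject *args)
    {
        char *data;
        int len;
        if (!PyArg_ParseTuple(args, "s#", &data, &len))  /* raw binary bytes */
            return NULL;
        return PyInt_FromLong((long)len);   /* toy: just report the length */
    }

    static PyObject *
    example_a2b(PyObject *self, PyObject *args)
    {
        char *text;
        int len;
        if (!PyArg_ParseTuple(args, "t#", &text, &len))  /* 8-bit text */
            return NULL;
        return PyInt_FromLong((long)len);
    }

    static PyMethodDef example_methods[] = {
        {"b2a_example", example_b2a, METH_VARARGS},
        {"a2b_example", example_a2b, METH_VARARGS},
        {NULL, NULL}
    };

    void
    initexample(void)
    {
        Py_InitModule("example", example_methods);
    }

Under the proposal being discussed, only the "t#" side would ever see an encoded form of a Unicode argument; the "s#" side keeps seeing raw bytes.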
The only thing I'm not sure about in that module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I don't understand where these stand in the complexity of binhex en/decoding. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Sun Nov 14 23:11:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Sun, 14 Nov 1999 23:11:54 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> Message-ID: <382F33AA.C3EE825A@lemburg.com> Guido van Rossum wrote: > > I think I have a reasonable grasp of the issues here, even though I > still haven't read about 100 msgs in this thread. Note that t# and > the charbuffer addition to the buffer API were added by Greg Stein > with my support; I'll attempt to reconstruct our thinking at the > time... > > [MAL] > > Let me summarize a bit on the general ideas behind "s", "s#" > > and the extra buffer: > > I think you left out t#. On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined. > > First, we have a general design question here: should old code > > become Unicode compatible or not. As I recall the original idea > > about Unicode integration was to follow Perl's idea to have > > scripts become Unicode aware by simply adding a 'use utf8;'. > > I've never heard of this idea before -- or am I taking it too literal? > It smells of a mode to me :-) I'd rather live in a world where > Unicode just works as long as you use u'...' literals or whatever > convention we decide. > > > If this is still the case, then we'll have to come with a > > resonable approach for integrating classical string based > > APIs with the new type. > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > the Latin-1 folks) which has some very nice features (see > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > this encoding seems best fit for the purpose. > > Yes, especially if we fix the default encoding as UTF-8. (I'm > expecting feedback from HP on this next week, hopefully when I see the > details, it'll be clear that don't need a per-thread default encoding > to solve their problems; that's quite a likely outcome. If not, we > have a real-world argument for allowing a variable default encoding, > without carnage.) Fair enough :-) > > However, one should not forget that UTF-8 is in fact a > > variable length encoding of Unicode characters, that is up to > > 3 bytes form a *single* character. This is obviously not compatible > > with definitions that explicitly state data to be using a > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > work like it does in Latin-1 text. > > Sure, but where in current Python are there such requirements? It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at... 
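A tiny stand-alone illustration of that point -- byte counts and character counts diverge under UTF-8 as soon as non-ASCII data is involved. The string literals below are simply the Latin-1 and UTF-8 encodings of U+00E4 U+00F6 U+00FC:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *latin1 = "\xe4\xf6\xfc";             /* 3 chars, 3 bytes */
        const char *utf8   = "\xc3\xa4\xc3\xb6\xc3\xbc"; /* same 3 chars, 6 bytes */
        printf("Latin-1 bytes: %lu, UTF-8 bytes: %lu, characters: 3\n",
               (unsigned long)strlen(latin1), (unsigned long)strlen(utf8));
        return 0;
    }

So a length obtained through "s#"/"t#" on such data is a byte count, not a character count, and byte indexing no longer lines up with character positions.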
> > So if we are to do the integration, we'll have to choose > > argument parser markers that allow for multi byte characters. > > "t#" does not fall into this category, "s#" certainly does, > > "s" is argueable. > > I disagree. I grepped through the source for s# and t#. Here's a bit > of background. Before t# was introduced, s# was being used for two > distinct purposes: (1) to get an 8-bit text string plus its length, in > situations where the length was needed; (2) to get binary data (e.g. > GIF data read from a file in "rb" mode). Greg pointed out that if we > ever introduced some form of Unicode support, these two had to be > disambiguated. We found that the majority of uses was for (2)! > Therefore we decided to change the definition of s# to mean only (2), > and introduced t# to mean (1). Also, we introduced getcharbuffer > corresponding to t#, while getreadbuffer was meant for s#. I know its too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementations point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something... > Note that the definition of the 's' format was left alone -- as > before, it means you need an 8-bit text string not containing null > bytes. This definition should then be changed to "text string without null bytes" dropping the 8-bit reference. > Our expectation was that a Unicode string passed to an s# situation > would give a pointer to the internal format plus a byte count (not a > character count!) while t# would get a pointer to some kind of 8-bit > translation/encoding plus a byte count, with the explicit requirement > that the 8-bit translation would have the same lifetime as the > original unicode object. We decided to leave it up to the next > generation (i.e., Marc-Andre :-) to decide what kind of translation to > use and what to do when there is no reasonable translation. Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal). > Any of the following choices is acceptable (from the point of view of > not breaking the intended t# semantics; we can now start deciding > which we like best): I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons. > - utf-8 > - latin-1 > - ascii > - shift-jis > - lower byte of unicode ordinal > - some user- or os-specified multibyte encoding > > As far as t# is concerned, for encodings that don't encode all of > Unicode, untranslatable characters could be dealt with in any number > of ways (raise an exception, ignore, replace with '?', make best > effort, etc.). The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.) > Given the current context, it should probably be the same as the > default encoding -- i.e., utf-8. 
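For what it's worth, a small stand-alone snippet showing why dumping the internal UTF-16 form would be endian dependent (and full of NUL bytes for Latin-1 text), while the UTF-8 form is the same on every platform. The byte values are just the three encodings of U+00E4:

    #include <stdio.h>

    int main(void)
    {
        unsigned char utf16_le[2] = {0xE4, 0x00};   /* U+00E4, little endian */
        unsigned char utf16_be[2] = {0x00, 0xE4};   /* U+00E4, big endian */
        unsigned char utf8[2]     = {0xC3, 0xA4};   /* U+00E4, UTF-8 */
        printf("UTF-16LE: %02X %02X\n", utf16_le[0], utf16_le[1]);
        printf("UTF-16BE: %02X %02X\n", utf16_be[0], utf16_be[1]);
        printf("UTF-8:    %02X %02X\n", utf8[0], utf8[1]);
        return 0;
    }
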
If we end up making the default > user-settable, we'll have to decide what to do with untranslatable > characters -- but that will probably be decided by the user too (it > would be a property of a specific translation specification). > > In any case, I feel that t# could receive a multi-byte encoding, > s# should receive raw binary data, and they should correspond to > getcharbuffer and getreadbuffer, respectively. Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below). > > Now Greg would chime in with the buffer interface and > > argue that it should make the underlying internal > > format accessible. This is a bad idea, IMHO, since you > > shouldn't really have to know what the internal data format > > is. > > This is for C code. Quite likely it *does* know what the internal > data format is! C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length. > > Defining "s#" to return UTF-8 data does not only > > make "s" and "s#" return the same data format (which should > > always be the case, IMO), > > That was before t# was introduced. No more, alas. If you replace s# > with t#, I agree with you completely. Done :-) > > but also hides the internal > > format from the user and gives him a reliable cross-platform > > data representation of Unicode data (note that UTF-8 doesn't > > have the byte order problems of UTF-16). > > > > If you are still with, let's look at what "s" and "s#" > > (and t#, which is more relevant here) > > > do: they return pointers into data areas which have to > > be kept alive until the corresponding object dies. > > > > The only way to support this feature is by allocating > > a buffer for just this purpose (on the fly and only if > > needed to prevent excessive memory load). The other > > options of adding new magic parser markers or switching > > to more generic one all have one downside: you need to > > change existing code which is in conflict with the idea > > we started out with. > > Agreed. I think this was our thinking when Greg & I introduced t#. > My own preference would be to allocate a whole string object, not > just a buffer; this could then also be used for the .encode() method > using the default encoding. Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request. > > So, again, the question is: do we want this magical > > intergration or not ? Note that this is a design question, > > not one of memory consumption... > > Yes, I want it. > > Note that this doesn't guarantee that all old extensions will work > flawlessly when passed Unicode objects; but I think that it covers > most cases where you could have a reasonable expectation that it > works. > > (Hm, unfortunately many reasonable expectations seem to involve > the current user's preferred encoding. 
:-( ) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From amk1 at erols.com Mon Nov 15 02:49:08 1999 From: amk1 at erols.com (A.M. Kuchling) Date: Sun, 14 Nov 1999 20:49:08 -0500 Subject: [Python-Dev] PyErr_Format security note Message-ID: <199911150149.UAA00408@mira.erols.com> I noticed this in PyErr_Format(exception, format, va_alist): char buffer[500]; /* Caller is responsible for limiting the format */ ... vsprintf(buffer, format, vargs); Making the caller responsible for this is error-prone. The danger, of course, is a buffer overflow caused by generating an error string that's larger than the buffer, possibly letting people execute arbitrary code. We could add a test to the configure script for vsnprintf() and use it when possible, but that only fixes the problem on platforms which have it. Can we find an implementation of vsnprintf() someplace? -- A.M. Kuchling http://starship.python.net/crew/amk/ One form to rule them all, one form to find them, one form to bring them all and in the darkness rewrite the hell out of them. -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3 From gstein at lyra.org Mon Nov 15 03:11:39 1999 From: gstein at lyra.org (Greg Stein) Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911150149.UAA00408@mira.erols.com> Message-ID: <Pine.LNX.4.10.9911141807390.2535-100000@nebula.lyra.org> On Sun, 14 Nov 1999, A.M. Kuchling wrote: > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Apache has a safe implementation (they have reviewed the heck out of it for obvious reasons :-). In the Apache source distribution, it is located in src/ap/ap_snprintf.c. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 15 09:09:07 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 09:09:07 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <382FBFA3.B28B8E1E@lemburg.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? 
In sysmodule.c, this check is done which should be safe enough since no "return" is issued (Py_FatalError() does an abort()): if (vsprintf(buffer, format, va) >= sizeof(buffer)) Py_FatalError("PySys_WriteStdout/err: buffer overrun"); -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gstein at lyra.org Mon Nov 15 10:28:06 1999 From: gstein at lyra.org (Greg Stein) Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com> Message-ID: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: >... > In sysmodule.c, this check is done which should be safe enough > since no "return" is issued (Py_FatalError() does an abort()): > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); I believe the return from vsprintf() itself would be the problem. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Mon Nov 15 10:49:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 10:49:26 +0100 Subject: [Python-Dev] PyErr_Format security note References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> Message-ID: <382FD726.6ACB912F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > >... > > In sysmodule.c, this check is done which should be safe enough > > since no "return" is issued (Py_FatalError() does an abort()): > > > > if (vsprintf(buffer, format, va) >= sizeof(buffer)) > > Py_FatalError("PySys_WriteStdout/err: buffer overrun"); > > I believe the return from vsprintf() itself would be the problem. Ouch, yes, you are right... but who could exploit this security hole ? Since PyErr_Format() is only reachable for C code, only bad programming style in extensions could make it exploitable via user input. Wouldn't it be possible to assign thread globals for these functions to use ? These would live on the heap instead of on the stack and eliminate the buffer overrun possibilities (I guess -- I don't have any experience with these...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Mon Nov 15 16:17:58 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <382FD726.6ACB912F@lemburg.com> References: <Pine.LNX.4.10.9911150127320.2535-100000@nebula.lyra.org> <382FD726.6ACB912F@lemburg.com> Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us> M.-A. Lemburg writes: >Ouch, yes, you are right... but who could exploit this security >hole ? Since PyErr_Format() is only reachable for C code, only >bad programming style in extensions could make it exploitable >via user input. 99% of security holes arise out of carelessness, and besides, this buffer size doesn't seem to be documented in either api.tex or ext.tex. I'll look into borrowing Apache's implementation and modifying it into a varargs form. -- A.M. Kuchling http://starship.python.net/crew/amk/ I can also withstand considerably more G-force than most people, even though I do say so myself. 
-- The Doctor, in "The Ambassadors of Death" From guido at CNRI.Reston.VA.US Mon Nov 15 16:23:57 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:23:57 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST." <199911150149.UAA00408@mira.erols.com> References: <199911150149.UAA00408@mira.erols.com> Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us> > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. Agreed. The limit of 500 chars, while technically undocumented, is part of the specs for PyErr_Format (which is currently wholly undocumented). The current callers all have explicit precautions, but of course I agree that this is a potential danger. > The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? Assuming that Linux and Solaris have vsnprintf(), can't we just use the configure script to detect it, and issue a warning blaming the platform for those platforms that don't have it? That seems much simpler (from a maintenance perspective) than carrying our own implementation around (even if we can borrow the Apache version). --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Mon Nov 15 16:24:27 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST) Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C6150.53BDC803@lemburg.com> References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF24@ukhil704nts.hld.uk.fid-intl.com> <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com> <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us> <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us> <382C6150.53BDC803@lemburg.com> Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Guido proposed to add it to sys. I originally had it defined in > unicodec. Well, he clearly didn't ask me! ;-) > Perhaps a sys.endian would be more appropriate for sys > with values 'little' and 'big' or '<' and '>' to be conform > to the struct module. > > unicodec could then define unicodec.bom depending on the setting > in sys. This seems more reasonable, though I'd go with BOM instead of bom. But that's a style issue, so not so important. If your write bom, I'll write bom. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From captainrobbo at yahoo.com Mon Nov 15 16:30:45 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Some thoughts on the codecs... 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. 
This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution? Thanks, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From akuchlin at mems-exchange.org Mon Nov 15 16:36:47 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us> Guido van Rossum writes: >Assuming that Linux and Solaris have vsnprintf(), can't we just use >the configure script to detect it, and issue a warning blaming the >platform for those platforms that don't have it? 
That seems much But people using an already-installed Python binary won't see any such configure-time warning, and won't find out about the potential problem. Plus, how do people fix the problem on platforms that don't have vsnprintf() -- switch to Solaris or Linux? Not much of a solution. (vsnprintf() isn't ANSI C, though it's a common extension, so platforms that lack it aren't really deficient.) Hmm... could we maybe use Python's existing (string % vars) machinery? <think think> No, that seems to be hard, because it would want PyObjects, and we can't know what Python types to convert the varargs to, unless we parse the format string (at which point we may as well get a vsnprintf() implementation. -- A.M. Kuchling http://starship.python.net/crew/amk/ A successful tool is one that was used to do something undreamed of by its author. -- S.C. Johnson From guido at CNRI.Reston.VA.US Mon Nov 15 16:50:24 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 10:50:24 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100." <382F33AA.C3EE825A@lemburg.com> References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us> > On purpose -- according to my thinking. I see "t#" as an interface > to bf_getcharbuf which I understand as 8-bit character buffer... > UTF-8 is a multi byte encoding. It still is character data, but > not necessarily 8 bits in length (up to 24 bits are used). > > Anyway, I'm not really interested in having an argument about > this. If you say, "t#" fits the purpose, then that's fine with > me. Still, we should clearly define that "t#" returns > text data and "s#" binary data. Encoding, bit length, etc. should > explicitly remain left undefined. Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view. > > > First, we have a general design question here: should old code > > > become Unicode compatible or not. As I recall the original idea > > > about Unicode integration was to follow Perl's idea to have > > > scripts become Unicode aware by simply adding a 'use utf8;'. > > > > I've never heard of this idea before -- or am I taking it too literal? > > It smells of a mode to me :-) I'd rather live in a world where > > Unicode just works as long as you use u'...' literals or whatever > > convention we decide. > > > > > If this is still the case, then we'll have to come with a > > > resonable approach for integrating classical string based > > > APIs with the new type. > > > > > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. > > > the Latin-1 folks) which has some very nice features (see > > > http://czyborra.com/utf/ ) and which is a true extension of ASCII, > > > this encoding seems best fit for the purpose. > > > > Yes, especially if we fix the default encoding as UTF-8. (I'm > > expecting feedback from HP on this next week, hopefully when I see the > > details, it'll be clear that don't need a per-thread default encoding > > to solve their problems; that's quite a likely outcome. If not, we > > have a real-world argument for allowing a variable default encoding, > > without carnage.) 
> > Fair enough :-) > > > > However, one should not forget that UTF-8 is in fact a > > > variable length encoding of Unicode characters, that is up to > > > 3 bytes form a *single* character. This is obviously not compatible > > > with definitions that explicitly state data to be using a > > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't > > > work like it does in Latin-1 text. > > > > Sure, but where in current Python are there such requirements? > > It was my understanding that "t#" refers to single byte character > data. That's where the above arguments were aiming at... t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing. > > > So if we are to do the integration, we'll have to choose > > > argument parser markers that allow for multi byte characters. > > > "t#" does not fall into this category, "s#" certainly does, > > > "s" is argueable. > > > > I disagree. I grepped through the source for s# and t#. Here's a bit > > of background. Before t# was introduced, s# was being used for two > > distinct purposes: (1) to get an 8-bit text string plus its length, in > > situations where the length was needed; (2) to get binary data (e.g. > > GIF data read from a file in "rb" mode). Greg pointed out that if we > > ever introduced some form of Unicode support, these two had to be > > disambiguated. We found that the majority of uses was for (2)! > > Therefore we decided to change the definition of s# to mean only (2), > > and introduced t# to mean (1). Also, we introduced getcharbuffer > > corresponding to t#, while getreadbuffer was meant for s#. > > I know its too late now, but I can't really follow the arguments > here: in what ways are (1) and (2) different from the implementations > point of view ? If "t#" is to return UTF-8 then <length of the > buffer> will not equal <text length>, so both parser markers return > essentially the same information. The only difference would be > on the semantic side: (1) means: give me text data, while (2) does > not specify the data type. > > Perhaps I'm missing something... The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.) > > Note that the definition of the 's' format was left alone -- as > > before, it means you need an 8-bit text string not containing null > > bytes. > > This definition should then be changed to "text string without > null bytes" dropping the 8-bit reference. Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? 
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly? > > Our expectation was that a Unicode string passed to an s# situation > > would give a pointer to the internal format plus a byte count (not a > > character count!) while t# would get a pointer to some kind of 8-bit > > translation/encoding plus a byte count, with the explicit requirement > > that the 8-bit translation would have the same lifetime as the > > original unicode object. We decided to leave it up to the next > > generation (i.e., Marc-Andre :-) to decide what kind of translation to > > use and what to do when there is no reasonable translation. > > Hmm, I would strongly object to making "s#" return the internal > format. file.write() would then default to writing UTF-16 data > instead of UTF-8 data. This could result in strange errors > due to the UTF-16 format being endian dependent. But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode. > It would also break the symmetry between file.write(u) and > unicode(file.read()), since the default encoding is not used as > internal format for other reasons (see proposal). If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes and check for a BOM and then decide to choose bewteen 'utf-16-be' and 'utf-16-le'.) > > Any of the following choices is acceptable (from the point of view of > > not breaking the intended t# semantics; we can now start deciding > > which we like best): > > I think we have already agreed on using UTF-8 for the default > encoding. It has quite a few advantages. See > > http://czyborra.com/utf/ > > for a good overview of the pros and cons. Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means). > > - utf-8 > > - latin-1 > > - ascii > > - shift-jis > > - lower byte of unicode ordinal > > - some user- or os-specified multibyte encoding > > > > As far as t# is concerned, for encodings that don't encode all of > > Unicode, untranslatable characters could be dealt with in any number > > of ways (raise an exception, ignore, replace with '?', make best > > effort, etc.). > > The usual Python way would be: raise an exception. This is what > the proposal defines for Codecs in case an encoding/decoding > mapping is not possible, BTW. (UTF-8 will always succeed on > output.) Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings. > > Given the current context, it should probably be the same as the > > default encoding -- i.e., utf-8. 
If we end up making the default > > user-settable, we'll have to decide what to do with untranslatable > > characters -- but that will probably be decided by the user too (it > > would be a property of a specific translation specification). > > > > In any case, I feel that t# could receive a multi-byte encoding, > > s# should receive raw binary data, and they should correspond to > > getcharbuffer and getreadbuffer, respectively. > > Why would you want to have "s#" return the raw binary data for > Unicode objects ? Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits. > Note that it is not mentioned anywhere that > "s#" and "t#" do have to necessarily return different things > (binary being a superset of text). I'd opt for "s#" and "t#" both > returning UTF-8 data. This can be implemented by delegating the > buffer slots to the <defencstr> object (see below). This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this. > > > Now Greg would chime in with the buffer interface and > > > argue that it should make the underlying internal > > > format accessible. This is a bad idea, IMHO, since you > > > shouldn't really have to know what the internal data format > > > is. > > > > This is for C code. Quite likely it *does* know what the internal > > data format is! > > C code can use the PyUnicode_* APIs to access the data. I > don't think that argument parsing is powerful enough to > provide the C code with enough information about the data > contents, e.g. it can only state the encoding length, not the > string length. Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway). > > > Defining "s#" to return UTF-8 data does not only > > > make "s" and "s#" return the same data format (which should > > > always be the case, IMO), > > > > That was before t# was introduced. No more, alas. If you replace s# > > with t#, I agree with you completely. > > Done :-) > > > > but also hides the internal > > > format from the user and gives him a reliable cross-platform > > > data representation of Unicode data (note that UTF-8 doesn't > > > have the byte order problems of UTF-16). > > > > > > If you are still with, let's look at what "s" and "s#" > > > > (and t#, which is more relevant here) > > > > > do: they return pointers into data areas which have to > > > be kept alive until the corresponding object dies. > > > > > > The only way to support this feature is by allocating > > > a buffer for just this purpose (on the fly and only if > > > needed to prevent excessive memory load). The other > > > options of adding new magic parser markers or switching > > > to more generic one all have one downside: you need to > > > change existing code which is in conflict with the idea > > > we started out with. > > > > Agreed. I think this was our thinking when Greg & I introduced t#. > > My own preference would be to allocate a whole string object, not > > just a buffer; this could then also be used for the .encode() method > > using the default encoding. > > Good point. 
I'll change <defencbuf> to <defencstr>, a Python > string object created on request. > > > > So, again, the question is: do we want this magical > > > intergration or not ? Note that this is a design question, > > > not one of memory consumption... > > > > Yes, I want it. > > > > Note that this doesn't guarantee that all old extensions will work > > flawlessly when passed Unicode objects; but I think that it covers > > most cases where you could have a reasonable expectation that it > > works. > > > > (Hm, unfortunately many reasonable expectations seem to involve > > the current user's preferred encoding. :-( ) > > -- > Marc-Andre Lemburg --Guido van Rossum (home page: http://www.python.org/~guido/) From Mike.Da.Silva at uk.fid-intl.com Mon Nov 15 17:01:59 1999 From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike) Date: Mon, 15 Nov 1999 16:01:59 -0000 Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Andy Robinson wrote: 1. Stream interface At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me as the source (or target) is still a Unicode string in memory. This is a real problem - a filter to convert big files between two encodings should be possible without knowledge of the particular encoding, as should one on the input/output of some server. We can still give a default implementation for single-byte encodings. What's a good API for real stream conversion? just Codec.encodeStream(infile, outfile) ? or is it more useful to feed the codec with data a chunk at a time? A user defined chunking factor (suitably defaulted) would be useful for processing large files. 2. Data driven codecs I really like codecs being objects, and believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data driven rather making each one compiled C code with static mapping tables. What do people think about the approach below? First of all, the ISO8859-1 series are straight mappings to Unicode code points. So one Python script could parse these files and build the mapping table, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them. Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings which need a single-byte lookup table, a double-byte lookup table, and have some very simple rules about escape sequences in between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules. Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges; and very often there are relationships between encodings. For example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0XFE32". In these cases, a script can build a family of related codecs in an auditable manner. The problem here is that we need to decide whether we are Unicode-centric, or whether Unicode is just another supported encoding. 
If we are Unicode-centric, then all code-page translations will require static mapping tables between the appropriate Unicode character and the relevant code points in the other encoding. This would involve (worst case) 64k static tables for each supported encoding. Unfortunately this also precludes the use of algorithmic conversions and or sparse conversion tables because most of these transformations are relative to a source and target non-Unicode encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see CDRA), then we can mix and match approaches, and treat Unicode strings as just Unicode, and normal strings as being any arbitrary MBCS encoding. To guarantee the utmost interoperability and Unicode 3.0 (and beyond) compliance, we should probably assume that all core encodings are relative to Unicode as the pivot encoding. This should hopefully avoid any gotcha's with roundtrips between any two arbitrary native encodings. The downside is this will probably be slower than an optimised algorithmic transformation. 3. What encodings to distribute? The only clean answers to this are 'almost none', or 'everything that Unicode 3.0 has a mapping for'. The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org <http://www.python.org> ? Should there be an optional package outside the main distribution? Ship with Unicode encodings in the core, the rest should be an add on package. If we are truly Unicode-centric, this gives us the most value in terms of accessing a Unicode character properties database, which will provide language neutral case folding, Hankaku <----> Zenkaku folding (Japan specific), and composition / normalisation between composed characters and their component nonspacing characters. Regards, Mike da Silva From captainrobbo at yahoo.com Mon Nov 15 17:18:13 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST) Subject: [Python-Dev] just say no... Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com> --- Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > Did you read Andy Robinson's case study? He > suggested that for certain encodings there may be > other things you can do that are more > user-friendly than raising an exception, depending > on the application. I am proposing to leave this a > detail of each specific translation. > There may even be translations that do the same thing > except they have a different behavior for > untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version > that replaces bad characters with '?'. I think this > is one of the powers of having an extensible set of > encodings. This would be a desirable option in almost every case. Default is an exception (I want to know my data is not clean), but an option to specify an error character. It is usually a question mark but Mike tells me that some encodings specify the error character to use. Example - I query a Sybase Unicode database containing European accents or Japanese. By default it will give me question marks. If I issue the command 'set char_convert utf8', then I see the lot (as garbage, but never mind). If it always errored whenever a query result contained unexpected data, it would be almost impossible to maintain the database. 
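A minimal sketch of what such an option could look like (the names and signature here are hypothetical, not anything from the proposal):

    # Purely illustrative: a codec call with a selectable error policy --
    # 'strict' raises, 'replace' substitutes an error character.
    def encode_latin1(text, errors='strict'):
        out = []
        for ch in text:
            if ord(ch) < 256:
                out.append(chr(ord(ch)))
            elif errors == 'replace':
                out.append('?')    # or whatever error char the encoding defines
            else:
                raise ValueError('character %r not encodable' % (ch,))
        return ''.join(out)
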
If I wrote my own codec class for a family of encodings, I'd give it an even wider variety of error-logging options - maybe a mode where it told me where in the file the dodgy characters were. We've already taken the key step by allowing codecs to be separate objects registered at run-time, implemented in either C or Python. This means that once again Python will have the most flexible solution around. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From jim at digicool.com Mon Nov 15 17:29:13 1999 From: jim at digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:29:13 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> Message-ID: <383034D9.6E1E74D4@digicool.com> "A.M. Kuchling" wrote: > > I noticed this in PyErr_Format(exception, format, va_alist): > > char buffer[500]; /* Caller is responsible for limiting the format */ > ... > vsprintf(buffer, format, vargs); > > Making the caller responsible for this is error-prone. The danger, of > course, is a buffer overflow caused by generating an error string > that's larger than the buffer, possibly letting people execute > arbitrary code. We could add a test to the configure script for > vsnprintf() and use it when possible, but that only fixes the problem > on platforms which have it. Can we find an implementation of > vsnprintf() someplace? I would prefer to see a different interface altogether: PyObject *PyErr_StringFormat(errtype, format, buildformat, ...) So, you could generate an error like this: return PyErr_StringFormat(ErrorObject, "You had too many, %d, foos. The last one was %s", "iO", n, someObject) I implemented this in cPickle. See cPickle_ErrFormat. (Note that it always returns NULL.) Jim -- Jim Fulton mailto:jim at digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From bwarsaw at cnri.reston.va.us Mon Nov 15 17:54:10 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST) Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us> >>>>> "Guido" == Guido van Rossum <guido at cnri.reston.va.us> writes: Guido> Assuming that Linux and Solaris have vsnprintf(), can't we Guido> just use the configure script to detect it, and issue a Guido> warning blaming the platform for those platforms that don't Guido> have it? That seems much simpler (from a maintenance Guido> perspective) than carrying our own implementation around Guido> (even if we can borrow the Apache version). Mailman uses vsnprintf in it's C wrapper. There's a simple configure test... # Checks for library functions. AC_CHECK_FUNCS(vsnprintf) ...and for systems that don't have a vsnprintf, I modified a version from GNU screen. 
It may not have gone through the scrutiny of Apache's implementation, but for Mailman it was more important that it be GPL'd (not a Python requirement). -Barry From jim at digicool.com Mon Nov 15 17:56:38 1999 From: jim at digicool.com (Jim Fulton) Date: Mon, 15 Nov 1999 11:56:38 -0500 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us> Message-ID: <38303B46.F6AEEDF1@digicool.com> "Andrew M. Kuchling" wrote: > > Guido van Rossum writes: > >Assuming that Linux and Solaris have vsnprintf(), can't we just use > >the configure script to detect it, and issue a warning blaming the > >platform for those platforms that don't have it? That seems much > > But people using an already-installed Python binary won't see any such > configure-time warning, and won't find out about the potential > problem. Plus, how do people fix the problem on platforms that don't > have vsnprintf() -- switch to Solaris or Linux? Not much of a > solution. (vsnprintf() isn't ANSI C, though it's a common extension, > so platforms that lack it aren't really deficient.) > > Hmm... could we maybe use Python's existing (string % vars) machinery? > <think think> No, that seems to be hard, because it would want > PyObjects, and we can't know what Python types to convert the varargs > to, unless we parse the format string (at which point we may as well > get a vsnprintf() implementation. It's easy. You use two format strings. One a Python string format, and the other a Py_BuildValue format. See my other note. Jim -- Jim Fulton mailto:jim at digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From tismer at appliedbiometrics.com Mon Nov 15 18:02:20 1999 From: tismer at appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 18:02:20 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> Message-ID: <38303C9C.42C5C830@appliedbiometrics.com> Guido van Rossum wrote: > > > I noticed this in PyErr_Format(exception, format, va_alist): > > > > char buffer[500]; /* Caller is responsible for limiting the format */ > > ... > > vsprintf(buffer, format, vargs); > > > > Making the caller responsible for this is error-prone. > > Agreed. The limit of 500 chars, while technically undocumented, is > part of the specs for PyErr_Format (which is currently wholly > undocumented). The current callers all have explicit precautions, but > of course I agree that this is a potential danger. All but one (checked them all): In ceval.c, function call_builtin, there is a possible security hole. If an extension module happens to create a very long type name (maybe just via a bug), we will crash. } PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", func->ob_type->tp_name); return NULL; } ciao - chris -- Christian Tismer :^) <mailto:tismer at appliedbiometrics.com> Applied Biometrics GmbH : Have a break! 
Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From guido at CNRI.Reston.VA.US Mon Nov 15 20:32:00 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 14:32:00 -0500 Subject: [Python-Dev] PyErr_Format security note In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100." <38303C9C.42C5C830@appliedbiometrics.com> References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us> > All but one (checked them all): Thanks for checking. > In ceval.c, function call_builtin, there is a possible security hole. > If an extension module happens to create a very long type name > (maybe just via a bug), we will crash. > > } > PyErr_Format(PyExc_TypeError, "call of non-function (type %s)", > func->ob_type->tp_name); > return NULL; > } I would think that an extension module with a name of nearly 500 characters would draw a lot of attention as being ridiculous. If there was a bug through which you could make tp_name point to such a long string, you could probably exploit that bug without having to use this particular PyErr_Format() statement. However, I agree it's better to be safe than sorry, so I've checked in a fix making it %.400s. --Guido van Rossum (home page: http://www.python.org/~guido/) From tismer at appliedbiometrics.com Mon Nov 15 20:41:14 1999 From: tismer at appliedbiometrics.com (Christian Tismer) Date: Mon, 15 Nov 1999 20:41:14 +0100 Subject: [Python-Dev] PyErr_Format security note References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us> <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us> Message-ID: <383061DA.CA5CB373@appliedbiometrics.com> Guido van Rossum wrote: > > > All but one (checked them all): [ceval.c without limits] > I would think that an extension module with a name of nearly 500 > characters would draw a lot of attention as being ridiculous. If > there was a bug through which you could make tp_name point to such a > long string, you could probably exploit that bug without having to use > this particular PyErr_Format() statement. Of course this case is very unlikely. My primary intent was to create such a mess without an extension, and ExtensionClass seemed to be a candidate since it synthetizes a type name at runtime (!). This would have been dangerous since EC is in the heart of Zope. But, I could not get at this special case since EC always stands the class/instance checks and so this case can never happen :( The above lousy result was just to say *something* after no success. > However, I agree it's better to be safe than sorry, so I've checked in > a fix making it %.400s. cheap, consistent, fine - thanks - chris -- Christian Tismer :^) <mailto:tismer at appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net 10553 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF we're tired of banana software - shipped green, ripens at home From mal at lemburg.com Mon Nov 15 20:04:59 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Mon, 15 Nov 1999 20:04:59 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911121505210.2535-100000@nebula.lyra.org> <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us> <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us> Message-ID: <3830595B.348E8CC7@lemburg.com> Guido van Rossum wrote: > > [Misunderstanding in the reasoning behind "t#" and "s#"] > > Thanks for not picking an argument. Multibyte encodings typically > have ASCII as a subset (in such a way that an ASCII string is > represented as itself in bytes). This is the characteristic that's > needed in my view. > > > It was my understanding that "t#" refers to single byte character > > data. That's where the above arguments were aiming at... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > designed to be passed cleanly through processing steps that handle > single-byte character data, as long as they are 8-bit clean and don't > do too much processing. Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did. > > Perhaps I'm missing something... > > The idea is that (1)/s# disallows any translation of the data, while > (2)/t# requires translation of the data to an ASCII superset (possibly > multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data > contains text and that if the text consists of only ASCII characters > they are represented as themselves. (1)/s# makes no such assumption. > > In terms of implementation, Unicode objects should translate > themselves to the default encoding for t# (if possible), but they > should make the native representation available for s#. > > For example, take an encryption engine. While it is defined in terms > of byte streams, there's no requirement that the bytes represent > characters -- they could be the bytes of a GIF file, an MP3 file, or a > gzipped tar file. If we pass Unicode to an encryption engine, we want > Unicode to come out at the other end, not UTF-8. (If we had wanted to > encrypt UTF-8, we should have fed it UTF-8.) > > > > Note that the definition of the 's' format was left alone -- as > > > before, it means you need an 8-bit text string not containing null > > > bytes. > > > > This definition should then be changed to "text string without > > null bytes" dropping the 8-bit reference. > > Aha, I think there's a confusion about what "8-bit" means. For me, a > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? > (As far as I know, C uses char* to represent multibyte characters.) > Maybe we should disambiguate it more explicitly? There should be some definition for the two markers and the ideas behind them in the API guide, I guess. > > Hmm, I would strongly object to making "s#" return the internal > > format. file.write() would then default to writing UTF-16 data > > instead of UTF-8 data. This could result in strange errors > > due to the UTF-16 format being endian dependent. > > But this was the whole design. file.write() needs to be changed to > use s# when the file is open in binary mode and t# when the file is > open in text mode. Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). 
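A quick sketch of the intended behaviour (assuming the u'...' literal and the UTF-8 default encoding from the proposal -- this shows the design, it is not something that runs today):

    u = u'some text'

    f = open('out.txt', 'w')    # text mode -> "t#": bytes in the default encoding
    f.write(u)                  # under the proposal: the UTF-8 encoding of u

    g = open('out.bin', 'wb')   # binary mode -> "s#": the raw internal format
    g.write(u)                  # under the proposal: the UTF-16 representation of u
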
I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if being passed Unicode objects: this probably won't get us the "magical" Unicode integration we invision, since "t#" usage is not very wide spread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods... > > It would also break the symmetry between file.write(u) and > > unicode(file.read()), since the default encoding is not used as > > internal format for other reasons (see proposal). > > If the file is encoded using UTF-16 or UCS-2, you should open it in > binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the > app should read the first 2 bytes and check for a BOM and then decide > to choose bewteen 'utf-16-be' and 'utf-16-le'.) Right, that's the idea (there is a note on this in the Standard Codec section of the proposal). > > > Any of the following choices is acceptable (from the point of view of > > > not breaking the intended t# semantics; we can now start deciding > > > which we like best): > > > > I think we have already agreed on using UTF-8 for the default > > encoding. It has quite a few advantages. See > > > > http://czyborra.com/utf/ > > > > for a good overview of the pros and cons. > > Of course. I was just presenting the list as an argument that if > we changed our mind about the default encoding, t# should follow the > default encoding (and not pick an encoding by other means). Ok. > > > - utf-8 > > > - latin-1 > > > - ascii > > > - shift-jis > > > - lower byte of unicode ordinal > > > - some user- or os-specified multibyte encoding > > > > > > As far as t# is concerned, for encodings that don't encode all of > > > Unicode, untranslatable characters could be dealt with in any number > > > of ways (raise an exception, ignore, replace with '?', make best > > > effort, etc.). > > > > The usual Python way would be: raise an exception. This is what > > the proposal defines for Codecs in case an encoding/decoding > > mapping is not possible, BTW. (UTF-8 will always succeed on > > output.) > > Did you read Andy Robinson's case study? He suggested that for > certain encodings there may be other things you can do that are more > user-friendly than raising an exception, depending on the application. > I am proposing to leave this a detail of each specific translation. > There may even be translations that do the same thing except they have > a different behavior for untranslatable cases -- e.g. a strict version > that raises an exception and a non-strict version that replaces bad > characters with '?'. I think this is one of the powers of having an > extensible set of encodings. Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal. > > > Given the current context, it should probably be the same as the > > > default encoding -- i.e., utf-8. If we end up making the default > > > user-settable, we'll have to decide what to do with untranslatable > > > characters -- but that will probably be decided by the user too (it > > > would be a property of a specific translation specification). > > > > > > In any case, I feel that t# could receive a multi-byte encoding, > > > s# should receive raw binary data, and they should correspond to > > > getcharbuffer and getreadbuffer, respectively. > > > > Why would you want to have "s#" return the raw binary data for > > Unicode objects ? 
> > Because file.write() for a binary file, and other similar things > (e.g. the encryption engine example I mentioned above) must have > *some* way to get at the raw bits. What for ? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation because both encodings are in fact "internal" w/r to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed. > > Note that it is not mentioned anywhere that > > "s#" and "t#" do have to necessarily return different things > > (binary being a superset of text). I'd opt for "s#" and "t#" both > > returning UTF-8 data. This can be implemented by delegating the > > buffer slots to the <defencstr> object (see below). > > This would defeat the whole purpose of introducing t#. We might as > well drop t# then altogether if we adopt this. Well... yes ;-) > > > > Now Greg would chime in with the buffer interface and > > > > argue that it should make the underlying internal > > > > format accessible. This is a bad idea, IMHO, since you > > > > shouldn't really have to know what the internal data format > > > > is. > > > > > > This is for C code. Quite likely it *does* know what the internal > > > data format is! > > > > C code can use the PyUnicode_* APIs to access the data. I > > don't think that argument parsing is powerful enough to > > provide the C code with enough information about the data > > contents, e.g. it can only state the encoding length, not the > > string length. > > Typically, all the C code does is pass multibyte encoded strings on to > other library routines that know what to do to them, or simply give > them back unchanged at a later time. It is essential to know the > number of bytes, for memory allocation purposes. The number of > characters is totally immaterial (and multibyte-handling code knows > how to calculate the number of characters anyway). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Mon Nov 15 20:20:55 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:20:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> Message-ID: <38305D17.60EC94D0@lemburg.com> Andy Robinson wrote: > > Some thoughts on the codecs... > > 1. Stream interface > At the moment a codec has dump and load methods which > read a (slice of a) stream into a string in memory and > vice versa. As the proposal notes, this could lead to > errors if you take a slice out of a stream. This is > not just due to character truncation; some Asian > encodings are modal and have shift-in and shift-out > sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit > pointless to me as the source (or target) is still a > Unicode string in memory. > > This is a real problem - a filter to convert big files > between two encodings should be possible without > knowledge of the particular encoding, as should one on > the input/output of some server. We can still give a > default implementation for single-byte encodings. > > What's a good API for real stream conversion? 
just > Codec.encodeStream(infile, outfile) ? or is it more > useful to feed the codec with data a chunk at a time? The idea was to use Unicode as intermediate for all encoding conversions. What you invision here are stream recoders. The can easily be implemented as an useful addition to the Codec subclasses, but I don't think that these have to go into the core. > 2. Data driven codecs > I really like codecs being objects, and believe we > could build support for a lot more encodings, a lot > sooner than is otherwise possible, by making them data > driven rather making each one compiled C code with > static mapping tables. What do people think about the > approach below? > > First of all, the ISO8859-1 series are straight > mappings to Unicode code points. So one Python script > could parse these files and build the mapping table, > and a very small data file could hold these encodings. > A compiled helper function analogous to > string.translate() could deal with most of them. The problem with these large tables is that currently Python modules are not shared among processes since every process builds its own table. Static C data has the advantage of being shareable at the OS level. You can of course implement Python based lookup tables, but these should be too large... > Secondly, the double-byte ones involve a mixture of > algorithms and data. The worst cases I know are modal > encodings which need a single-byte lookup table, a > double-byte lookup table, and have some very simple > rules about escape sequences in between them. A > simple state machine could still handle these (and the > single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally > data-driven set of rules. > > Third, we can massively compress the mapping tables > using a notation which just lists contiguous ranges; > and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but > with an extra 'smiley' at 0XFE32". In these cases, a > script can build a family of related codecs in an > auditable manner. These are all great ideas, but I think they unnecessarily complicate the proposal. > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or > 'everything that Unicode 3.0 has a mapping for'. The > latter is going to add some weight to the > distribution. What are people's feelings? Do we ship > any at all apart from the Unicode ones? Should new > encodings be downloadable from www.python.org? Should > there be an optional package outside the main > distribution? Since Codecs can be registered at runtime, there is quite some potential there for extension writers coding their own fast codecs. E.g. one could use mxTextTools as codec engine working at C speeds. I would propose to only add some very basic encodings to the standard distribution, e.g. 
the ones mentioned under Standard Codecs in the proposal: 'utf-8': 8-bit variable length encoding 'utf-16': 16-bit variable length encoding (litte/big endian) 'utf-16-le': utf-16 but explicitly little endian 'utf-16-be': utf-16 but explicitly big endian 'ascii': 7-bit ASCII codepage 'latin-1': Latin-1 codepage 'html-entities': Latin-1 + HTML entities; see htmlentitydefs.py from the standard Pythin Lib 'jis' (a popular version XXX): Japanese character encoding 'unicode-escape': See Unicode Constructors for a definition 'native': Dump of the Internal Format used by Python Perhaps not even 'html-entities' (even though it would make a cool replacement for cgi.escape()) and maybe we should also place the JIS encoding into a separate Unicode package. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Mon Nov 15 20:26:16 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 20:26:16 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF2C@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <38305E58.28B20E24@lemburg.com> "Da Silva, Mike" wrote: > > Andy Robinson wrote: > -- > 1. Stream interface > At the moment a codec has dump and load methods which read a (slice of a) > stream into a string in memory and vice versa. As the proposal notes, this > could lead to errors if you take a slice out of a stream. This is not just > due to character truncation; some Asian encodings are modal and have > shift-in and shift-out sequences as they move from Western single-byte > characters to double-byte ones. It also seems a bit pointless to me as the > source (or target) is still a Unicode string in memory. > This is a real problem - a filter to convert big files between two encodings > should be possible without knowledge of the particular encoding, as should > one on the input/output of some server. We can still give a default > implementation for single-byte encodings. > What's a good API for real stream conversion? just > Codec.encodeStream(infile, outfile) ? or is it more useful to feed the > codec with data a chunk at a time? > -- > A user defined chunking factor (suitably defaulted) would be useful for > processing large files. > -- > 2. Data driven codecs > I really like codecs being objects, and believe we could build support for a > lot more encodings, a lot sooner than is otherwise possible, by making them > data driven rather making each one compiled C code with static mapping > tables. What do people think about the approach below? > First of all, the ISO8859-1 series are straight mappings to Unicode code > points. So one Python script could parse these files and build the mapping > table, and a very small data file could hold these encodings. A compiled > helper function analogous to string.translate() could deal with most of > them. > Secondly, the double-byte ones involve a mixture of algorithms and data. > The worst cases I know are modal encodings which need a single-byte lookup > table, a double-byte lookup table, and have some very simple rules about > escape sequences in between them. A simple state machine could still handle > these (and the single-byte mappings above become extra-simple special > cases); I could imagine feeding it a totally data-driven set of rules. 
> Third, we can massively compress the mapping tables using a notation which > just lists contiguous ranges; and very often there are relationships between > encodings. For example, "cpXYZ is just like cpXYY but with an extra > 'smiley' at 0XFE32". In these cases, a script can build a family of related > codecs in an auditable manner. > -- > The problem here is that we need to decide whether we are Unicode-centric, > or whether Unicode is just another supported encoding. If we are > Unicode-centric, then all code-page translations will require static mapping > tables between the appropriate Unicode character and the relevant code > points in the other encoding. This would involve (worst case) 64k static > tables for each supported encoding. Unfortunately this also precludes the > use of algorithmic conversions and or sparse conversion tables because most > of these transformations are relative to a source and target non-Unicode > encoding, eg JIS <---->EUCJIS. If we are taking the IBM approach (see > CDRA), then we can mix and match approaches, and treat Unicode strings as > just Unicode, and normal strings as being any arbitrary MBCS encoding. > > To guarantee the utmost interoperability and Unicode 3.0 (and beyond) > compliance, we should probably assume that all core encodings are relative > to Unicode as the pivot encoding. This should hopefully avoid any gotcha's > with roundtrips between any two arbitrary native encodings. The downside is > this will probably be slower than an optimised algorithmic transformation. Optimizations should go into separate packages for direct EncodingA -> EncodingB conversions. I don't think we need them in the core. > -- > 3. What encodings to distribute? > The only clean answers to this are 'almost none', or 'everything that > Unicode 3.0 has a mapping for'. The latter is going to add some weight to > the distribution. What are people's feelings? Do we ship any at all apart > from the Unicode ones? Should new encodings be downloadable from > www.python.org <http://www.python.org> ? Should there be an optional > package outside the main distribution? > -- > Ship with Unicode encodings in the core, the rest should be an add on > package. > > If we are truly Unicode-centric, this gives us the most value in terms of > accessing a Unicode character properties database, which will provide > language neutral case folding, Hankaku <----> Zenkaku folding (Japan > specific), and composition / normalisation between composed characters and > their component nonspacing characters. >From the proposal: """ Unicode Character Properties: ----------------------------- A separate module "unicodedata" should provide a compact interface to all Unicode character properties defined in the standard's UnicodeData.txt file. Among other things, these properties provide ways to recognize numbers, digits, spaces, whitespace, etc. Since this module will have to provide access to all Unicode characters, it will eventually have to contain the data from UnicodeData.txt which takes up around 200kB. For this reason, the data should be stored in static C data. This enables compilation as shared module which the underlying OS can shared between processes (unlike normal Python code modules). XXX Define the interface... """ Special CJK packages can then access this data for the purposes you mentioned above. 
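As a rough illustration, such an interface might end up looking something like this (the module name is the one given in the proposal; the function names below are guesses, since the interface is still marked XXX):

    import unicodedata              # module name as given in the proposal

    ch = u'3'
    unicodedata.category(ch)        # e.g. 'Nd' -- decimal digit
    unicodedata.decimal(ch)         # 3
    unicodedata.numeric(ch)         # 3.0
    unicodedata.bidirectional(ch)   # e.g. 'EN' -- European number
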
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Mon Nov 15 22:37:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:37:28 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100." <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us> > Andy Robinson wrote: > > > > Some thoughts on the codecs... > > > > 1. Stream interface > > At the moment a codec has dump and load methods which > > read a (slice of a) stream into a string in memory and > > vice versa. As the proposal notes, this could lead to > > errors if you take a slice out of a stream. This is > > not just due to character truncation; some Asian > > encodings are modal and have shift-in and shift-out > > sequences as they move from Western single-byte > > characters to double-byte ones. It also seems a bit > > pointless to me as the source (or target) is still a > > Unicode string in memory. > > > > This is a real problem - a filter to convert big files > > between two encodings should be possible without > > knowledge of the particular encoding, as should one on > > the input/output of some server. We can still give a > > default implementation for single-byte encodings. > > > > What's a good API for real stream conversion? just > > Codec.encodeStream(infile, outfile) ? or is it more > > useful to feed the codec with data a chunk at a time? M.-A. Lemburg responds: > The idea was to use Unicode as intermediate for all > encoding conversions. > > What you invision here are stream recoders. The can > easily be implemented as an useful addition to the Codec > subclasses, but I don't think that these have to go > into the core. What I wanted was a codec API that acts somewhat like a buffered file; the buffer makes it possible to efficient handle shift states. This is not exactly what Andy shows, but it's not what Marc's current spec has either. I had thought something more like what Java does: an output stream codec's constructor takes a writable file object and the object returned by the constructor has a write() method, a flush() method and a close() method. It acts like a buffering interface to the underlying file; this allows it to generate the minimal number of shift sequeuces. Similar for input stream codecs. Andy's file translation example could then be written as follows: # assuming variables input_file, input_encoding, output_file, # output_encoding, and constant BUFFER_SIZE f = open(input_file, "rb") f1 = unicodec.codecs[input_encoding].stream_reader(f) g = open(output_file, "wb") g1 = unicodec.codecs[output_encoding].stream_writer(f) while 1: buffer = f1.read(BUFFER_SIZE) if not buffer: break f2.write(buffer) f2.close() f1.close() Note that we could possibly make these the only API that a codec needs to provide; the string object <--> unicode object conversions can be done using this and the cStringIO module. (On the other hand it seems a common case that would be quite useful.) > > 2. 
Data driven codecs > > I really like codecs being objects, and believe we > > could build support for a lot more encodings, a lot > > sooner than is otherwise possible, by making them data > > driven rather making each one compiled C code with > > static mapping tables. What do people think about the > > approach below? > > > > First of all, the ISO8859-1 series are straight > > mappings to Unicode code points. So one Python script > > could parse these files and build the mapping table, > > and a very small data file could hold these encodings. > > A compiled helper function analogous to > > string.translate() could deal with most of them. > > The problem with these large tables is that currently > Python modules are not shared among processes since > every process builds its own table. > > Static C data has the advantage of being shareable at > the OS level. Don't worry about it. 128K is too small to care, I think... > You can of course implement Python based lookup tables, > but these should be too large... > > > Secondly, the double-byte ones involve a mixture of > > algorithms and data. The worst cases I know are modal > > encodings which need a single-byte lookup table, a > > double-byte lookup table, and have some very simple > > rules about escape sequences in between them. A > > simple state machine could still handle these (and the > > single-byte mappings above become extra-simple special > > cases); I could imagine feeding it a totally > > data-driven set of rules. > > > > Third, we can massively compress the mapping tables > > using a notation which just lists contiguous ranges; > > and very often there are relationships between > > encodings. For example, "cpXYZ is just like cpXYY but > > with an extra 'smiley' at 0XFE32". In these cases, a > > script can build a family of related codecs in an > > auditable manner. > > These are all great ideas, but I think they unnecessarily > complicate the proposal. Agreed, let's leave the *implementation* of codecs out of the current efforts. However I want to make sure that the *interface* to codecs is defined right, because changing it will be expensive. (This is Linus Torvald's philosophy on drivers -- he doesn't care about bugs in drivers, as they will get fixed; however he greatly cares about defining the driver APIs correctly.) > > 3. What encodings to distribute? > > The only clean answers to this are 'almost none', or > > 'everything that Unicode 3.0 has a mapping for'. The > > latter is going to add some weight to the > > distribution. What are people's feelings? Do we ship > > any at all apart from the Unicode ones? Should new > > encodings be downloadable from www.python.org? Should > > there be an optional package outside the main > > distribution? > > Since Codecs can be registered at runtime, there is quite > some potential there for extension writers coding their > own fast codecs. E.g. one could use mxTextTools as codec > engine working at C speeds. (Do you think you'll be able to extort some money from HP for these? :-) > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > > Perhaps not even 'html-entities' (even though it would make > a cool replacement for cgi.escape()) and maybe we should > also place the JIS encoding into a separate Unicode package. I'd drop html-entities, it seems too cutesie. (And who uses these anyway, outside browsers?) For JIS (shift-JIS?) I hope that Andy can help us with some pointers and validation. And unicode-escape: now that you mention it, this is a section of the proposal that I don't understand. I quote it here: | Python should provide a built-in constructor for Unicode strings which | is available through __builtins__: | | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ What do you mean by this notation? Since encoding names are not always legal Python identifiers (most contain hyphens), I don't understand what you really meant here. Do you mean to say that it has to be a keyword argument? I would disagree; and then I would have expected the notation [,encoding=<default encoding>]. | With the 'unicode-escape' encoding being defined as: | | u = u'<unicode-escape encoded Python string>' | | ? for single characters (and this includes all \XXX sequences except \uXXXX), | take the ordinal and interpret it as Unicode ordinal; | | ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX | instead, e.g. \u03C0 to represent the character Pi. I've looked at this several times and I don't see the difference between the two bullets. (Ironically, you are using a non-ASCII character here that doesn't always display, depending on where I look at your mail :-). Can you give some examples? Is u'\u0020' different from u'\x20' (a space)? Does '\u0020' (no u prefix) have a meaning? Also, I remember reading Tim Peters who suggested that a "raw unicode" notation (ur"...") might be necessary, to encode regular expressions. I tend to agree. While I'm on the topic, I don't see in your proposal a description of the source file character encoding. Currently, this is undefined, and in fact can be (ab)used to enter non-ASCII in string literals. For example, a programmer named Fran?ois might write a file containing this statement: print "Written by Fran?ois." # (There's a cedilla in there!) (He assumes his source character encoding is Latin-1, and he doesn't want to have to type \347 when he can type a cedilla on his keyboard.) If his source file (or .pyc file!) is executed by a Japanese user, this will probably print some garbage. Using the new Unicode strings, Fran?ois could change his program as follows: print unicode("Written by Fran?ois.", "latin-1") Assuming that Fran?ois sets his sys.stdout to use Latin-1, while the Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). But when the Japanese user views Fran?ois' source file, he will again see garbage. 
If he uses a generic tool to translate latin-1 files to shift-JIS (assuming shift-JIS has a cedilla character) the program will no longer work correctly -- the string "latin-1" has to be changed to "shift-jis". What should we do about this? The safest and most radical solution is to disallow non-ASCII source characters; Fran?ois will then have to type print u"Written by Fran\u00E7ois." but, knowing Fran?ois, he probably won't like this solution very much (since he didn't like the \347 version either). --Guido van Rossum (home page: http://www.python.org/~guido/) From andy at robanal.demon.co.uk Mon Nov 15 22:41:21 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:41:21 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38305D17.60EC94D0@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <38307984.12653394@post.demon.co.uk> On Mon, 15 Nov 1999 20:20:55 +0100, you wrote: >These are all great ideas, but I think they unnecessarily >complicate the proposal. However, to claim that Python is properly internationalized, we will need a large number of multi-byte encodings to be available. It's a large amount of work, it must be provably correct, and someone's going to have to do it. So if anyone with more C expertise than me - not hard :-) - is interested I'm not suggesting putting my points in the Unicode proposal - in fact, I'm very happy we have a proposal which allows for extension, and lets us work on the encodings separately (and later). >Since Codecs can be registered at runtime, there is quite >some potential there for extension writers coding their >own fast codecs. E.g. one could use mxTextTools as codec >engine working at C speeds. Exactly my thoughts , although I was thinking of a more slimmed down and specialized one. The right tool might be usable for things like compression algorithms too. Separate project to the Unicode stuff, but if anyone is interested, talk to me. >I would propose to only add some very basic encodings to >the standard distribution, e.g. the ones mentioned under >Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python > Leave JISXXX and the CJK stuff out. If you get into Japanese, you really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there are lots of options about how to do it. The other ones are algorithmic and can be small and fast and fit into the core. Ditto with HTML, and maybe even escaped-unicode too. In summary, the current discussion is clearly doing the right things, but is only covering a small percentage of what needs to be done to internationalize Python fully. - Andy From guido at CNRI.Reston.VA.US Mon Nov 15 22:49:26 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Mon, 15 Nov 1999 16:49:26 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT." 
<38307984.12653394@post.demon.co.uk> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us> > In summary, the current discussion is clearly doing the right things, > but is only covering a small percentage of what needs to be done to > internationalize Python fully. Agreed. So let's focus on defining interfaces that are correct and convenient so others who want to add codecs won't have to fight our architecture! Is the current architecture good enough so that the Japanese codecs will fit in it? (I'm particularly worried about the stream codecs, see my previous message.) --Guido van Rossum (home page: http://www.python.org/~guido/) From andy at robanal.demon.co.uk Mon Nov 15 22:58:34 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 21:58:34 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us> Message-ID: <3831806d.14422147@post.demon.co.uk> On Mon, 15 Nov 1999 16:49:26 -0500, you wrote: >> In summary, the current discussion is clearly doing the right things, >> but is only covering a small percentage of what needs to be done to >> internationalize Python fully. > >Agreed. So let's focus on defining interfaces that are correct and >convenient so others who want to add codecs won't have to fight our >architecture! > >Is the current architecture good enough so that the Japanese codecs >will fit in it? (I'm particularly worried about the stream codecs, >see my previous message.) > No, I don't think it is good enough. We need a stream codec, and as you said the string and file interfaces can be built out of that. You guys will know better than me what the best patterns for that are... - Andy From andy at robanal.demon.co.uk Mon Nov 15 23:30:53 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Mon, 15 Nov 1999 22:30:53 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <383086da.16067684@post.demon.co.uk> On Mon, 15 Nov 1999 16:37:28 -0500, you wrote: ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(f) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > >f2.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) Perfect. I'd keep the string ones - easy to implement but a big convenience. 
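As quoted, the loop has a couple of variable slips: the stream_writer wraps f where it should wrap g, and the body and the final close() call say f2 where g1 is meant. Still assuming the proposed unicodec.codecs registry, the intended version is presumably:

    f = open(input_file, "rb")
    f1 = unicodec.codecs[input_encoding].stream_reader(f)
    g = open(output_file, "wb")
    g1 = unicodec.codecs[output_encoding].stream_writer(g)   # wrap g, not f

    while 1:
        buffer = f1.read(BUFFER_SIZE)
        if not buffer:
            break
        g1.write(buffer)   # the original said f2.write()

    g1.close()
    f1.close()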
The proposal also says: >For explicit handling of Unicode using files, the unicodec module >could provide stream wrappers which provide transparent >encoding/decoding for any open stream (file-like object): > > import unicodec > file = open('mytext.txt','rb') > ufile = unicodec.stream(file,'utf-16') > u = ufile.read() > ... > ufile.close() It seems to me that if we go for stream_reader, it replaces this bit of the proposal too - no need for unicodec to provide anything. If you want to have a convenience function there to save a line or two, you could have unicodec.open(filename, mode, encoding) which returned a stream_reader. - Andy From mal at lemburg.com Mon Nov 15 23:54:38 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Mon, 15 Nov 1999 23:54:38 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <38308F2E.44B9C6BF@lemburg.com> [I'll get back on this tomorrow, just some quick notes here...] Guido van Rossum wrote: > > > Andy Robinson wrote: > > > > > > Some thoughts on the codecs... > > > > > > 1. Stream interface > > > At the moment a codec has dump and load methods which > > > read a (slice of a) stream into a string in memory and > > > vice versa. As the proposal notes, this could lead to > > > errors if you take a slice out of a stream. This is > > > not just due to character truncation; some Asian > > > encodings are modal and have shift-in and shift-out > > > sequences as they move from Western single-byte > > > characters to double-byte ones. It also seems a bit > > > pointless to me as the source (or target) is still a > > > Unicode string in memory. > > > > > > This is a real problem - a filter to convert big files > > > between two encodings should be possible without > > > knowledge of the particular encoding, as should one on > > > the input/output of some server. We can still give a > > > default implementation for single-byte encodings. > > > > > > What's a good API for real stream conversion? just > > > Codec.encodeStream(infile, outfile) ? or is it more > > > useful to feed the codec with data a chunk at a time? > > M.-A. Lemburg responds: > > > The idea was to use Unicode as intermediate for all > > encoding conversions. > > > > What you invision here are stream recoders. The can > > easily be implemented as an useful addition to the Codec > > subclasses, but I don't think that these have to go > > into the core. > > What I wanted was a codec API that acts somewhat like a buffered file; > the buffer makes it possible to efficient handle shift states. This > is not exactly what Andy shows, but it's not what Marc's current spec > has either. > > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequeuces. Similar for input stream codecs. The Codecs provide implementations for encoding and decoding, they are not intended as complete wrappers for e.g. files or sockets. The unicodec module will define a generic stream wrapper (which is yet to be defined) for dealing with files, sockets, etc. It will use the codec registry to do the actual codec work. 
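One possible shape for that yet-to-be-defined wrapper, written against an assumed registry mapping encoding names to codec objects with encode()/decode() methods (none of the names below are part of the proposal):

    class UnicodeStream:
        "Wrap an open byte stream; convert to/from Unicode on read()/write()."
        def __init__(self, stream, encoding):
            self.stream = stream
            self.codec = codecs[encoding]    # assumed registry lookup
        def read(self, size=-1):
            return self.codec.decode(self.stream.read(size))
        def write(self, ustring):
            self.stream.write(self.codec.encode(ustring))
        def close(self):
            self.stream.close()

    def uopen(filename, mode, encoding):
        "Convenience helper in the spirit of the unicodec.open() idea above."
        return UnicodeStream(open(filename, mode), encoding)

Note that this naive version decodes whatever chunk read() happens to return, so a boundary that splits a multi-byte or shift sequence would break it -- which is exactly why the buffering behaviour Guido describes has to live inside the codec.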
>From the proposal: """ For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object): import unicodec file = open('mytext.txt','rb') ufile = unicodec.stream(file,'utf-16') u = ufile.read() ... ufile.close() XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed. XXX Specify the wrapper(s)... Open issues: what to do with Python strings fed to the .write() method (may need to know the encoding of the strings) and when/if to return Python strings through the .read() method. Perhaps we need more than one type of wrapper here. """ > Andy's file translation example could then be written as follows: > > # assuming variables input_file, input_encoding, output_file, > # output_encoding, and constant BUFFER_SIZE > > f = open(input_file, "rb") > f1 = unicodec.codecs[input_encoding].stream_reader(f) > g = open(output_file, "wb") > g1 = unicodec.codecs[output_encoding].stream_writer(f) > > while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > > f2.close() > f1.close() > Note that we could possibly make these the only API that a codec needs > to provide; the string object <--> unicode object conversions can be > done using this and the cStringIO module. (On the other hand it seems > a common case that would be quite useful.) You wouldn't want to go via cStringIO for *every* encoding translation. The Codec interface defines two pairs of methods on purpose: one which works internally (ie. directly between strings and Unicode objects), and one which works externally (directly between a stream and Unicode objects). > > > 2. Data driven codecs > > > I really like codecs being objects, and believe we > > > could build support for a lot more encodings, a lot > > > sooner than is otherwise possible, by making them data > > > driven rather making each one compiled C code with > > > static mapping tables. What do people think about the > > > approach below? > > > > > > First of all, the ISO8859-1 series are straight > > > mappings to Unicode code points. So one Python script > > > could parse these files and build the mapping table, > > > and a very small data file could hold these encodings. > > > A compiled helper function analogous to > > > string.translate() could deal with most of them. > > > > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... Huh ? 128K for every process using Python ? That quickly sums up to lots of megabytes lying around pretty much unused. > > You can of course implement Python based lookup tables, > > but these should be too large... > > > > > Secondly, the double-byte ones involve a mixture of > > > algorithms and data. The worst cases I know are modal > > > encodings which need a single-byte lookup table, a > > > double-byte lookup table, and have some very simple > > > rules about escape sequences in between them. A > > > simple state machine could still handle these (and the > > > single-byte mappings above become extra-simple special > > > cases); I could imagine feeding it a totally > > > data-driven set of rules. 
> > > > > > Third, we can massively compress the mapping tables > > > using a notation which just lists contiguous ranges; > > > and very often there are relationships between > > > encodings. For example, "cpXYZ is just like cpXYY but > > > with an extra 'smiley' at 0XFE32". In these cases, a > > > script can build a family of related codecs in an > > > auditable manner. > > > > These are all great ideas, but I think they unnecessarily > > complicate the proposal. > > Agreed, let's leave the *implementation* of codecs out of the current > efforts. > > However I want to make sure that the *interface* to codecs is defined > right, because changing it will be expensive. (This is Linus > Torvald's philosophy on drivers -- he doesn't care about bugs in > drivers, as they will get fixed; however he greatly cares about > defining the driver APIs correctly.) > > > > 3. What encodings to distribute? > > > The only clean answers to this are 'almost none', or > > > 'everything that Unicode 3.0 has a mapping for'. The > > > latter is going to add some weight to the > > > distribution. What are people's feelings? Do we ship > > > any at all apart from the Unicode ones? Should new > > > encodings be downloadable from www.python.org? Should > > > there be an optional package outside the main > > > distribution? > > > > Since Codecs can be registered at runtime, there is quite > > some potential there for extension writers coding their > > own fast codecs. E.g. one could use mxTextTools as codec > > engine working at C speeds. > > (Do you think you'll be able to extort some money from HP for these? :-) Don't know, it depends on what their specs look like. I use mxTextTools for fast HTML file processing. It uses a small Turing machine with some extra magic and is progammable via Python tuples. > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > > > Perhaps not even 'html-entities' (even though it would make > > a cool replacement for cgi.escape()) and maybe we should > > also place the JIS encoding into a separate Unicode package. > > I'd drop html-entities, it seems too cutesie. (And who uses these > anyway, outside browsers?) Ok. > For JIS (shift-JIS?) I hope that Andy can help us with some pointers > and validation. > > And unicode-escape: now that you mention it, this is a section of > the proposal that I don't understand. I quote it here: > > | Python should provide a built-in constructor for Unicode strings which > | is available through __builtins__: > | > | u = unicode(<encoded Python string>[,<encoding name>=<default encoding>]) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ I meant this as optional second argument defaulting to whatever we define <default encoding> to mean, e.g. 'utf-8'. u = unicode("string","utf-8") == unicode("string") The <encoding name> argument must be a string identifying one of the registered codecs. 
> | With the 'unicode-escape' encoding being defined as: > | > | u = u'<unicode-escape encoded Python string>' > | > | ? for single characters (and this includes all \XXX sequences except \uXXXX), > | take the ordinal and interpret it as Unicode ordinal; > | > | ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX > | instead, e.g. \u03C0 to represent the character Pi. > > I've looked at this several times and I don't see the difference > between the two bullets. (Ironically, you are using a non-ASCII > character here that doesn't always display, depending on where I look > at your mail :-). The first bullet covers the normal Python string characters and escapes, e.g. \n and \267 (the center dot ;-), while the second explains how \uXXXX is interpreted. > Can you give some examples? > > Is u'\u0020' different from u'\x20' (a space)? No, they both map to the same Unicode ordinal. > Does '\u0020' (no u prefix) have a meaning? No, \uXXXX is only defined for u"" strings or strings that are used to build Unicode objects with this encoding: u = u'\u0020' == unicode(r'\u0020','unicode-escape') Note that writing \uXX is an error, e.g. u"\u12 " will cause cause a syntax error. Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' but instead '\x10' -- is this intended ? > Also, I remember reading Tim Peters who suggested that a "raw unicode" > notation (ur"...") might be necessary, to encode regular expressions. > I tend to agree. This can be had via unicode(): u = unicode(r'\a\b\c\u0020','unicode-escaped') If that's too long, define a ur() function which wraps up the above line in a function. > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. For > example, a programmer named Fran?ois might write a file containing > this statement: > > print "Written by Fran?ois." # (There's a cedilla in there!) > > (He assumes his source character encoding is Latin-1, and he doesn't > want to have to type \347 when he can type a cedilla on his keyboard.) > > If his source file (or .pyc file!) is executed by a Japanese user, > this will probably print some garbage. > > Using the new Unicode strings, Fran?ois could change his program as > follows: > > print unicode("Written by Fran?ois.", "latin-1") > > Assuming that Fran?ois sets his sys.stdout to use Latin-1, while the > Japanese user sets his to shift-JIS (or whatever his kanjiterm uses). > > But when the Japanese user views Fran?ois' source file, he will again > see garbage. If he uses a generic tool to translate latin-1 files to > shift-JIS (assuming shift-JIS has a cedilla character) the program > will no longer work correctly -- the string "latin-1" has to be > changed to "shift-jis". > > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; Fran?ois will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing Fran?ois, he probably won't like this solution very much > (since he didn't like the \347 version either). I think best is to leave it undefined... as with all files, only the programmer knows what format and encoding it contains, e.g. 
a Japanese programmer might want to use a shift-JIS editor to enter strings directly in shift-JIS via u = unicode("...shift-JIS encoded text...","shift-jis") Of course, this is not readable using an ASCII editor, but Python will continue to produce the intended string. NLS strings don't belong into program text anyway: i10n usually takes the gettext() approach to handle these issues. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy at robanal.demon.co.uk Tue Nov 16 01:09:28 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Tue, 16 Nov 1999 00:09:28 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> Message-ID: <3839a078.22625844@post.demon.co.uk> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: >[I'll get back on this tomorrow, just some quick notes here...] >The Codecs provide implementations for encoding and decoding, >they are not intended as complete wrappers for e.g. files or >sockets. > >The unicodec module will define a generic stream wrapper >(which is yet to be defined) for dealing with files, sockets, >etc. It will use the codec registry to do the actual codec >work. > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > also assures that <mode> contains the 'b' character when needed. > >The Codec interface defines two pairs of methods >on purpose: one which works internally (ie. directly between >strings and Unicode objects), and one which works externally >(directly between a stream and Unicode objects). That's the problem Guido and I are worried about. Your present API is not enough to build stream encoders. The 'slurp it into a unicode string in one go' approach fails for big files or for network connections. And you just cannot build a generic stream reader/writer by slicing it into strings. The solution must be specific to the codec - only it knows how much to buffer, when to flip states etc. So the codec should provide proper stream reading and writing services. Unicodec can then wrap those up in labour-saving ways - I'm not fussy which but I like the one-line file-open utility. - Andy From tim_one at email.msn.com Tue Nov 16 06:38:32 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:32 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <382AE7D9.147D58CB@lemburg.com> Message-ID: <000001bf2ff4$d36e2540$042d153f@tim> [MAL] > I wonder how we could add %-formatting to Unicode strings without > duplicating the PyString_Format() logic. > > First, do we need Unicode object %-formatting at all ? Sure -- in the end, all the world speaks Unicode natively and encodings become historical baggage. Granted I won't live that long, but I may last long enough to see encodings become almost purely an I/O hassle, with all computation done in Unicode. > Second, here is an emulation using strings and <default encoding> > that should give an idea of one could work with the different > encodings: > > s = '%s %i abc???' # a Latin-1 encoded string > t = (u,3) What's u? A Unicode object? Another Latin-1 string? A default-encoded string? 
How does the following know the difference? > # Convert Latin-1 s to a <default encoding> string via Unicode > s1 = unicode(s,'latin-1').encode() > > # The '%s' will now add u in <default encoding> > s2 = s1 % t > > # Finally, convert the <default encoding> encoded string to Unicode > u1 = unicode(s2) I don't expect this actually works: for example, change %s to %4s. Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to know that some (or all) characters in u consume multiple bytes, so can't extract "the right" number of bytes from u. I think % formating has to know the truth of what you're doing. > Note that .encode() defaults to the current setting of > <default encoding>. > > Provided u maps to Latin-1, an alternative would be: > > u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') More interesting is fmt % tuple where everything is Unicode; people can muck with Latin-1 directly today using regular strings, so the example above mostly shows artificial convolution. From tim_one at email.msn.com Tue Nov 16 06:38:40 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:38:40 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <382BDD81.458D3125@lemburg.com> Message-ID: <000101bf2ff4$d636bb20$042d153f@tim> [MAL, on raw Unicode strings] > ... > Agreed... note that you could also write your own codec for just this > reason and then use: > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > Put that into a function called 'ur' and you have: > > u = ur('...\u4545...\...\...') > > which is not that far away from ur'...' w/r to cosmetics. Well, not quite. In general you need to pass raw strings: u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') ^ u = ur(r'...\u4545...\...\...') ^ else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion). From tim_one at email.msn.com Tue Nov 16 06:44:42 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:44:42 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911120225080.27203-100000@nebula.lyra.org> Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim> [Tim, wonders why Perl and Tcl went w/ UTF-8 internally] [Greg Stein] > Probably for the exact reason that you stated in your messages: many > 8-bit (7-bit?) functions continue to work quite well when given a > UTF-8-encoded string. i.e. they didn't have to rewrite the entire > Perl/TCL interpreter to deal with a new string type. > > I'd guess it is a helluva lot easier for us to add a Python Type than > for Perl or TCL to whack around with new string types (since they use > strings so heavily). Sounds convincing to me! Bumped into an old thread on c.l.p.m. that suggested Perl was also worried about UCS-2's 64K code point limit. But I'm already on record as predicting we'll regret any decision <wink>. From tim_one at email.msn.com Tue Nov 16 06:52:12 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:52:12 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <DBF3B37F7BF1D111B2A10000F6B14B1FDDAF22@ukhil704nts.hld.uk.fid-intl.com> Message-ID: <000501bf2ff6$ba943a80$042d153f@tim> [Da Silva, Mike] > ... > 5. 
UTF-16 requires string operations that do not make assumptions > about nulls - this means re-implementing most of the C runtime > functions to work with unsigned shorts. Python strings are already null-friendly, so Python has already recoded everything it needs to get away from the no-null assumption; stropmodule.c is < 1,500 lines of code, and MAL can turn it into C++ template functions in his sleep <wink -- but stuff "like this" really is easier in C++>. From tim_one at email.msn.com Tue Nov 16 06:56:18 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 00:56:18 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com> Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim> [Andy Robinson] > ... > I presume no one is actually advocating dropping > ordinary Python strings, or the ability to do > rawdata = open('myfile.txt', 'rb').read() > without any transformations? If anyone has advocated either, they've successfully hidden it from me. Anyone? From tim_one at email.msn.com Tue Nov 16 07:09:04 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:09:04 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382BF6C3.D79840EC@lemburg.com> Message-ID: <000701bf2ff9$15cecda0$042d153f@tim> [MAL] > BTW, wouldn't it be possible to take pcre and have it > use Py_Unicode instead of char ? [Of course, there would have to > be some extensions for character classes etc.] No, alas. The assumption that characters are 8 bits is ubiquitous, in both obvious and subtle ways. if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; From tim_one at email.msn.com Tue Nov 16 07:19:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:19:16 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <382C3749.198EEBC6@lemburg.com> Message-ID: <000801bf2ffa$82273400$042d153f@tim> [MAL] > sys.bom should return the byte order mark (BOM) for the format used > internally. The unicodec module should provide symbols for all > possible values of this variable: > > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) > > BOM_LE: '\377\376' > (corresponds to Unicode 0x0000FFFE in UTF-16 > == illegal Unicode character) > > BOM4_BE: '\000\000\377\376' > (corresponds to Unicode 0x0000FEFF in UCS-4) Should be BOM4_BE: '\000\000\376\377' > BOM4_LE: '\376\377\000\000' > (corresponds to Unicode 0x0000FFFE in UCS-4) Should be BOM4_LE: '\377\376\000\000' From tim_one at email.msn.com Tue Nov 16 07:31:39 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:31:39 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us> Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim> [Fred L. Drake, Jr.] > ... > I wasn't suggesting the PyStringObject be changed, only that the > PyUnicodeObject could maintain a reference. Consider: > > s = fp.read() > u = unicode(s, 'utf-8') > > u would now hold a reference to s, and s/s# would return a pointer > into s instead of re-building the UTF-8 form. I talked myself out of > this because it would be too easy to keep a lot more string objects > around than were actually needed. Yet another use for a weak reference <0.5 wink>. 
From tim_one at email.msn.com Tue Nov 16 07:41:44 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 01:41:44 -0500 Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <Pine.LNX.4.10.9911121519440.2535-100000@nebula.lyra.org> Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim> [MAL] > BOM_BE: '\376\377' > (corresponds to Unicode 0x0000FEFF in UTF-16 > == ZERO WIDTH NO-BREAK SPACE) [Greg Stein] > Are you sure about that interpretation? I thought the BOM characters > (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space. I can't speak to MAL's degree of certainty <wink>, but he's right about this stuff. There is only one BOM character, U+FEFF, which is the zero-width no-break space. The byte-swapped form is not only reserved, it's guaranteed never to be assigned to a character. From tim_one at email.msn.com Tue Nov 16 08:47:06 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:06 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <000d01bf3006$c7823700$042d153f@tim> [Guido] > ... > While I'm on the topic, I don't see in your proposal a description of > the source file character encoding. Currently, this is undefined, and > in fact can be (ab)used to enter non-ASCII in string literals. > ... > What should we do about this? The safest and most radical solution is > to disallow non-ASCII source characters; Fran?ois will then have to > type > > print u"Written by Fran\u00E7ois." > > but, knowing Fran?ois, he probably won't like this solution very much > (since he didn't like the \347 version either). So long as Python opens source files using libc text mode, it can't guarantee more than C does: the presence of any character other than tab, newline, and ASCII 32-126 inclusive renders the file contents undefined. Go beyond that, and you've got the same problem as mailers and browsers, and so also the same solution: open source files in binary mode, and add a pragma specifying the intended charset. As a practical matter, declare that Python source is Latin-1 for now, and declare any *system* that doesn't support that non-conforming <wink>. python-is-the-measure-of-all-things-ly y'rs - tim From tim_one at email.msn.com Tue Nov 16 08:47:08 1999 From: tim_one at email.msn.com (Tim Peters) Date: Tue, 16 Nov 1999 02:47:08 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38308F2E.44B9C6BF@lemburg.com> Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim> [Guido] >> Does '\u0020' (no u prefix) have a meaning? [MAL] > No, \uXXXX is only defined for u"" strings or strings that are > used to build Unicode objects with this encoding: I believe your intent is that '\u0020' be exactly those 6 characters, just as today. That is, it does have a meaning, but its meaning differs between Unicode string literals and regular string literals. > Note that writing \uXX is an error, e.g. u"\u12 " will cause > cause a syntax error. Although I believe your intent <wink> is that, just as today, '\u12' is not an error. > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > but instead '\x10' -- is this intended ? Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee for not defining \x in a platform-independent way. Note that a Python \x escape consumes *all* following hex characters, no matter how many -- and ignores all but the last two. 
> This [raw Unicode strings] can be had via unicode(): > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > If that's too long, define a ur() function which wraps up the > above line in a function. As before, I think that's fine for now, but won't stand forever. From fredrik at pythonware.com Tue Nov 16 09:39:20 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:39:20 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > I had thought something more like what Java does: an output stream > codec's constructor takes a writable file object and the object > returned by the constructor has a write() method, a flush() method and > a close() method. It acts like a buffering interface to the > underlying file; this allows it to generate the minimal number of > shift sequeuces. Similar for input stream codecs. note that the html/sgml/xml parsers generally support the feed/close protocol. to be able to use these codecs in that context, we need 1) codes written according to the "data consumer model", instead of the "stream" model. class myDecoder: def __init__(self, target): self.target = target self.state = ... def feed(self, data): ... extract as much data as possible ... self.target.feed(extracted data) def close(self): ... extract what's left ... self.target.feed(additional data) self.target.close() or 2) make threads mandatory, just like in Java. or 3) add light-weight threads (ala stackless python) to the interpreter... (I vote for alternative 3, but that's another story ;-) </F> From fredrik at pythonware.com Tue Nov 16 09:58:50 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 09:58:50 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com> Tim Peters <tim_one at email.msn.com> wrote: > (\b is supposed to be a word boundary assertion). in some places, that is. </F> Main Entry: reg?u?lar Pronunciation: 're-gy&-l&r, 're-g(&-)l&r 1 : belonging to a religious order 2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ... 3 a : ORDERLY, METHODICAL <regular habits> ... 4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ... From jack at oratrix.nl Tue Nov 16 12:05:55 1999 From: jack at oratrix.nl (Jack Jansen) Date: Tue, 16 Nov 1999 12:05:55 +0100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Message by "M.-A. Lemburg" <mal@lemburg.com> , Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com> Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> > I would propose to only add some very basic encodings to > the standard distribution, e.g. 
the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets (their equivalents of latin-1) too, as documents in these encoding are pretty ubiquitous. But maybe these should only be added on the respective platforms. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From mal at lemburg.com Tue Nov 16 09:35:28 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 09:35:28 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <000e01bf3006$c8c11fa0$042d153f@tim> Message-ID: <38311750.22D17EC1@lemburg.com> Tim Peters wrote: > > [Guido] > >> Does '\u0020' (no u prefix) have a meaning? > > [MAL] > > No, \uXXXX is only defined for u"" strings or strings that are > > used to build Unicode objects with this encoding: > > I believe your intent is that '\u0020' be exactly those 6 characters, just > as today. That is, it does have a meaning, but its meaning differs between > Unicode string literals and regular string literals. Right. > > Note that writing \uXX is an error, e.g. u"\u12 " will cause > > cause a syntax error. > > Although I believe your intent <wink> is that, just as today, '\u12' is not > an error. Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an exception. > > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10' > > but instead '\x10' -- is this intended ? > > Yes; see 2.4.1 ("String literals") of the Lang Ref. Blame the C committee > for not defining \x in a platform-independent way. Note that a Python \x > escape consumes *all* following hex characters, no matter how many -- and > ignores all but the last two. Strange definition... > > This [raw Unicode strings] can be had via unicode(): > > > > u = unicode(r'\a\b\c\u0020','unicode-escaped') > > > > If that's too long, define a ur() function which wraps up the > > above line in a function. > > As before, I think that's fine for now, but won't stand forever. If Guido agrees to ur"", I can put that into the proposal too -- it's just that things are starting to get a little crowded for a strawman proposal ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:50:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:50:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk> Message-ID: <383136F7.AB73A90@lemburg.com> Andy Robinson wrote: > > Leave JISXXX and the CJK stuff out. 
If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there > are lots of options about how to do it. The other ones are > algorithmic and can be small and fast and fit into the core. > > Ditto with HTML, and maybe even escaped-unicode too. So I can drop JIS ? [I won't be able to drop the escaped unicode codec because this is needed for u"" and ur"".] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:42:19 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:42:19 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf2ff4$d636bb20$042d153f@tim> Message-ID: <3831350B.8F69CB6D@lemburg.com> Tim Peters wrote: > > [MAL, on raw Unicode strings] > > ... > > Agreed... note that you could also write your own codec for just this > > reason and then use: > > > > u = unicode('....\u1234...\...\...','raw-unicode-escaped') > > > > Put that into a function called 'ur' and you have: > > > > u = ur('...\u4545...\...\...') > > > > which is not that far away from ur'...' w/r to cosmetics. > > Well, not quite. In general you need to pass raw strings: > > u = unicode(r'....\u1234...\...\...','raw-unicode-escaped') > ^ > u = ur(r'...\u4545...\...\...') > ^ > > else Python will replace all the other backslash sequences. This is a > crucial distinction at times; e.g., else \b in a Unicode regexp will expand > into a backspace character before the regexp processor ever sees it (\b is > supposed to be a word boundary assertion). Right. Here is a sample implementation of what I had in mind: """ Demo for 'unicode-escape' encoding. """ import struct,string,re pack_format = '>H' def convert_string(s): l = map(None,s) for i in range(len(l)): l[i] = struct.pack(pack_format,ord(l[i])) return l u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') def unicode_unescape(s): l = [] start = 0 while start < len(s): m = u_escape.search(s,start) if not m: l[len(l):] = convert_string(s[start:]) break m_start,m_end = m.span() if m_start > start: l[len(l):] = convert_string(s[start:m_start]) hexcode = m.group(1) #print hexcode,start,m_start if len(hexcode) != 4: raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode ordinal = string.atoi(hexcode,16) l.append(struct.pack(pack_format,ordinal)) start = m_end #print l return string.join(l,'') def hexstr(s,sep=''): return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:40:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:40:42 +0100 Subject: [Python-Dev] Unicode proposal: %-formatting ? References: <000001bf2ff4$d36e2540$042d153f@tim> Message-ID: <383134AA.4B49D178@lemburg.com> Tim Peters wrote: > > [MAL] > > I wonder how we could add %-formatting to Unicode strings without > > duplicating the PyString_Format() logic. > > > > First, do we need Unicode object %-formatting at all ? > > Sure -- in the end, all the world speaks Unicode natively and encodings > become historical baggage. 
Granted I won't live that long, but I may last > long enough to see encodings become almost purely an I/O hassle, with all > computation done in Unicode. > > > Second, here is an emulation using strings and <default encoding> > > that should give an idea of one could work with the different > > encodings: > > > > s = '%s %i abc???' # a Latin-1 encoded string > > t = (u,3) > > What's u? A Unicode object? Another Latin-1 string? A default-encoded > string? How does the following know the difference? u refers to a Unicode object in the proposal. Sorry, forgot to mention that. > > # Convert Latin-1 s to a <default encoding> string via Unicode > > s1 = unicode(s,'latin-1').encode() > > > > # The '%s' will now add u in <default encoding> > > s2 = s1 % t > > > > # Finally, convert the <default encoding> encoded string to Unicode > > u1 = unicode(s2) > > I don't expect this actually works: for example, change %s to %4s. > Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to > know that some (or all) characters in u consume multiple bytes, so can't > extract "the right" number of bytes from u. I think % formating has to know > the truth of what you're doing. Hmm, guess you're right... format parameters should indeed refer to characters rather than number of encoding bytes. This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects. > > Note that .encode() defaults to the current setting of > > <default encoding>. > > > > Provided u maps to Latin-1, an alternative would be: > > > > u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1') > > More interesting is fmt % tuple where everything is Unicode; people can muck > with Latin-1 directly today using regular strings, so the example above > mostly shows artificial convolution. ... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ? E.g. what would you get in these cases: u = u"%s %s" % (u"abc", "abc") Perhaps we need a new marker for "insert Unicode object here". -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 11:48:13 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 11:48:13 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk> Message-ID: <3831366D.8A09E194@lemburg.com> Andy Robinson wrote: > > On Mon, 15 Nov 1999 23:54:38 +0100, you wrote: > > >[I'll get back on this tomorrow, just some quick notes here...] > >The Codecs provide implementations for encoding and decoding, > >they are not intended as complete wrappers for e.g. files or > >sockets. > > > >The unicodec module will define a generic stream wrapper > >(which is yet to be defined) for dealing with files, sockets, > >etc. It will use the codec registry to do the actual codec > >work. > > > >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as > > short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which > > also assures that <mode> contains the 'b' character when needed. > > > >The Codec interface defines two pairs of methods > >on purpose: one which works internally (ie. 
directly between > >strings and Unicode objects), and one which works externally > >(directly between a stream and Unicode objects). > > That's the problem Guido and I are worried about. Your present API is > not enough to build stream encoders. The 'slurp it into a unicode > string in one go' approach fails for big files or for network > connections. And you just cannot build a generic stream reader/writer > by slicing it into strings. The solution must be specific to the > codec - only it knows how much to buffer, when to flip states etc. > > So the codec should provide proper stream reading and writing > services. I guess I'll have to rethink the Codec specs. Some leads: 1. introduce a new StreamCodec class which is designed for handling stream encoding and decoding (and supports state) 2. give more information to the unicodec registry: one could register classes instead of instances which the Unicode imlementation would then instantiate whenever it needs to apply the conversion; since this is only needed for encodings maintaining state, the registery would only have to do the instantiation for these codecs and could use cached instances for stateless codecs. > Unicodec can then wrap those up in labour-saving ways - I'm not fussy > which but I like the one-line file-open utility. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik at pythonware.com Tue Nov 16 12:38:31 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 12:38:31 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> > I would propose to only add some very basic encodings to > the standard distribution, e.g. the ones mentioned under > Standard Codecs in the proposal: > > 'utf-8': 8-bit variable length encoding > 'utf-16': 16-bit variable length encoding (litte/big endian) > 'utf-16-le': utf-16 but explicitly little endian > 'utf-16-be': utf-16 but explicitly big endian > 'ascii': 7-bit ASCII codepage > 'latin-1': Latin-1 codepage > 'html-entities': Latin-1 + HTML entities; > see htmlentitydefs.py from the standard Pythin Lib > 'jis' (a popular version XXX): > Japanese character encoding > 'unicode-escape': See Unicode Constructors for a definition > 'native': Dump of the Internal Format used by Python since this is already very close, maybe we could adopt the naming guidelines from XML: In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode/ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names. Note that these registered names are defined to be case-insensitive, so processors wishing to match against them should do so in a case-insensitive way. (ie "iso-8859-1" instead of "latin-1", etc -- at least as aliases...). 
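A registry lookup that follows those guidelines might normalise names and chase aliases along these lines (a sketch only -- the table contents and the function name are invented):

    import string

    # Map lower-cased, stripped names to one canonical spelling per encoding.
    _aliases = {
        'iso-8859-1': 'iso-8859-1',
        'latin-1':    'iso-8859-1',
        'utf-8':      'utf-8',
        'utf8':       'utf-8',
        'us-ascii':   'ascii',
        'ascii':      'ascii',
    }

    def normalize_encoding(name):
        "Return the canonical name for an encoding, matched case-insensitively."
        key = string.lower(string.strip(name))
        try:
            return _aliases[key]
        except KeyError:
            raise LookupError, 'unknown encoding: %s' % name

The registry proper then only ever sees canonical names, and supporting extra spellings is just a matter of growing the alias table.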
</F> From gstein at lyra.org Tue Nov 16 12:45:48 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911160344500.2535-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fredrik Lundh wrote: >... > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). +1 (as we'd say in Apache-land... :-) -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Tue Nov 16 13:04:47 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <3830595B.348E8CC7@lemburg.com> Message-ID: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > Guido van Rossum wrote: >... > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > designed to be passed cleanly through processing steps that handle > > single-byte character data, as long as they are 8-bit clean and don't > > do too much processing. > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > "8-bit clean" as you obviously did. Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8-bits long. i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters. >... > > For example, take an encryption engine. While it is defined in terms > > of byte streams, there's no requirement that the bytes represent > > characters -- they could be the bytes of a GIF file, an MP3 file, or a > > gzipped tar file. If we pass Unicode to an encryption engine, we want > > Unicode to come out at the other end, not UTF-8. (If we had wanted to > > encrypt UTF-8, we should have fed it UTF-8.) Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, following by the raw data. Done. It's even portable. >... > > Aha, I think there's a confusion about what "8-bit" means. For me, a > > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format). > > (As far as I know, C uses char* to represent multibyte characters.) > > Maybe we should disambiguate it more explicitly? We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. 
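To make the mismatch concrete, a tiny example using the proposed u"" literals (illustration only, not the actual "t#" code path):

    u = u"abc\u00e9"           # four characters
    s = u.encode("utf-8")      # five bytes: the last character needs two
    assert len(u) == 4 and len(s) == 5
    # any "t#" consumer that treats the byte length as a character
    # count is off by one here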
Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters. > There should be some definition for the two markers and the > ideas behind them in the API guide, I guess. Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ] > > > Hmm, I would strongly object to making "s#" return the internal > > > format. file.write() would then default to writing UTF-16 data > > > instead of UTF-8 data. This could result in strange errors > > > due to the UTF-16 format being endian dependent. > > > > But this was the whole design. file.write() needs to be changed to > > use s# when the file is open in binary mode and t# when the file is > > open in text mode. Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ] >... > I still don't feel very comfortable about the fact that all > existing APIs using "s#" will suddenly receive UTF-16 data if > being passed Unicode objects: this probably won't get us the > "magical" Unicode integration we invision, since "t#" usage is not > very wide spread and character handling code will probably not > work well with UTF-16 encoded strings. I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation. >... > > Because file.write() for a binary file, and other similar things > > (e.g. the encryption engine example I mentioned above) must have > > *some* way to get at the raw bits. > > What for ? How about: "because I'm the application developer, and I say that I want the raw bytes in the file." > Any lossless encoding should do the trick... UTF-8 > is just as good as UTF-16 for binary files; plus it's more compact > for ASCII data. I don't really see a need to get explicitly > at the internal data representation because both encodings are > in fact "internal" w/r to Unicode objects. > > The only argument I can come up with is that using UTF-16 for > binary files could (possibly) eliminate the UTF-8 conversion step > which is otherwise always needed. The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is: open('file', 'w').write(u) If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is: open('file', 'w').write(encode(u, 'utf-8')) This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-) Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic". 
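One possible shape for such a tool, sketched with invented names (this is not the proposed unicodec API, just an illustration of the explicit style being argued for here):

    class EncodedFile:
        # wraps a binary-mode file; the caller names the encoding exactly once
        def __init__(self, file, encoding):
            self.file = file
            self.encoding = encoding
        def write(self, u):
            self.file.write(u.encode(self.encoding))
        def close(self):
            self.file.close()

    # usage: the UTF-8 decision is visible right where the file is opened
    # f = EncodedFile(open("out.dat", "wb"), "utf-8")
    # f.write(u"abc\u00e9")
    # f.close()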
Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Tue Nov 16 13:09:17 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> On Mon, 15 Nov 1999, Guido van Rossum wrote: >... > > The problem with these large tables is that currently > > Python modules are not shared among processes since > > every process builds its own table. > > > > Static C data has the advantage of being shareable at > > the OS level. > > Don't worry about it. 128K is too small to care, I think... This is the reason Python starts up so slow and has a large memory footprint. There hasn't been any concern for moving stuff into shared data pages. As a result, a process must map in a bunch of vmem pages, for no other reason than to allocate Python structures in that memory and copy constants in. Go start Perl 100 times, then do the same with Python. Python is significantly slower. I've actually written a web app in PHP because another one that I did in Python had slow response time. [ yah: the Real Man Answer is to write a real/good mod_python. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Tue Nov 16 13:18:19 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > So I can drop JIS ? [I won't be able to drop the > escaped unicode > codec because this is needed for u"" and ur"".] Drop Japanese from the core language. JIS0208 is a big character set with three popular encodings (Shift-JIS, EUC-JP and JIS), and a host of slight variations; it has 6879 characters, and there are a range of options a user might need to set for it to be useful. So let's assume for now this a separate package. There's a good chance I'll do it but it is not a small job. If you start statically linking in tables of 7000 characters for one Asian language, you'll have to do the lot. As for the single-byte Latin ones, a prototype Python module could be whipped up in a couple of evenings, and a tiny C function which does single-byte to double-byte mappings and vice versa could make it fast. We can have an extensible, data driven solution in no time without having to build it into the core. The way I see it, to claim that python has i18n, a serious effort is needed to ensure every major encoding in the world is available to Python users. But that's separate to the core languages. Your spec should only cover what is going to be hard-coded into Python. I'd like to see one paragraph in your spec stating that our architecture seperates the encodings themselves from the core language changes, and that getting them sorted is a logically separate (but important) project. Ideally, we could put together a separate proposal for the encoding library itself and run it by some world class experts in that field, but after yours is done. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? 
Bid and sell for free at http://auctions.yahoo.com From guido at CNRI.Reston.VA.US Tue Nov 16 14:28:42 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:28:42 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100." <383134AA.4B49D178@lemburg.com> References: <000001bf2ff4$d36e2540$042d153f@tim> <383134AA.4B49D178@lemburg.com> Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us> > ... hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? > > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") From guido at CNRI.Reston.VA.US Tue Nov 16 14:45:17 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 08:45:17 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST." <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us> > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite some places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length because they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters. > > What for ? > > How about: "because I'm the application developer, and I say that I want > the raw bytes in the file." Here I'm with you, man! > Greg Stein, http://www.lyra.org/ --Guido van Rossum (home page: http://www.python.org/~guido/) From gward at cnri.reston.va.us Tue Nov 16 15:10:33 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 09:10:33 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org>; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> Message-ID: <19991116091032.A4063@cnri.reston.va.us> On 16 November 1999, Greg Stein said: > This is the reason Python starts up so slow and has a large memory > footprint. There hasn't been any concern for moving stuff into shared data > pages. As a result, a process must map in a bunch of vmem pages, for no > other reason than to allocate Python structures in that memory and copy > constants in. > > Go start Perl 100 times, then do the same with Python. 
Python is > significantly slower. I've actually written a web app in PHP because > another one that I did in Python had slow response time. > [ yah: the Real Man Answer is to write a real/good mod_python. ] I don't think this is the only factor in startup overhead. Try looking into the number of system calls for the trivial startup case of each interpreter: $ truss perl -e 1 2> perl.log $ truss python -c 1 2> python.log (This is on Solaris; I did the same thing on Linux with "strace", and on IRIX with "par -s -SS". Dunno about other Unices.) The results are interesting, and useful despite the platform and version disparities. (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, using the Official CNRI Python Build by Barry, and the ditto Perl build by me; the Linux system is starship, using whatever Perl and Python the Starship Masters provide us with; the IRIX box is an elderly but well-maintained SGI Challenge running IRIX 5.3.) Also, this is with an empty PYTHONPATH. The Solaris build of Python has different prefix and exec_prefix, but on the Linux and IRIX builds, they are the same. (I think this will reflect poorly on the Solaris version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect startup of the trivial "1" script, so I haven't paid attention to them. First, the size of log files (in lines), i.e. number of system calls: Solaris Linux IRIX[1] Perl 88 85 70 Python 425 316 257 [1] after chopping off the summary counts from the "par" output -- ie. these really are the number of system calls, not the number of lines in the log files Next, the number of "open" calls: Solaris Linux IRIX Perl 16 10 9 Python 107 71 48 (It looks as though *all* of the Perl 'open' calls are due to the dynamic linker going through /usr/lib and/or /lib.) And the number of unsuccessful "open" calls: Solaris Linux IRIX Perl 6 1 3 Python 77 49 32 Number of "mmap" calls: Solaris Linux IRIX Perl 25 25 1 Python 36 24 1 ...nope, guess we can't blame mmap for any Perl/Python startup disparity. How about "brk": Solaris Linux IRIX Perl 6 11 12 Python 47 39 25 ...ok, looks like Greg's gripe about memory holds some water. Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces the startup overhead as measured by "number of system calls". Some quick timing experiments show a drastic speedup (in wall-clock time) by adding "-S": about 37% faster under Solaris, 56% faster under Linux, and 35% under IRIX. These figures should be taken with a large grain of salt, as the Linux and IRIX systems were fairly well loaded at the time, and the wall-clock results I measured had huge variance. Still, it gets the point across. Oh, also for the record, all timings were done like: perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }' because I wanted to guarantee no shell was involved in the Python startup. Greg -- Greg Ward - software developer gward at cnri.reston.va.us Corporation for National Research Initiatives 1895 Preston White Drive voice: +1-703-620-8990 Reston, Virginia, USA 20191-5434 fax: +1-703-620-0913 From mal at lemburg.com Tue Nov 16 12:33:07 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 12:33:07 +0100 Subject: [Python-Dev] Some thoughts on the codecs... 
References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <383140F3.EDDB307A@lemburg.com> Jack Jansen wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these encoding are pretty > ubiquitous. But maybe these should only be added on the respective platforms. Good idea. What code pages would that be ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 15:13:25 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:13:25 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.6 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> Message-ID: <38316685.7977448D@lemburg.com> FYI, I've uploaded a new version of the proposal which incorporates many things we have discussed lately, e.g. the buffer interface, "s#" vs. "t#", etc. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? Unicode objects support for %-formatting ? specifying StreamCodecs -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 13:54:51 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 13:54:51 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com> Message-ID: <3831541B.B242FFA9@lemburg.com> Fredrik Lundh wrote: > > > I would propose to only add some very basic encodings to > > the standard distribution, e.g. 
the ones mentioned under > > Standard Codecs in the proposal: > > > > 'utf-8': 8-bit variable length encoding > > 'utf-16': 16-bit variable length encoding (litte/big endian) > > 'utf-16-le': utf-16 but explicitly little endian > > 'utf-16-be': utf-16 but explicitly big endian > > 'ascii': 7-bit ASCII codepage > > 'latin-1': Latin-1 codepage > > 'html-entities': Latin-1 + HTML entities; > > see htmlentitydefs.py from the standard Pythin Lib > > 'jis' (a popular version XXX): > > Japanese character encoding > > 'unicode-escape': See Unicode Constructors for a definition > > 'native': Dump of the Internal Format used by Python > > since this is already very close, maybe we could adopt > the naming guidelines from XML: > > In an encoding declaration, the values "UTF-8", "UTF-16", > "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used > for the various encodings and transformations of > Unicode/ISO/IEC 10646, the values "ISO-8859-1", > "ISO-8859-2", ... "ISO-8859-9" should be used for the parts > of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", > and "EUC-JP" should be used for the various encoded > forms of JIS X-0208-1997. > > XML processors may recognize other encodings; it is > recommended that character encodings registered > (as charsets) with the Internet Assigned Numbers > Authority [IANA], other than those just listed, > should be referred to using their registered names. > > Note that these registered names are defined to be > case-insensitive, so processors wishing to match > against them should do so in a case-insensitive way. > > (ie "iso-8859-1" instead of "latin-1", etc -- at least as > aliases...). >From the proposal: """ General Remarks: ---------------- ? Unicode encoding names should be lower case on output and case-insensitive on input (they will be converted to lower case by all APIs taking an encoding name as input). Encoding names should follow the name conventions as used by the Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is written as 'utf-16'. """ Is there a naming scheme definition for these encoding names? (The quote you gave above doesn't really sound like a definition to me.) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 14:15:19 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:15:19 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com> Message-ID: <383158E7.BC574A1F@lemburg.com> Andy Robinson wrote: > > --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > > So I can drop JIS ? [I won't be able to drop the > > escaped unicode > > codec because this is needed for u"" and ur"".] > > Drop Japanese from the core language. Done ... that one was easy ;-) > JIS0208 is a big character set with three popular > encodings (Shift-JIS, EUC-JP and JIS), and a host of > slight variations; it has 6879 characters, and there > are a range of options a user might need to set for it > to be useful. So let's assume for now this a separate > package. There's a good chance I'll do it but it is > not a small job. If you start statically linking in > tables of 7000 characters for one Asian language, > you'll have to do the lot. 
> > As for the single-byte Latin ones, a prototype Python > module could be whipped up in a couple of evenings, > and a tiny C function which does single-byte to > double-byte mappings and vice versa could make it > fast. We can have an extensible, data driven solution > in no time without having to build it into the core. Perhaps these helper function could be intergrated into the core to avoid compilation when adding a new codec. > The way I see it, to claim that python has i18n, a > serious effort is needed to ensure every major > encoding in the world is available to Python users. > But that's separate to the core languages. Your spec > should only cover what is going to be hard-coded into > Python. Right. > I'd like to see one paragraph in your spec stating > that our architecture seperates the encodings > themselves from the core language changes, and that > getting them sorted is a logically separate (but > important) project. Ideally, we could put together a > separate proposal for the encoding library itself and > run it by some world class experts in that field, but > after yours is done. I've added: All other encoding such as the CJK ones to support Asian scripts should be implemented in seperate packages which do not get included in the core Python distribution and are not a part of this proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 14:06:39 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 14:06:39 +0100 Subject: [Python-Dev] just say no... References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <383156DF.2209053F@lemburg.com> Greg Stein wrote: > > On Mon, 15 Nov 1999, M.-A. Lemburg wrote: > > Guido van Rossum wrote: > >... > > > t# refers to byte-encoded data. Multibyte encodings are explicitly > > > designed to be passed cleanly through processing steps that handle > > > single-byte character data, as long as they are 8-bit clean and don't > > > do too much processing. > > > > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not > > "8-bit clean" as you obviously did. > > Hrm. That might be dangerous. Many of the functions that use "t#" assume > that each character is 8-bits long. i.e. the returned length == the number > of characters. > > I'm not sure what the implications would be if you interpret the semantics > of "t#" as multi-byte characters. FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From akuchlin at mems-exchange.org Tue Nov 16 15:35:39 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us> Greg Ward writes: >Next, the number of "open" calls: > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 Running 'python -v' explains this: amarok akuchlin>python -v # /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc # /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py import site # precompiled from /usr/local/lib/python1.5/site.pyc # /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py import os # precompiled from /usr/local/lib/python1.5/os.pyc import posix # builtin # /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc # /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py import stat # precompiled from /usr/local/lib/python1.5/stat.pyc # /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc Python 1.5.2 (#80, May 25 1999, 18:06:07) [GCC 2.8.1] on sunos5 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so And each import tries several different forms of the module name: stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4 I don't see how this is fixable, unless we strip down site.py, which drags in os, which drags in os.path and stat and UserDict. -- A.M. Kuchling http://starship.python.net/crew/amk/ I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but otherwise I'm just peachy. -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks" From guido at CNRI.Reston.VA.US Tue Nov 16 15:43:07 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 09:43:07 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100." <383156DF.2209053F@lemburg.com> References: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> <383156DF.2209053F@lemburg.com> Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us> > FYI, the next version of the proposal now says "s#" gives you > UTF-16 and "t#" returns UTF-8. File objects opened in text mode > will use "t#" and binary ones use "s#". Good. > I'll just use explicit u.encode('utf-8') calls if I want to write > UTF-8 to binary files -- perhaps everyone else should too ;-) You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just to LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Tue Nov 16 15:50:55 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST) Subject: [Python-Dev] just say no... 
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us> Tim Peters writes: > Yet another use for a weak reference <0.5 wink>. Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Tue Nov 16 16:12:43 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> References: <3830595B.348E8CC7@lemburg.com> <Pine.LNX.4.10.9911160348480.2535-100000@nebula.lyra.org> Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us> Greg Stein writes: > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ] And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Tue Nov 16 15:36:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 15:36:54 +0100 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> Message-ID: <38316C06.8B0E1D7B@lemburg.com> Greg Ward wrote: > > > Go start Perl 100 times, then do the same with Python. Python is > > significantly slower. I've actually written a web app in PHP because > > another one that I did in Python had slow response time. > > [ yah: the Real Man Answer is to write a real/good mod_python. ] > > I don't think this is the only factor in startup overhead. Try looking > into the number of system calls for the trivial startup case of each > interpreter: > > $ truss perl -e 1 2> perl.log > $ truss python -c 1 2> python.log > > (This is on Solaris; I did the same thing on Linux with "strace", and on > IRIX with "par -s -SS". Dunno about other Unices.) The results are > interesting, and useful despite the platform and version disparities. > > (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on > Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX. The Solaris is 2.6, > using the Official CNRI Python Build by Barry, and the ditto Perl build > by me; the Linux system is starship, using whatever Perl and Python the > Starship Masters provide us with; the IRIX box is an elderly but > well-maintained SGI Challenge running IRIX 5.3.) > > Also, this is with an empty PYTHONPATH. The Solaris build of Python has > different prefix and exec_prefix, but on the Linux and IRIX builds, they > are the same. 
(I think this will reflect poorly on the Solaris > version.) PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect > startup of the trivial "1" script, so I haven't paid attention to them. For kicks I've done a similar test with cgipython, the one file version of Python 1.5.2: > First, the size of log files (in lines), i.e. number of system calls: > > Solaris Linux IRIX[1] > Perl 88 85 70 > Python 425 316 257 cgipython 182 > [1] after chopping off the summary counts from the "par" output -- ie. > these really are the number of system calls, not the number of > lines in the log files > > Next, the number of "open" calls: > > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 cgipython 33 > (It looks as though *all* of the Perl 'open' calls are due to the > dynamic linker going through /usr/lib and/or /lib.) > > And the number of unsuccessful "open" calls: > > Solaris Linux IRIX > Perl 6 1 3 > Python 77 49 32 cgipython 28 Note that cgipython does search for sitecutomize.py. > > Number of "mmap" calls: > > Solaris Linux IRIX > Perl 25 25 1 > Python 36 24 1 cgipython 13 > > ...nope, guess we can't blame mmap for any Perl/Python startup > disparity. > > How about "brk": > > Solaris Linux IRIX > Perl 6 11 12 > Python 47 39 25 cgipython 41 (?) So at least in theory, using cgipython for the intended purpose should gain some performance. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 17:00:58 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:00:58 +0100 Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <38317FBA.4F3D6B1F@lemburg.com> Here is a new proposal for the codec interface: class Codec: def encode(self,u,slice=None): """ Return the Unicode object u encoded as Python string. If slice is given (as slice object), only the sliced part of the Unicode object is encoded. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... def decode(self,s,slice=None): """ Return an equivalent Unicode object for the encoded Python string s. If slice is given (as slice object), only the sliced part of the Python string is decoded and returned as Unicode object. Note that this can cause the decoding algorithm to fail due to truncations in the encoding. The method may not store state in the Codec instance. Use SteamCodec for codecs which have to keep state in order to make encoding/decoding efficient. """ ... class StreamCodec(Codec): def __init__(self,stream=None,errors='strict'): """ Creates a StreamCodec instance. stream must be a file-like object open for reading and/or writing binary data depending on the intended codec action or None. The StreamCodec may implement different error handling schemes by providing the errors argument. These parameters are known (they need not all be supported by StreamCodec subclasses): 'strict' - raise an UnicodeError (or a subclass) 'ignore' - ignore the character and continue with the next (a single character) - replace errorneous characters with the given character (may also be a Unicode character) """ self.stream = stream def write(self,u,slice=None): """ Writes the Unicode object's contents encoded to self.stream. stream must be a file-like object open for writing binary data. 
If slice is given (as slice object), only the sliced part of the Unicode object is written. """ ... the base class should provide a default implementation of this method using self.encode ... def read(self,length=None): """ Reads an encoded string from the stream and returns an equivalent Unicode object. If length is given, only length Unicode characters are returned (the StreamCodec instance reads as many raw bytes as needed to fulfill this requirement). Otherwise, all available data is read and decoded. """ ... the base class should provide a default implementation of this method using self.decode ... It is not required by the unicodec.register() API to provide a subclass of these base class, only the given methods must be present; this allows writing Codecs as extensions types. All Codecs must provide the .encode()/.decode() methods. Codecs having the .read() and/or .write() methods are considered to be StreamCodecs. The Unicode implementation will by itself only use the stateless .encode() and .decode() methods. All other conversion have to be done by explicitly instantiating the appropriate [Stream]Codec. -- Feel free to beat on this one ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Tue Nov 16 17:08:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 17:08:49 +0100 Subject: [Python-Dev] just say no... References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> Message-ID: <38318191.11D93903@lemburg.com> "Fred L. Drake, Jr." wrote: > > Tim Peters writes: > > Yet another use for a weak reference <0.5 wink>. > > Those just keep popping up! I seem to recall Diane Hackborne > actually implemented these under the name "vref" long ago; perhaps > that's worth revisiting after all? (Not the implementation so much as > the idea.) I think to make it general would cost one PyObject* in > each object's structure, and some code in some constructors (maybe), > and all destructors, but not much. > Is this worth pursuing, or is it locked out of the core because of > the added space for the PyObject*? (Note that the concept isn't > necessarily useful for all object types -- numbers in particular -- > but it only makes sense to bother if it works for everything, even if > it's not very useful in some cases.) FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Tue Nov 16 17:14:06 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST) Subject: [Python-Dev] just say no... In-Reply-To: <38318191.11D93903@lemburg.com> References: <14380.16437.71847.832880@weyr.cnri.reston.va.us> <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us> <38318191.11D93903@lemburg.com> Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > FYI, there's mxProxy which implements a flavor of them. Look > in the standard places for mx stuff ;-) Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). 
I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From bwarsaw at cnri.reston.va.us Tue Nov 16 17:14:55 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST) Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us> >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: AMK> I don't see how this is fixable, unless we strip down AMK> site.py, which drags in os, which drags in os.path and stat AMK> and UserDict. One approach might be to support loading modules out of jar files (or whatever) using Greg imputils. We could put the bootstrap .pyc files in this jar and teach Python to import from it first. Python installations could even craft their own modules.jar file to include whatever modules they are willing to "hard code". This, with -S might make Python start up much faster, at the small cost of some flexibility (which could be regained with a c.l. switch or other mechanism to bypass modules.jar). -Barry From guido at CNRI.Reston.VA.US Tue Nov 16 17:20:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:20:28 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100." <38317FBA.4F3D6B1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us> > It is not required by the unicodec.register() API to provide a > subclass of these base class, only the given methods must be present; > this allows writing Codecs as extensions types. All Codecs must > provide the .encode()/.decode() methods. Codecs having the .read() > and/or .write() methods are considered to be StreamCodecs. > > The Unicode implementation will by itself only use the > stateless .encode() and .decode() methods. > > All other conversion have to be done by explicitly instantiating > the appropriate [Stream]Codec. Looks okay, although I'd like someone to implement a simple shift-state-based stream codec to check this out further. I have some questions about the constructor. You seem to imply that instantiating the class without arguments creates a codec without state. That's fine. When given a stream argument, shouldn't the direction of the stream be given as an additional argument, so the proper state for encoding or decoding can be set up? I can see that for an implementation it might be more convenient to have separate classes for encoders and decoders -- certainly the state being kept is very different. Also, I don't want to ignore the alternative interface that was suggested by /F. It uses feed() similar to htmllib c.s. 
This has some advantages (although we might want to define some compatibility so it can also feed directly into a file). Perhaps someone should go ahead and implement prototype codecs using either paradigm and then write some simple apps, so we can make a better decision. In any case I think the specs codec registry API aren't on the critical path, integration of /F's basic unicode object is the first thing we need. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Tue Nov 16 17:27:53 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 11:27:53 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST." <14385.33535.23316.286575@anthem.cnri.reston.va.us> References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us> > >>>>> "AMK" == Andrew M Kuchling <akuchlin at mems-exchange.org> writes: > > AMK> I don't see how this is fixable, unless we strip down > AMK> site.py, which drags in os, which drags in os.path and stat > AMK> and UserDict. > > One approach might be to support loading modules out of jar files (or > whatever) using Greg imputils. We could put the bootstrap .pyc files > in this jar and teach Python to import from it first. Python > installations could even craft their own modules.jar file to include > whatever modules they are willing to "hard code". This, with -S might > make Python start up much faster, at the small cost of some > flexibility (which could be regained with a c.l. switch or other > mechanism to bypass modules.jar). A completely different approach (which, incidentally, HP has lobbied for before; and which has been implemented by Sjoerd Mullender for one particular application) would be to cache a mapping from module names to filenames in a dbm file. For Sjoerd's app (which imported hundreds of modules) this made a huge difference. The problem is that it's hard to deal with issues like updating the cache while sharing it with other processes and even other users... But if those can be solved, this could greatly reduce the number of stats and unsuccessful opens, without having to resort to jar files. --Guido van Rossum (home page: http://www.python.org/~guido/) From gmcm at hypernet.com Tue Nov 16 17:56:19 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 11:56:19 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us> Message-ID: <1269351119-9152905@hypernet.com> Barry A. Warsaw writes: > One approach might be to support loading modules out of jar files > (or whatever) using Greg imputils. We could put the bootstrap > .pyc files in this jar and teach Python to import from it first. > Python installations could even craft their own modules.jar file > to include whatever modules they are willing to "hard code". > This, with -S might make Python start up much faster, at the > small cost of some flexibility (which could be regained with a > c.l. switch or other mechanism to bypass modules.jar). Couple hundred Windows users have been doing this for months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" app would have to be redone for *nix, (and all the embedding really does is keep Python from hunting all over your disk). Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a diskette with a little room left over. but-since-its-WIndows-it-must-be-tainted-ly y'rs - Gordon From guido at CNRI.Reston.VA.US Tue Nov 16 18:00:15 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 12:00:15 -0500 Subject: [Python-Dev] Python 1.6 status Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us> Greg Stein recently reminded me that he was holding off on 1.6 patches because he was under the impression that I wasn't accepting them yet. The situation is rather more complicated than that. There are a great deal of things that need to be done, and for many of them I'd be most happy to receive patches! For other things, however, I'm still in the requirements analysis phase, and patches might be premature (e.g., I want to redesign the import mechanisms, and while I like some of the prototypes that have been posted, I'm not ready to commit to any specific implementation). How do you know for which things I'm ready for patches? Ask me. I've tried to make lists before, and there are probably some hints in the TODO FAQ wizard as well as in the "requests" section of the Python Bugs List. Greg also suggested that I might receive more patches if I opened up the CVS tree for checkins by certain valued contributors. On the one hand I'm reluctant to do that (I feel I have a pretty good track record of checking in patches that are mailed to me, assuming I agree with them) but on the other hand there might be something to say for this, because it gives contributors more of a sense of belonging to the inner core. Of course, checkin privileges don't mean you can check in anything you like -- as in the Apache world, changes must be discussed and approved by the group, and I would like to have a veto. However once a change is approved, it's much easier if the contributor can check the code in without having to go through me all the time. A drawback may be that some people will make very forceful requests to be given checkin privileges, only to never use them; just like there are some members of python-dev who have never contributed. I definitely want to limit the number of privileged contributors to a very small number (e.g. 10-15). One additional detail is the legal side -- contributors will have to sign some kind of legal document similar to the current (wetsign.html) release form, but guiding all future contributions. I'll have to discuss this with CNRI's legal team. Greg, I understand you have checkin privileges for Apache. What is the procedure there for handing out those privileges? What is the procedure for using them? (E.g. if you made a bogus change to part of Apache you're not supposed to work on, what happens?) 
I'm hoping for several kind of responses to this email: - uncontroversial patches - questions about whether specific issues are sufficiently settled to start coding a patch - discussion threads opening up some issues that haven't been settled yet (like the current, very productive, thread in i18n) - posts summarizing issues that were settled long ago in the past, requesting reverification that the issue is still settled - suggestions for new issues that maybe ought to be settled in 1.6 - requests for checkin privileges, preferably with a specific issue or area of expertise for which the requestor will take responsibility --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin at mems-exchange.org Tue Nov 16 18:11:48 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST) Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us> Guido van Rossum writes: >I'm hoping for several kind of responses to this email: My list of things to do for 1.6 is: * Translate re.py to C and switch to the latest PCRE 2 codebase (mostly done, perhaps ready for public review in a week or so). * Go through the O'Reilly POSIX book and draw up a list of missing POSIX functions that aren't available in the posix module. This was sparked by Greg Ward showing me a Perl daemonize() function he'd written, and I realized that some of the functions it used weren't available in Python at all. (setsid() was one of them, I think.) * A while back I got approval to add the mmapfile module to the core. The outstanding issue there is that the constructor has a different interface on Unix and Windows platforms. On Windows: mm = mmapfile.mmapfile("filename", "tag name", <mapsize>) On Unix, it looks like the mmap() function: mm = mmapfile.mmapfile(<filedesc>, <mapsize>, <flags> (like MAP_SHARED), <prot> (like PROT_READ, PROT_READWRITE) ) Can we reconcile these interfaces, have two different function names, or what? >- suggestions for new issues that maybe ought to be settled in 1.6 Perhaps we should figure out what new capabilities, if any, should be added in 1.6. Fred has mentioned weak references, and there are other possibilities such as ExtensionClass. -- A.M. Kuchling http://starship.python.net/crew/amk/ Society, my dear, is like salt water, good to swim in but hard to swallow. -- Arthur Stringer, _The Silver Poppy_ From beazley at cs.uchicago.edu Tue Nov 16 18:24:24 1999 From: beazley at cs.uchicago.edu (David Beazley) Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST) Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu> Andrew M. Kuchling writes: > Guido van Rossum writes: > >I'm hoping for several kind of responses to this email: > > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) > I second this! This was one of the things I noticed when doing the Essential Reference Book. 
Assuming no one has done it already, I wouldn't mind volunteering to take a crack at it. Cheers, Dave From fdrake at acm.org Tue Nov 16 18:25:02 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us> Guido van Rossum writes: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). I think one or the other can be used, and then a wrapper that converts to the other interface. Perhaps the encoders should provide feed(), and a file-like wrapper can convert write() to feed(). It could also be done the other way; I'm not sure if it matters which is "normal." (Or perhaps feed() was badly named and should be write()? The general intent was a little different, I think, but an output file is very much a stream consumer.) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From akuchlin at mems-exchange.org Tue Nov 16 18:32:41 1999 From: akuchlin at mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST) Subject: [Python-Dev] mmapfile module In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us> References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us> Guido van Rossum writes: >Hm, this seems to require a higher-level Python module to hide the >differences. Maybe the Unix version could also use a filename? I >would think that mmap'ed files should always be backed by a file (not >by a pipe, socket etc.). Or is there an issue with secure creation of >temp files? This is a question for a separate thread. Hmm... I don't know of any way to use mmap() on non-file things, either; there are odd special cases, like using MAP_ANONYMOUS on /dev/zero to allocate memory, but that's still using a file. On the other hand, there may be some special case where you need to do that. We could add a fileno() method to get the file descriptor, but I don't know if that's useful to Windows. (Is Sam Rushing, the original author of the Win32 mmapfile, on this list?) What do we do about the tagname, which is a Win32 argument that has no Unix counterpart -- I'm not even sure what its function is. -- A.M. Kuchling http://starship.python.net/crew/amk/ I had it in me to be the Pierce Brosnan of my generation. -- Vincent Me's past career plans in EGYPT #1 From mal at lemburg.com Tue Nov 16 18:53:46 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 16 Nov 1999 18:53:46 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <38319A2A.4385D2E7@lemburg.com> Guido van Rossum wrote: > > > It is not required by the unicodec.register() API to provide a > > subclass of these base class, only the given methods must be present; > > this allows writing Codecs as extensions types. All Codecs must > > provide the .encode()/.decode() methods. 
Codecs having the .read() > > and/or .write() methods are considered to be StreamCodecs. > > > > The Unicode implementation will by itself only use the > > stateless .encode() and .decode() methods. > > > > All other conversion have to be done by explicitly instantiating > > the appropriate [Stream]Codec. > > Looks okay, although I'd like someone to implement a simple > shift-state-based stream codec to check this out further. > > I have some questions about the constructor. You seem to imply > that instantiating the class without arguments creates a codec without > state. That's fine. When given a stream argument, shouldn't the > direction of the stream be given as an additional argument, so the > proper state for encoding or decoding can be set up? I can see that > for an implementation it might be more convenient to have separate > classes for encoders and decoders -- certainly the state being kept is > very different. Wouldn't it be possible to have the read/write methods set up the state when called for the first time ? Note that I wrote ".read() and/or .write() methods" in the proposal on purpose: you can of course implement Codecs which only implement one of them, i.e. Readers and Writers. The registry doesn't care about them anyway :-) Then, if you use a Reader for writing, it will result in an AttributeError... > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some compatibility > so it can also feed directly into a file). AFAIK, .feed() and .finalize() (or .close() etc.) have a different backgound: you add data in chunks and then process it at some final stage rather than for each feed. This is often more efficient. With respest to codecs this would mean, that you buffer the output in memory, first doing only preliminary operations on the feeds and then apply some final logic to the buffer at the time .finalize() is called. We could define a StreamCodec subclass for this kind of operation. > Perhaps someone should go ahead and implement prototype codecs using > either paradigm and then write some simple apps, so we can make a > better decision. > > In any case I think the specs codec registry API aren't on the > critical path, integration of /F's basic unicode object is the first > thing we need. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward at cnri.reston.va.us Tue Nov 16 18:54:06 1999 From: gward at cnri.reston.va.us (Greg Ward) Date: Tue, 16 Nov 1999 12:54:06 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) 
In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> Message-ID: <19991116125405.B4063@cnri.reston.va.us> On 16 November 1999, Guido van Rossum said: > A completely different approach (which, incidentally, HP has lobbied > for before; and which has been implemented by Sjoerd Mullender for one > particular application) would be to cache a mapping from module names > to filenames in a dbm file. For Sjoerd's app (which imported hundreds > of modules) this made a huge difference. Hey, this could be a big win for Zope startup. Dunno how much of that 20-30 sec startup overhead is due to loading modules, but I'm sure it's a sizeable percentage. Any Zope-heads listening? > The problem is that it's > hard to deal with issues like updating the cache while sharing it with > other processes and even other users... Probably not a concern in the case of Zope: one installation, one process, only gets started when it's explicitly shut down and restarted. HmmmMMMMmmm... Greg From petrilli at amber.org Tue Nov 16 19:04:46 1999 From: petrilli at amber.org (Christopher Petrilli) Date: Tue, 16 Nov 1999 13:04:46 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500 References: <199911152137.QAA28280@eric.cnri.reston.va.us> <Pine.LNX.4.10.9911160406070.2535-100000@nebula.lyra.org> <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us> Message-ID: <19991116130446.A3068@trump.amber.org> Greg Ward [gward at cnri.reston.va.us] wrote: > On 16 November 1999, Guido van Rossum said: > > A completely different approach (which, incidentally, HP has lobbied > > for before; and which has been implemented by Sjoerd Mullender for one > > particular application) would be to cache a mapping from module names > > to filenames in a dbm file. For Sjoerd's app (which imported hundreds > > of modules) this made a huge difference. > > Hey, this could be a big win for Zope startup. Dunno how much of that > 20-30 sec startup overhead is due to loading modules, but I'm sure it's > a sizeable percentage. Any Zope-heads listening? Wow, that's a huge start up that I've personally never seen. I can't imagine... even loading the Oracle libraries dynamically, which are HUGE (2Mb or so), it's only a couple seconds. > > The problem is that it's > > hard to deal with issues like updating the cache while sharing it with > > other processes and even other users... > > Probably not a concern in the case of Zope: one installation, one > process, only gets started when it's explicitly shut down and > restarted. HmmmMMMMmmm... This doesn't reslve a lot of other users of Python howver... and Zope would always benefit, especially when you're running multiple instances on th same machine... would perhaps share more code. 
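To make the idea concrete, a rough sketch of what such a cache might look like -- plain anydbm, module name mapped to file location, and completely ignoring the update and locking problems Guido mentions:

    import anydbm, os

    def build_cache(path, cache_file):
        # scan the path once; first hit wins, same as the normal import rules
        found = {}
        for dir in path:
            try:
                names = os.listdir(dir)
            except os.error:
                continue
            for name in names:
                if name[-3:] == '.py':
                    mod = name[:-3]
                    if not found.has_key(mod):
                        found[mod] = os.path.join(dir, name)
        db = anydbm.open(cache_file, 'n')   # 'n': always start fresh
        for mod in found.keys():
            db[mod] = found[mod]
        db.close()

    def lookup(mod, cache_file):
        # consult the cache instead of stat()ing every directory
        db = anydbm.open(cache_file, 'r')
        try:
            return db[mod]
        except KeyError:
            return None     # unknown module: fall back on the usual search

An importer hook (imputils style) would call lookup() and only fall back on the full directory scan when it comes up empty.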
Chris -- | Christopher Petrilli | petrilli at amber.org From gmcm at hypernet.com Tue Nov 16 19:04:41 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Tue, 16 Nov 1999 13:04:41 -0500 Subject: [Python-Dev] mmapfile module In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us> References: <199911161720.MAA02764@eric.cnri.reston.va.us> Message-ID: <1269347016-9399681@hypernet.com> Andrew M. Kuchling wrote: > Hmm... I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. On > the other hand, there may be some special case where you need to > do that. We could add a fileno() method to get the file > descriptor, but I don't know if that's useful to Windows. (Is > Sam Rushing, the original author of the Win32 mmapfile, on this > list?) > > What do we do about the tagname, which is a Win32 argument that > has no Unix counterpart -- I'm not even sure what its function > is. On Windows, a mmap is always backed by disk (swap space), but is not necessarily associated with a (user-land) file. The tagname is like the "name" associated with a semaphore; two processes opening the same tagname get shared memory. Fileno (in the c runtime sense) would be useless on Windows. As with all Win32 resources, there's a "handle", which is analagous. But different enough, it seems to me, to confound any attempts at a common API. Another fundamental difference (IIRC) is that Windows mmap's can be resized on the fly. - Gordon From guido at CNRI.Reston.VA.US Tue Nov 16 19:09:43 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Tue, 16 Nov 1999 13:09:43 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100." <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us> > > I have some questions about the constructor. You seem to imply > > that instantiating the class without arguments creates a codec without > > state. That's fine. When given a stream argument, shouldn't the > > direction of the stream be given as an additional argument, so the > > proper state for encoding or decoding can be set up? I can see that > > for an implementation it might be more convenient to have separate > > classes for encoders and decoders -- certainly the state being kept is > > very different. > > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? Hm, I'd rather be explicit. We don't do this for files either. > Note that I wrote ".read() and/or .write() methods" in the proposal > on purpose: you can of course implement Codecs which only implement > one of them, i.e. Readers and Writers. The registry doesn't care > about them anyway :-) > > Then, if you use a Reader for writing, it will result in an > AttributeError... > > > Also, I don't want to ignore the alternative interface that was > > suggested by /F. It uses feed() similar to htmllib c.s. This has > > some advantages (although we might want to define some compatibility > > so it can also feed directly into a file). > > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more > efficient. 
> > With respest to codecs this would mean, that you buffer the > output in memory, first doing only preliminary operations on > the feeds and then apply some final logic to the buffer at > the time .finalize() is called. This is part of the purpose, yes. > We could define a StreamCodec subclass for this kind of operation. The difference is that to decode from a file, your proposed interface is to call read() on the codec which will in turn call read() on the stream. In /F's version, I call read() on the stream (geting multibyte encoded data), feed() that to the codec, which in turn calls feed() to some other back end -- perhaps another codec which in turn feed()s its converted data to another file, perhaps an XML parser. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Tue Nov 16 19:16:42 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38319A2A.4385D2E7@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > Wouldn't it be possible to have the read/write methods set up > the state when called for the first time ? That slows the down; the constructor should handle initialization. Perhaps what gets registered should be: encoding function, decoding function, stream encoder factory (can be a class), stream decoder factory (again, can be a class). These can be encapsulated either before or after hitting the registry, and can be None. The registry and provide default implementations from what is provided (stream handlers from the functions, or functions from the stream handlers) as required. Ideally, I should be able to write a module with four well-known entry points and then provide the module object itself as the registration entry. Or I could construct a new object that has the right interface and register that if it made more sense for the encoding. > AFAIK, .feed() and .finalize() (or .close() etc.) have a different > backgound: you add data in chunks and then process it at some > final stage rather than for each feed. This is often more Many of the classes that provide feed() do as much work as possible as data is fed into them (see htmllib.HTMLParser); this structure is commonly used to support asynchonous operation. > With respest to codecs this would mean, that you buffer the > output in memory, first doing only preliminary operations on > the feeds and then apply some final logic to the buffer at > the time .finalize() is called. That depends on the encoding. I'd expect it to feed encoded data to a sink as quickly as it could and let the target decide what needs to happen. If buffering is needed, the target could be a StringIO or whatever. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fredrik at pythonware.com Tue Nov 16 20:32:21 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:32:21 +0100 Subject: [Python-Dev] mmapfile module References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us> Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com> > Hmm... 
I don't know of any way to use mmap() on non-file things, > either; there are odd special cases, like using MAP_ANONYMOUS on > /dev/zero to allocate memory, but that's still using a file. but that's not always the case -- OSF/1 supports truly anonymous mappings, for example. in fact, it bombs if you use ANONYMOUS with a file handle: $ man mmap ... If MAP_ANONYMOUS is set in the flags parameter: + A new memory region is created and initialized to all zeros. This memory region can be shared only with descendents of the current pro- cess. + If the filedes parameter is not -1, the mmap() function fails. ... (btw, doing anonymous maps isn't exactly an odd special case under this operating system; it's the only memory- allocation mechanism provided by the kernel...) </F> From fredrik at pythonware.com Tue Nov 16 20:33:52 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:33:52 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > Also, I don't want to ignore the alternative interface that was > suggested by /F. It uses feed() similar to htmllib c.s. This has > some advantages (although we might want to define some > compatibility so it can also feed directly into a file). seeing this made me switch on my brain for a moment, and recall how things are done in PIL (which is, as I've bragged about before, another library with an internal format, and many possible external encodings). among other things, PIL lets you read and write images to both ordinary files and arbitrary file objects, but it also lets you incrementally decode images by feeding it chunks of data (through ImageFile.Parser). and it's fast -- it has to be, since images tends to contain lots of pixels... anyway, here's what I came up with (code will follow, if someone's interested). -------------------------------------------------------------------- A PIL-like Unicode Codec Proposal -------------------------------------------------------------------- In the PIL model, the codecs are called with a piece of data, and returns the result to the caller. The codecs maintain internal state when needed. class decoder: def decode(self, s, offset=0): # decode as much data as we possibly can from the # given string. if there's not enough data in the # input string to form a full character, return # what we've got this far (this might be an empty # string). def flush(self): # flush the decoding buffers. this should usually # return None, unless the fact that knowing that the # input stream has ended means that the state can be # interpreted in a meaningful way. however, if the # state indicates that there last character was not # finished, this method should raise a UnicodeError # exception. class encoder: def encode(self, u, offset=0, buffersize=0): # encode data from the given offset in the input # unicode string into a buffer of the given size # (or slightly larger, if required to proceed). # if the buffer size is 0, the decoder is free # to pick a suitable size itself (if at all # possible, it should make it large enough to # encode the entire input string). returns a # 2-tuple containing the encoded data, and the # number of characters consumed by this call. def flush(self): # flush the encoding buffers. returns an ordinary # string (which may be empty), or None. 
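A rough sketch of how an output stream might be layered on top of such an encoder, assuming only the interface above (buffer sizing and error handling left out):

    class StreamWriter:
        def __init__(self, encoder, stream):
            self.encoder = encoder      # a fresh encoder instance
            self.stream = stream        # anything with write()/close()
        def write(self, u):
            # push the unicode string through the encoder in chunks
            offset = 0
            while offset < len(u):
                data, n = self.encoder.encode(u, offset)
                self.stream.write(data)
                offset = offset + n
        def close(self):
            data = self.encoder.flush()
            if data:
                self.stream.write(data)
            self.stream.close()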
Note that a codec instance can be used for a single string; the codec registry should hold codec factories, not codec instances. In addition, you may use a single type or class to implement both interfaces at once. -------------------------------------------------------------------- Use Cases -------------------------------------------------------------------- A null decoder: class decoder: def decode(self, s, offset=0): return s[offset:] def flush(self): pass A null encoder: class encoder: def encode(self, s, offset=0, buffersize=0): if buffersize: s = s[offset:offset+buffersize] else: s = s[offset:] return s, len(s) def flush(self): pass Decoding a string: def decode(s, encoding) c = registry.getdecoder(encoding) u = c.decode(s) t = c.flush() if not t: return u return u + t # not very common Encoding a string: def encode(u, encoding) c = registry.getencoder(encoding) p = [] o = 0 while o < len(u): s, n = c.encode(u, o) p.append(s) o = o + n if len(p) == 1: return p[0] return string.join(p, "") # not very common Implementing stream codecs is left as an exercise (see the zlib material in the eff-bot guide for a decoder example). --- end of proposal From fredrik at pythonware.com Tue Nov 16 20:37:40 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 16 Nov 1999 20:37:40 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us> Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com> > * Go through the O'Reilly POSIX book and draw up a list of missing > POSIX functions that aren't available in the posix module. This > was sparked by Greg Ward showing me a Perl daemonize() function > he'd written, and I realized that some of the functions it used > weren't available in Python at all. (setsid() was one of them, I > think.) $ python Python 1.5.2 (#1, Aug 23 1999, 14:42:39) [GCC 2.7.2.3] on linux2 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> import os >>> os.setsid <built-in function setsid> </F> From mhammond at skippinet.com.au Tue Nov 16 22:54:15 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 08:54:15 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl> Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat> [Andy writes:] > Leave JISXXX and the CJK stuff out. If you get into Japanese, you > really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there [Then Marc relpies:] > 2. give more information to the unicodec registry: > one could register classes instead of instances which the Unicode [Jack chimes in with:] > I would suggest adding the Dos, Windows and Macintosh > standard 8-bit charsets > (their equivalents of latin-1) too, as documents in these > encoding are pretty > ubiquitous. But maybe these should only be added on the > respective platforms. [And the conversation twisted around to Greg noting:] > Next, the number of "open" calls: > > Solaris Linux IRIX > Perl 16 10 9 > Python 107 71 48 This is leading me to conclude that our "codec registry" should be the file system, and Python modules. Would it be possible to define a "standard package" called "encodings", and when we need an encoding, we simply attempt to load a module from that package? The key benefits I see are: * No need to load modules simply to register a codec (which would make the number of open calls even higher, and the startup time even slower.) 
This makes it truly demand-loading of the codecs, rather than explicit load-and-register. * Making language specific distributions becomes simple - simply select a different set of modules from the "encodings" directory. The Python source distribution has them all, but (say) the Windows binary installer selects only a few. The Japanese binary installer for Windows installs a few more. * Installing new codecs becomes trivial - no need to hack site.py etc - simply copy the new "codec module" to the encodings directory and you are done. * No serious problem for GMcM's installer nor for freeze We would probably need to assume that certain codes exist for _all_ platforms and language - but this is no different to assuming that "exceptions.py" also exists for all platforms. Is this worthy of consideration? Mark. From andy at robanal.demon.co.uk Wed Nov 17 01:14:06 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:14:06 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com> References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com> Message-ID: <3836f28c.4929177@post.demon.co.uk> On Tue, 16 Nov 1999 09:39:20 +0100, you wrote: >1) codes written according to the "data > consumer model", instead of the "stream" > model. > > class myDecoder: > def __init__(self, target): > self.target = target > self.state = ... > def feed(self, data): > ... extract as much data as possible ... > self.target.feed(extracted data) > def close(self): > ... extract what's left ... > self.target.feed(additional data) > self.target.close() > Apart from feed() instead of write(), how is that different from a Java-like Stream writer as Guido suggested? He said: >Andy's file translation example could then be written as follows: > ># assuming variables input_file, input_encoding, output_file, ># output_encoding, and constant BUFFER_SIZE > >f = open(input_file, "rb") >f1 = unicodec.codecs[input_encoding].stream_reader(f) >g = open(output_file, "wb") >g1 = unicodec.codecs[output_encoding].stream_writer(f) > >while 1: > buffer = f1.read(BUFFER_SIZE) > if not buffer: > break > f2.write(buffer) > >f2.close() >f1.close() > >Note that we could possibly make these the only API that a codec needs >to provide; the string object <--> unicode object conversions can be >done using this and the cStringIO module. (On the other hand it seems >a common case that would be quite useful.) - Andy From gstein at lyra.org Wed Nov 17 03:03:21 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST) Subject: [Python-Dev] shared data In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <Pine.LNX.4.10.9911161756290.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Gordon McMillan wrote: > Barry A. Warsaw writes: > > One approach might be to support loading modules out of jar files > > (or whatever) using Greg imputils. We could put the bootstrap > > .pyc files in this jar and teach Python to import from it first. > > Python installations could even craft their own modules.jar file > > to include whatever modules they are willing to "hard code". > > This, with -S might make Python start up much faster, at the > > small cost of some flexibility (which could be regained with a > > c.l. switch or other mechanism to bypass modules.jar). 
> > Couple hundred Windows users have been doing this for > months (http://starship.python.net/crew/gmcm/install.html). > The .pyz files are cross-platform, although the "embedding" > app would have to be redone for *nix, (and all the embedding > really does is keep Python from hunting all over your disk). > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. I've got a patch from Jim Ahlstrom to provide a "standardized" library file. I've got to review and fold that thing in (I'll post here when that is done). As Gordon states: yes, the startup time is considerably improved. The DBM approach is interesting. That could definitely be used thru an imputils Importer; it would be quite interesting to try that out. (Note that the library style approach would be even harder to deal with updates, relative to what Sjoerd saw with the DBM approach; I would guess that the "right" approach is to rebuild the library from scratch and atomically replace the thing (but that would bust people with open references...)) Certainly something to look at. Cheers, -g p.s. I also want to try mmap'ing a library and creating code objects that use PyBufferObjects (rather than PyStringObjects) that refer to portions of the mmap. Presuming the mmap is shared, there "should" be a large reduction in heap usage. Question is that I don't know the proportion of code bytes to other heap usage caused by loading a .pyc. p.p.s. I also want to try the buffer approach for frozen code. -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 03:29:42 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911161821230.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote: > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). These can be encapsulated either > before or after hitting the registry, and can be None. The registry I'm with Fred here; he beat me to the punch (and his email is better than what I'd write anyhow :-). I'd like to see the API be *functions* rather than a particular class specification. If the spec is going to say "do not alter/store state", then a function makes much more sense than a method on an object. Of course, bound method objects could be registered. This might occur if you have a general JIS encode/decoder but need to instantiate it a little differently for each JIS variant. (Andy also mentioned something about "options" in JIS encoding/decoding) > and provide default implementations from what is provided (stream > handlers from the functions, or functions from the stream handlers) as > required. Excellent idea... "I'll provide the encode/decode functions, but I don't have a spiffy algorithm for streaming -- please provide a stream wrapper for my functions." > Ideally, I should be able to write a module with four well-known > entry points and then provide the module object itself as the > registration entry. Or I could construct a new object that has the > right interface and register that if it made more sense for the > encoding. 
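To make that concrete, such a module might look roughly like this; every name below is illustrative only, and str()/unicode() stand in for whatever the final conversion API turns out to be:

    # mycodec.py -- sketch of a codec module exposing the four
    # well-known entry points

    def encode(u):
        # unicode object -> 8-bit string (stateless placeholder)
        return str(u)

    def decode(s):
        # 8-bit string -> unicode object (stateless placeholder)
        return unicode(s)

    class StreamWriter:
        def __init__(self, stream):
            self.stream = stream
        def write(self, u):
            self.stream.write(encode(u))

    class StreamReader:
        def __init__(self, stream):
            self.stream = stream
        def read(self, size=-1):
            return decode(self.stream.read(size))

    # registration could then hand over the module object itself:
    #
    #   import unicodec, mycodec
    #   unicodec.register('my-encoding', mycodec)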
Mark's idea about throwing these things into a package for on-demand registrations is much better than a "register-beforehand" model. When the module is loaded from the package, it calls a registration function to insert its 4-tuple of registration data. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 03:40:07 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Mark Hammond wrote: >... > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: >... > Is this worthy of consideration? Absolutely! You will need to provide a way for a module (in the "codec" package) to state *beforehand* that it should be loaded for the X, Y, and Z encodings. This might be in terms of little "info" files that get dropped into the package. The __init__.py module scans the directory for the info files and loads them to build an encoding => module-name mapping. The alternative would be to have stub modules like: iso-8859-1.py: import unicodec def encode_1(...) ... def encode_2(...) ... ... unicodec.register('iso-8859-1', encode_1, decode_1) unicodec.register('iso-8859-2', encode_2, decode_2) ... iso-8859-2.py: import iso-8859-1 I believe that encoding names are legitimate file names, but they aren't necessarily Python identifiers. That kind of bungs up "import codec.iso-8859-1". The codec package would need to programmatically import the modules. Clients should not be directly importing the modules, so I don't see a difficult here. [ if we do decide to allow clients access to the modules, then maybe they have to arrive through a "helper" module that has a nice name, or the codec package provides a "module = code.load('iso-8859-1')" idiom. ] Cheers, -g -- Greg Stein, http://www.lyra.org/ From mhammond at skippinet.com.au Wed Nov 17 03:57:48 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 17 Nov 1999 13:57:48 +1100 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <Pine.LNX.4.10.9911161830020.10639-100000@nebula.lyra.org> Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat> > You will need to provide a way for a module (in the "codec" > package) to > state *beforehand* that it should be loaded for the X, Y, and ... > The alternative would be to have stub modules like: Actually, I was thinking even more radically - drop the codec registry all together, and use modules with "well-known" names (a slight precedent, but Python isnt adverse to well-known names in general) eg: iso-8859-1.py: import unicodec def encode(...): ... def decode(...): ... iso-8859-2.py: from iso-8859-1 import * The codec registry then is trivial, and effectively does not exist (cant get much more trivial than something that doesnt exist :-): def getencoder(encoding): mod = __import__( "encodings." + encoding ) return getattr(mod, "encode") > I believe that encoding names are legitimate file names, but > they aren't > necessarily Python identifiers. That kind of bungs up "import > codec.iso-8859-1". 
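Along those lines, the getencoder() sketch above needs a little more care than it lets on: __import__ on a dotted name hands back the top-level package, not the submodule, and hyphenated encoding names are not legal identifiers anyway. Something like this (again, just a sketch, with a made-up name-folding rule):

    import string, sys

    def getencoder(encoding):
        # fold the encoding name into something importable,
        # e.g. "ISO-8859-1" -> "iso_8859_1"
        modname = string.replace(string.lower(encoding), "-", "_")
        fullname = "encodings." + modname
        __import__(fullname)            # returns the "encodings" package...
        mod = sys.modules[fullname]     # ...so fetch the submodule explicitly
        return getattr(mod, "encode")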
Agreed - clients should never need to import them, and codecs that wish to import other codes could use "__import__" Of course, I am not adverse to the idea of a registry as well and having the modules manually register themselves - but it doesnt seem to buy much, and the logic for getting a codec becomes more complex - ie, it needs to determine the module to import, then look in the registry - if it needs to determine the module anyway, why not just get it from the module and be done with it? Mark. From andy at robanal.demon.co.uk Wed Nov 17 01:18:22 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:18:22 GMT Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat> References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <3837f379.5166829@post.demon.co.uk> On Wed, 17 Nov 1999 08:54:15 +1100, you wrote: >This is leading me to conclude that our "codec registry" should be the >file system, and Python modules. > >Would it be possible to define a "standard package" called >"encodings", and when we need an encoding, we simply attempt to load a >module from that package? The key benefits I see are: [snip] >Is this worthy of consideration? Exactly what I am aiming for. The real icing on the cake would be a small state machine or some helper functions in C which made it possible to write fast codecs in pure Python, but that can come a bit later when we have examples up and running. - Andy From andy at robanal.demon.co.uk Wed Nov 17 01:08:01 1999 From: andy at robanal.demon.co.uk (Andy Robinson) Date: Wed, 17 Nov 1999 00:08:01 GMT Subject: [Python-Dev] Internationalization Toolkit In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim> References: <000601bf2ff7$4d8a4c80$042d153f@tim> Message-ID: <3834f142.4599884@post.demon.co.uk> On Tue, 16 Nov 1999 00:56:18 -0500, you wrote: >[Andy Robinson] >> ... >> I presume no one is actually advocating dropping >> ordinary Python strings, or the ability to do >> rawdata = open('myfile.txt', 'rb').read() >> without any transformations? > >If anyone has advocated either, they've successfully hidden it from me. >Anyone? Well, I hear statements looking forward to when all string-handling is done in Unicode internally. This scares the hell out of me - it is what VB does and that bit us badly on simple stream operations. For encoding work, you will always need raw strings, and often need Unicode ones. - Andy From tim_one at email.msn.com Wed Nov 17 08:33:06 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 02:33:06 -0500 Subject: [Python-Dev] Unicode proposal: %-formatting ? In-Reply-To: <383134AA.4B49D178@lemburg.com> Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim> [MAL] > ... > This means a new PyUnicode_Format() implementation mapping > Unicode format objects to Unicode objects. It's a bitch, isn't it <0.5 wink>? I hope they're paying you a lot for this! > ... hmm, there is a problem there: how should the PyUnicode_Format() > API deal with '%s' when it sees a Unicode object as argument ? Anything other than taking the Unicode characters as-is would be incomprehensible. I mean, it's a Unicode format string sucking up Unicode strings -- what else could possibly make *sense*? > E.g. what would you get in these cases: > > u = u"%s %s" % (u"abc", "abc") That u"abc" gets substituted as-is seems screamingly necessary to me. I'm more baffled about what "abc" should do. I didn't understand the t#/s# etc arguments, and how those do or don't relate to what str() does. 
On the face of it, the idea that a gazillion and one distinct encodings all get lumped into "a string object" without remembering their nature makes about as much sense as if Python were to treat all instances of all user-defined classes as being of a single InstanceType type <wink> -- except in the latter case you at least get a __class__ attribute to find your way home again. As an ignorant user, I would hope that u"%s" % string had enough sense to know what string's encoding is all on its own, and promote it correctly to Unicode by magic. > Perhaps we need a new marker for "insert Unicode object here". %s means string, and at this level a Unicode object *is* "a string". If this isn't obvious, it's likely because we're too clever about what non-Unicode string objects do in this context. From captainrobbo at yahoo.com Wed Nov 17 08:53:53 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> --- Mark Hammond <mhammond at skippinet.com.au> wrote: > Actually, I was thinking even more radically - drop > the codec registry > all together, and use modules with "well-known" > names (a slight > precedent, but Python isnt adverse to well-known > names in general) > > eg: > iso-8859-1.py: > > import unicodec > def encode(...): > ... > def decode(...): > ... > > iso-8859-2.py: > from iso-8859-1 import * > This is the simplest if each codec really is likely to be implemented in a separate module. But just look at the data! All the iso-8859 encodings need identical functionality, and just have a different mapping table with 256 elements. It would be trivial to implement these in one module. And the wide variety of Japanese encodings (mostly corporate or historical variants of the same character set) are again best treated from one code base with a bunch of mapping tables and routines to generate the variants - basically one can store the deltas. So the choice is between possibly having a lot of almost-dummy modules, or having Python modules which generate and register a logical family of encodings. I may have some time next week and will try to code up a few so we can pound on something. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From captainrobbo at yahoo.com Wed Nov 17 08:58:23 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST) Subject: [Python-Dev] Unicode proposal: %-formatting ? Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com> --- Tim Peters <tim_one at email.msn.com> wrote: > I'm more baffled about what "abc" should do. I > didn't understand the t#/s# > etc arguments, and how those do or don't relate to > what str() does. On the > face of it, the idea that a gazillion and one > distinct encodings all get > lumped into "a string object" without remembering > their nature makes about > as much sense as if Python were to treat all > instances of all user-defined > classes as being of a single InstanceType type > <wink> -- except in the > latter case you at least get a __class__ attribute > to find your way home > again. Well said. 
When the core stuff is done, I'm going to implement a set of "TypedString" helper routines which will remember what they are encoded in and won't let you abuse them by concatenating or otherwise mixing different encodings. If you are consciously working with multi-encoding data, this higher level of abstraction is really useful. But I reckon that can be done in pure Python (just overload '%;, '+' etc. with some encoding checks). - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From mal at lemburg.com Wed Nov 17 11:03:59 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:03:59 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000201bf30d3$cb2cb240$a42d153f@tim> Message-ID: <38327D8F.7A5352E6@lemburg.com> Tim Peters wrote: > > [MAL] > > ...demo script... > > It looks like > > r'\\u0000' > > will get translated into a 2-character Unicode string. Right... > That's probably not > good, if for no other reason than that Java would not do this (it would > create the obvious 7-character Unicode string), and having something that > looks like a Java escape that doesn't *work* like the Java escape will be > confusing as heck for JPython users. Keeping track of even-vs-odd number of > backslashes can't be done with a regexp search, but is easy if the code is > simple <wink>: > ...Tim's version of the demo... Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme: With the 'unicode-escape' encoding being defined as: ? all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061). ? all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF. ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u. Examples: u'abc' -> U+0061 U+0062 U+0063 u'\u1234' -> U+1234 u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c Now how should we define ur"abc\u1234\n" ... ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one at email.msn.com Wed Nov 17 10:31:27 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:31:27 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <000801bf30de$85bea500$a42d153f@tim> [Guido] > ... > I'm hoping for several kind of responses to this email: > ... > - requests for checkin privileges, preferably with a specific issue > or area of expertise for which the requestor will take responsibility. I'm specifically requesting not to have checkin privileges. So there. I see two problems: 1. When patches go thru you, you at least eyeball them. This catches bugs and design errors early. 2. For a multi-platform app, few people have adequate resources for testing; e.g., I can test under an obsolete version of Win95, and NT if I have to, but that's it. 
You may not actually do better testing than that, but having patches go thru you allows me the comfort of believing you do <wink>. From mal at lemburg.com Wed Nov 17 11:11:05 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 11:11:05 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <00f701bf307d$20f0cb00$0501a8c0@bobcat> Message-ID: <38327F39.AA381647@lemburg.com> Mark Hammond wrote: > > This is leading me to conclude that our "codec registry" should be the > file system, and Python modules. > > Would it be possible to define a "standard package" called > "encodings", and when we need an encoding, we simply attempt to load a > module from that package? The key benefits I see are: > > * No need to load modules simply to register a codec (which would make > the number of open calls even higher, and the startup time even > slower.) This makes it truly demand-loading of the codecs, rather > than explicit load-and-register. > > * Making language specific distributions becomes simple - simply > select a different set of modules from the "encodings" directory. The > Python source distribution has them all, but (say) the Windows binary > installer selects only a few. The Japanese binary installer for > Windows installs a few more. > > * Installing new codecs becomes trivial - no need to hack site.py > etc - simply copy the new "codec module" to the encodings directory > and you are done. > > * No serious problem for GMcM's installer nor for freeze > > We would probably need to assume that certain codes exist for _all_ > platforms and language - but this is no different to assuming that > "exceptions.py" also exists for all platforms. > > Is this worthy of consideration? Why not... using the new registry scheme I proposed in the thread "Codecs and StreamCodecs" you could implement this via factory_functions and lazy imports (with the encoding name folded to make up a proper Python identifier, e.g. hyphens get converted to '' and spaces to '_'). I'd suggest grouping encodings: [encodings] [iso} [iso88591] [iso88592] [jis] ... [cyrillic] ... [misc] The unicodec registry could then query encodings.get(encoding,action) and the package would take care of the rest. Note that the "walk-me-up-scotty" import patch would probably be nice in this situation too, e.g. to reach the modules in [misc] or in higher levels such the ones in [iso] from [iso88591]. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 17 10:29:34 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:29:34 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> Message-ID: <3832757E.B9503606@lemburg.com> Fredrik Lundh wrote: > > -------------------------------------------------------------------- > A PIL-like Unicode Codec Proposal > -------------------------------------------------------------------- > > In the PIL model, the codecs are called with a piece of data, and > returns the result to the caller. The codecs maintain internal state > when needed. > > class decoder: > > def decode(self, s, offset=0): > # decode as much data as we possibly can from the > # given string. 
if there's not enough data in the > # input string to form a full character, return > # what we've got this far (this might be an empty > # string). > > def flush(self): > # flush the decoding buffers. this should usually > # return None, unless the fact that knowing that the > # input stream has ended means that the state can be > # interpreted in a meaningful way. however, if the > # state indicates that there last character was not > # finished, this method should raise a UnicodeError > # exception. Could you explain for reason for having a .flush() method and what it should return. Note that the .decode method is not so much different from my Codec.decode method except that it uses a single offset where my version uses a slice (the offset is probably the better variant, because it avoids data truncation). > class encoder: > > def encode(self, u, offset=0, buffersize=0): > # encode data from the given offset in the input > # unicode string into a buffer of the given size > # (or slightly larger, if required to proceed). > # if the buffer size is 0, the decoder is free > # to pick a suitable size itself (if at all > # possible, it should make it large enough to > # encode the entire input string). returns a > # 2-tuple containing the encoded data, and the > # number of characters consumed by this call. Dito. > def flush(self): > # flush the encoding buffers. returns an ordinary > # string (which may be empty), or None. > > Note that a codec instance can be used for a single string; the codec > registry should hold codec factories, not codec instances. In > addition, you may use a single type or class to implement both > interfaces at once. Perhaps I'm missing something, but how would you define stream codecs using this interface ? > Implementing stream codecs is left as an exercise (see the zlib > material in the eff-bot guide for a decoder example). ...? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 17 10:55:05 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 17 Nov 1999 10:55:05 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us> Message-ID: <38327B79.2415786B@lemburg.com> "Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Wouldn't it be possible to have the read/write methods set up > > the state when called for the first time ? > > That slows the down; the constructor should handle initialization. > Perhaps what gets registered should be: encoding function, decoding > function, stream encoder factory (can be a class), stream decoder > factory (again, can be a class). Guido proposed the factory approach too, though not seperated into these 4 APIs (note that your proposal looks very much like what I had in the early version of my proposal). Anyway, I think that factory functions are the way to go, because they offer more flexibility w/r to reusing already instantiated codecs, importing modules on-the-fly as was suggested in another thread (thereby making codec module import lazy) or mapping encoder and decoder requests all to one class. So here's a new registry approach: unicodec.register(encoding,factory_function,action) with encoding - name of the supported encoding, e.g. 
Shift_JIS factory_function - a function that returns an object or function ready to be used for action action - a string stating the supported action: 'encode' 'decode' 'stream write' 'stream read' The factory_function API depends on the implementation of the codec. The returned object's interface on the value of action: Codecs: ------- obj = factory_function_for_<action>(errors='strict') 'encode': obj(u,slice=None) -> Python string 'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed) factory_functions are free to return simple function objects for stateless encodings. StreamCodecs: ------------- obj = factory_function_for_<action>(stream,errors='strict') obj should provide access to all methods defined for the stream object, overriding these: 'stream write': obj.write(u,slice=None) -> bytes written to stream obj.flush() -> ??? 'stream read': obj.read(chunksize=0) -> (Unicode object, bytes read) obj.flush() -> ??? errors is defined like in my Codec spec. The codecs are expected to use this argument to handle error conditions. I'm not sure what Fredrik intended with the .flush() methods, so the definition is still open. I would expect it to do some finalization of state. Perhaps we need another set of actions for the .feed()/.close() approach... As in earlier version of the proposal: The registry should provide default implementations for missing action factory_functions using the other registered functions, e.g. 'stream write' can be emulated using 'encode' and 'stream read' using 'decode'. The same probably holds for feed approach. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 44 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one at email.msn.com Wed Nov 17 09:14:38 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:14:38 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <3831350B.8F69CB6D@lemburg.com> Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim> [MAL] > ... > Here is a sample implementation of what I had in mind: > > """ Demo for 'unicode-escape' encoding. > """ > import struct,string,re > > pack_format = '>H' > > def convert_string(s): > > l = map(None,s) > for i in range(len(l)): > l[i] = struct.pack(pack_format,ord(l[i])) > return l > > u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})') > > def unicode_unescape(s): > > l = [] > start = 0 > while start < len(s): > m = u_escape.search(s,start) > if not m: > l[len(l):] = convert_string(s[start:]) > break > m_start,m_end = m.span() > if m_start > start: > l[len(l):] = convert_string(s[start:m_start]) > hexcode = m.group(1) > #print hexcode,start,m_start > if len(hexcode) != 4: > raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode > ordinal = string.atoi(hexcode,16) > l.append(struct.pack(pack_format,ordinal)) > start = m_end > #print l > return string.join(l,'') > > def hexstr(s,sep=''): > > return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % > ord(x),s),sep) It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. 
Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>: def unicode_unescape(s): from string import atoi import array i, n = 0, len(s) result = array.array('H') # unsigned short, native order while i < n: ch = s[i] i = i+1 if ch != "\\": result.append(ord(ch)) continue if i == n: raise ValueError("string ends with lone backslash") ch = s[i] i = i+1 if ch != "u": result.append(ord("\\")) result.append(ord(ch)) continue hexchars = s[i:i+4] if len(hexchars) != 4: raise ValueError("\\u escape at end not followed by " "at least 4 characters") i = i+4 for ch in hexchars: if ch not in "01234567890abcdefABCDEF": raise ValueError("\\u" + hexchars + " contains " "non-hex characters") result.append(atoi(hexchars, 16)) # print result return result.tostring() From tim_one at email.msn.com Wed Nov 17 09:47:48 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:48 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <383156DF.2209053F@lemburg.com> Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim> [MAL] > FYI, the next version of the proposal ... > File objects opened in text mode will use "t#" and binary ones use "s#". Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>. From tim_one at email.msn.com Wed Nov 17 09:47:46 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 03:47:46 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <383140F3.EDDB307A@lemburg.com> Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim> [Jack Jansen] > I would suggest adding the Dos, Windows and Macintosh standard > 8-bit charsets (their equivalents of latin-1) too, as documents > in these encoding are pretty ubiquitous. But maybe these should > only be added on the respective platforms. [MAL] > Good idea. What code pages would that be ? I'm not clear on what's being suggested; e.g., Windows supports *many* different "code pages". CP 1252 is default in the U.S., and is an extension of Latin-1. See e.g. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT which appears to be up-to-date (has 0x80 as the euro symbol, Unicode U+20AC -- although whether your version of U.S. Windows actually has this depends on whether you installed the service pack that added it!). See ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT for the closest DOS got. From tim_one at email.msn.com Wed Nov 17 10:05:21 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:05:21 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us> Message-ID: <000601bf30da$e069d820$a42d153f@tim> [Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred] > Yes, but still not in the core. So we have two general examples > (vrefs and mxProxy) and there's WeakDict (or something like that). I > think there really needs to be a core facility for this. This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. 
Buy him lunch tomorrow and get him excited <wink>. From tim_one at email.msn.com Wed Nov 17 10:10:24 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 04:10:24 -0500 Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...) In-Reply-To: <1269351119-9152905@hypernet.com> Message-ID: <000701bf30db$94d4ac40$a42d153f@tim> [Gordon McMillan] > ... > Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a > diskette with a little room left over. That's truly remarkable (he says while waiting for the Inbox Repair Tool to finish repairing his 50Mb Outlook mail file ...)! > but-since-its-WIndows-it-must-be-tainted-ly y'rs Indeed -- if it runs on Windows, it's a worthless piece o' crap <wink>. From fredrik at pythonware.com Wed Nov 17 12:00:10 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:00:10 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> M.-A. Lemburg <mal at lemburg.com> wrote: > > def flush(self): > > # flush the decoding buffers. this should usually > > # return None, unless the fact that knowing that the > > # input stream has ended means that the state can be > > # interpreted in a meaningful way. however, if the > > # state indicates that there last character was not > > # finished, this method should raise a UnicodeError > > # exception. > > Could you explain for reason for having a .flush() method > and what it should return. in most cases, it should either return None, or raise a UnicodeError exception: >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> # yes, that's a valid Swedish sentence ;-) >>> s = u.encode("utf-8") >>> d = decoder("utf-8") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() UnicodeError: last character not complete on the other hand, there are situations where it might actually return a string. consider a "HTML entity decoder" which uses the following pattern to match a character entity: "&\w+;?" (note that the trailing semicolon is optional). >>> u = unicode("å i åa ä e ö", "iso-latin-1") >>> s = u.encode("html-entities") >>> d = decoder("html-entities") >>> d.decode(s[:-1]) "å i åa ä e " >>> d.flush() "ö" > Perhaps I'm missing something, but how would you define > stream codecs using this interface ? input: read chunks of data, decode, and keep extra data in a local buffer. output: encode data into suitable chunks, and write to the output stream (that's why there's a buffersize argument to encode -- if someone writes a 10mb unicode string to an encoded stream, python shouldn't allocate an extra 10-30 megabytes just to be able to encode the darn thing...) > > Implementing stream codecs is left as an exercise (see the zlib > material in the eff-bot guide for a decoder example). everybody should have a copy of the eff-bot guide ;-) (but alright, I plan to post a complete utf-8 implementation in a not too distant future). </F> From gstein at lyra.org Wed Nov 17 11:57:36 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST) Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: <38327F39.AA381647@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >...
> I'd suggest grouping encodings: > > [encodings] > [iso} > [iso88591] > [iso88592] > [jis] > ... > [cyrillic] > ... > [misc] WHY?!?! This is taking a simple solution and making it complicated. I see no benefit to the creating yet-another-level-of-hierarchy. Why should they be grouped? Leave the modules just under "encodings" and be done with it. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 17 12:14:01 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <38327B79.2415786B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, M.-A. Lemburg wrote: >... > Anyway, I think that factory functions are the way to go, > because they offer more flexibility w/r to reusing already > instantiated codecs, importing modules on-the-fly as was > suggested in another thread (thereby making codec module > import lazy) or mapping encoder and decoder requests all > to one class. Why a factory? I've got a simple encode() function. I don't need a factory. "flexibility" at the cost of complexity (IMO). > So here's a new registry approach: > > unicodec.register(encoding,factory_function,action) > > with > encoding - name of the supported encoding, e.g. Shift_JIS > factory_function - a function that returns an object > or function ready to be used for action > action - a string stating the supported action: > 'encode' > 'decode' > 'stream write' > 'stream read' This action thing is subject to error. *if* you're wanting to go this route, then have: unicodec.register_encode(...) unicodec.register_decode(...) unicodec.register_stream_write(...) unicodec.register_stream_read(...) They are equivalent. Guido has also told me in the past that he dislikes parameters that alter semantics -- preferring different functions instead. (this is why there are a good number of PyBufferObject interfaces; I had fewer to start with) This suggested approach is also quite a bit more wordy/annoying than Fred's alternative: unicode.register('iso-8859-1', encoder, decoder, None, None) And don't say "future compatibility allows us to add new actions." Well, those same future changes can add new registration functions or additional parameters to the single register() function. Not that I'm advocating it, but register() could also take a single parameter: if a class, then instantiate it and call methods for each action; if an instance, then just call methods for each action. [ and the third/original variety: a function object as the first param is the actual hook, and params 2 thru 4 (each are optional, or just the stream funcs?) are the other hook functions ] > The factory_function API depends on the implementation of > the codec. The returned object's interface on the value of action: > > Codecs: > ------- > > obj = factory_function_for_<action>(errors='strict') Where does this "errors" value come from? How does a user alter that value? Without an ability to change this, I see no reason for a factory. [ and no: don't tell me it is a thread-state value :-) ] On the other hand: presuming the "errors" thing is valid, *then* I see a need for a factory. Truly... I dislike factories. IMO, they just add code/complexity in many cases where the functionality isn't needed. 
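To make the trade-off concrete, the two styles look about like this (all names are invented, and a plain list of ordinals stands in for a Unicode string, since there is nothing real to run this against yet):

import string

# Style 1: a plain function.  No state, no factory, nothing to configure.
def latin1_encode(ordinals):
    chars = []
    for o in ordinals:
        if o > 255:
            raise ValueError("ordinal not in range(256)")
        chars.append(chr(o))
    return string.join(chars, "")

# Style 2: a factory that captures an error policy up front.
def latin1_encoder(errors='strict'):
    def encode(ordinals, errors=errors):
        chars = []
        for o in ordinals:
            if o > 255:
                if errors == 'replace':
                    o = ord('?')
                elif errors == 'ignore':
                    continue
                else:
                    raise ValueError("ordinal not in range(256)")
            chars.append(chr(o))
        return string.join(chars, "")
    return encode

The factory buys you exactly one thing: a place to hang the errors argument.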
But that's just me :-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From captainrobbo at yahoo.com Wed Nov 17 12:17:00 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST) Subject: [Python-Dev] Rosette i18n API Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com> There is a very capable C++ library at http://rosette.basistech.com/ It is well worth looking at the things this API actually lets you do for ideas on patterns. - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From gstein at lyra.org Wed Nov 17 12:21:18 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: > [MAL] > > FYI, the next version of the proposal ... > > File objects opened in text mode will use "t#" and binary ones use "s#". > > Am I the only one who sees magical distinctions between text and binary mode > as a Really Bad Idea? I wouldn't have guessed the Unix natives here would > quietly acquiesce to importing a bit of Windows madness <wink>. It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 17 12:36:32 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:36:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). so where do you put the state? how do you reset the state between strings? how do you handle incremental decoding/encoding? etc. (I suggest taking another look at PIL's codec design. it solves all these problems with a minimum of code, and it works -- people have been hammering on PIL for years...) </F> From gstein at lyra.org Wed Nov 17 12:34:30 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com> Message-ID: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Fredrik Lundh wrote: > Greg Stein <gstein at lyra.org> wrote: > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > so where do you put the state? 
encode() is not supposed to retain state. It is supposed to do a complete translation. It is not a stream thingy, which may have received partial characters. > how do you reset the state between > strings? There is none :-) > how do you handle incremental > decoding/encoding? Streams. -g -- Greg Stein, http://www.lyra.org/ From fredrik at pythonware.com Wed Nov 17 12:46:01 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:46:01 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > - suggestions for new issues that maybe ought to be settled in 1.6 three things: imputil, imputil, imputil </F> From fredrik at pythonware.com Wed Nov 17 12:51:33 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 17 Nov 1999 12:51:33 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170331560.10639-100000@nebula.lyra.org> Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com> Greg Stein <gstein at lyra.org> wrote: > > so where do you put the state? > > encode() is not supposed to retain state. It is supposed to do a complete > translation. It is not a stream thingy, which may have received partial > characters. > > > how do you handle incremental > > decoding/encoding? > > Streams. hmm. why have two different mechanisms when you can do the same thing with one? </F> From gstein at lyra.org Wed Nov 17 14:01:47 1999 From: gstein at lyra.org (Greg Stein) Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST) Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> On Tue, 16 Nov 1999, Guido van Rossum wrote: >... > Greg, I understand you have checkin privileges for Apache. What is > the procedure there for handing out those privileges? What is the > procedure for using them? (E.g. if you made a bogus change to part of > Apache you're not supposed to work on, what happens?) Somebody proposes that a person is added to the list of people with checkin privileges. If nobody else in the group vetoes that, then they're in (their system doesn't require continual participation by each member, so it can only operate at a veto level, rather than a unanimous assent). It is basically determined on the basis of merit -- has the person been active (on the Apache developer's mailing list) and has the person contributed something significant? Further, by providing commit access, will they further the goals of Apache? And, of course, does their temperament seem to fit in with the other group members? I can make any change that I'd like. However, there are about 20 other people who can easily revert or alter my changes if they're bogus. There are no programmatic restrictions.... You could say it is based on mutual respect and a social contract of behavior. Large changes should be discussed before committing to CVS. Bug fixes, doc enhancements, minor functional improvements, etc, all follow a commit-then-review process. I just check the thing in. Others see the diff (emailed to the checkins mailing list (this is different from Python-checkins which only says what files are changed, rather than providing the diff)) and can comment on the change, make their own changes, etc. To be concrete: I added the Expat code that now appears in Apache 1.3.9. 
Before doing so, I queried the group. There were some issues that I dealt with before finally commiting Expat to the CVS repository. On another occasion, I added a new API to Apache; again, I proposed it first, got an "all OK" and committed it. I've done a couple bug fixes which I just checked in. [ "all OK" means three +1 votes and no vetoes. everybody has veto ability (but the responsibility to explain why and to remove their veto when their concerns are addressed). ] On many occasions, I've reviewed the diffs that were posted to the checkins list, and made comments back to the author. I've caught a few problems this way. For Apache 2.0, even large changes are commit-then-review at this point. At some point, it will switch over to review-then-commit and the project will start moving towards stabilization/release. (bug fixes and stuff will always remain commit-then-review) I'll note that the process works very well given that diffs are emailed. I doubt that it would be effective if people had to fetch CVS diffs themselves. Your note also implies "areas of ownership". This doesn't really exist within Apache. There aren't even "primary authors" or things like that. I have the ability/rights to change any portions: from the low-level networking, to the documentation, to the server-side include processing. Of coures, if I'm going to make a big change, then I'll be posting a patch for review first, and whoever has worked in that area in the past may/will/should comment. Cheers, -g -- Greg Stein, http://www.lyra.org/ From guido at CNRI.Reston.VA.US Wed Nov 17 14:32:05 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:32:05 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST." <000801bf30de$85bea500$a42d153f@tim> References: <000801bf30de$85bea500$a42d153f@tim> Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us> > I'm specifically requesting not to have checkin privileges. So there. I will force nobody to use checkin privileges. However I see that for some contributors, checkin privileges will save me and them time. > I see two problems: > > 1. When patches go thru you, you at least eyeball them. This catches bugs > and design errors early. I will still eyeball them -- only after the fact. Since checkins are pretty public, being slapped on the wrist for a bad checkin is a pretty big embarrassment, so few contributors will check in buggy code more than once. Moreover, there will be more eyeballs. > 2. For a multi-platform app, few people have adequate resources for testing; > e.g., I can test under an obsolete version of Win95, and NT if I have to, > but that's it. You may not actually do better testing than that, but having > patches go thru you allows me the comfort of believing you do <wink>. I expect that the same mechanisms will apply. I have access to Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to check portability after things have been checked in. And again, there will be more testers. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:34:23 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:34:23 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST." 
<19991117075353.16046.rocketmail@web606.mail.yahoo.com> References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us> > This is the simplest if each codec really is likely to > be implemented in a separate module. But just look at > the data! All the iso-8859 encodings need identical > functionality, and just have a different mapping table > with 256 elements. It would be trivial to implement > these in one module. And the wide variety of Japanese > encodings (mostly corporate or historical variants of > the same character set) are again best treated from > one code base with a bunch of mapping tables and > routines to generate the variants - basically one can > store the deltas. > > So the choice is between possibly having a lot of > almost-dummy modules, or having Python modules which > generate and register a logical family of encodings. > > I may have some time next week and will try to code up > a few so we can pound on something. I see no problem with having a lot of near-dummy modules if it simplifies the architecture. You can still do code sharing. Files are cheap; APIs are expensive. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:38:35 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:38:35 -0500 Subject: [Python-Dev] Some thoughts on the codecs... In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST." <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us> > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Agreed. Tim Peters once remarked that Python likes shallow encodings (or perhaps that *I* like them :-). This is one such case where I would strongly urge for the simplicity of a shallow hierarchy. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:43:44 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:43:44 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST." <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us> > Why a factory? I've got a simple encode() function. I don't need a > factory. "flexibility" at the cost of complexity (IMO). Unless there are certain cases where factories are useful. But let's read on... > > action - a string stating the supported action: > > 'encode' > > 'decode' > > 'stream write' > > 'stream read' > > This action thing is subject to error. *if* you're wanting to go this > route, then have: > > unicodec.register_encode(...) > unicodec.register_decode(...) > unicodec.register_stream_write(...) > unicodec.register_stream_read(...) > > They are equivalent. Guido has also told me in the past that he dislikes > parameters that alter semantics -- preferring different functions instead. Yes, indeed! (But weren't we going to do away with the whole registry idea in favor of an encodings package?) 
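What I have in mind is roughly the following -- sketch only, every name invented: the package maps the encoding name onto a module name, imports the codec module the first time it is asked for, and caches the result, so codecs that live in the package never need an explicit registration step at all.

import string

_cache = {}

def lookup(encoding):
    # return the (encode, decode) pair for an encoding name
    try:
        return _cache[encoding]
    except KeyError:
        pass
    # make the name importable, e.g. "ISO-8859-1" -> "iso_8859_1"
    modname = string.replace(string.lower(encoding), "-", "_")
    package = __import__("encodings." + modname)
    module = getattr(package, modname)      # __import__ returns the package
    entry = (module.encode, module.decode)  # assume each codec module exports these
    _cache[encoding] = entry
    return entry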
> Not that I'm advocating it, but register() could also take a single > parameter: if a class, then instantiate it and call methods for each > action; if an instance, then just call methods for each action. Nah, that's bad -- a class is just a factory, and once you are allowing classes it's really good to also allowing factory functions. > [ and the third/original variety: a function object as the first param is > the actual hook, and params 2 thru 4 (each are optional, or just the > stream funcs?) are the other hook functions ] Fine too. They should all be optional. > > obj = factory_function_for_<action>(errors='strict') > > Where does this "errors" value come from? How does a user alter that > value? Without an ability to change this, I see no reason for a factory. > [ and no: don't tell me it is a thread-state value :-) ] > > On the other hand: presuming the "errors" thing is valid, *then* I see a > need for a factory. The idea is that various places that take an encoding name can also take a codec instance. So the user can call the factory function / class constructor. > Truly... I dislike factories. IMO, they just add code/complexity in many > cases where the functionality isn't needed. But that's just me :-) Get over it... In a sense, every Python class is a factory for its own instances! I think you must be confusing Python with Java or C++. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Wed Nov 17 14:56:56 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Wed, 17 Nov 1999 08:56:56 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST." <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> References: <Pine.LNX.4.10.9911170441360.10639-100000@nebula.lyra.org> Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us> > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then they're > in (their system doesn't require continual participation by each member, > so it can only operate at a veto level, rather than a unanimous assent). > It is basically determined on the basis of merit -- has the person been > active (on the Apache developer's mailing list) and has the person > contributed something significant? Further, by providing commit access, > will they further the goals of Apache? And, of course, does their > temperament seem to fit in with the other group members? This makes sense, but I have one concern: if somebody who isn't liked very much (say a capable hacker who is a real troublemaker) asks for privileges, would people veto this? I'd be reluctant to go on record as veto'ing a particular person. (E.g. there are a few troublemakers in c.l.py, and I would never want them to join python-dev let alone give them commit privileges, but I'm not sure if I would want to discuss this on a publicly archived mailing list -- or even on a privately archived mailing list, given that the number of members might be in the hundreds. [...stuff I like...] > I'll note that the process works very well given that diffs are emailed. I > doubt that it would be effective if people had to fetch CVS diffs > themselves. That's a great idea; I'll see if we can do that to our checkin email, regardless of whether we hand out commit privileges. > Your note also implies "areas of ownership". This doesn't really exist > within Apache. 
There aren't even "primary authors" or things like that. I > have the ability/rights to change any portions: from the low-level > networking, to the documentation, to the server-side include processing. But that's Apache, which is explicitly run as a collective. In Python, I definitely want to have ownership of certain sections of the code. But I agree that this doesn't need to be formalized by access control lists; the social process you describe sounds like it will work just fine. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake at acm.org Wed Nov 17 15:44:25 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST) Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <000601bf30da$e069d820$a42d153f@tim> References: <14385.33486.855802.187739@weyr.cnri.reston.va.us> <000601bf30da$e069d820$a42d153f@tim> Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Tim Peters writes: > about it. Guido hasn't shown visible interest, and nobody has been willing > to fight him to the death over it. So it languishes. Buy him lunch > tomorrow and get him excited <wink>. Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From tim_one at email.msn.com Thu Nov 18 04:21:16 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:16 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: <38327D8F.7A5352E6@lemburg.com> Message-ID: <000101bf3173$f9805340$c0a0143f@tim> [MAL] > Guido and I have decided to turn \uXXXX into a standard > escape sequence with no further magic applied. \uXXXX will > only be expanded in u"" strings. Does that exclude ur"" strings? Not arguing either way, just don't know what all this means. > Here's the new scheme: > > With the 'unicode-escape' encoding being defined as: > > ? all non-escape characters represent themselves as a Unicode ordinal > (e.g. 'a' -> U+0061). Same as before (scream if that's wrong). > ? all existing defined Python escape sequences are interpreted as > Unicode ordinals; Same as before (ditto). > note that \xXXXX can represent all Unicode ordinals, This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition? > and \OOO (octal) can represent Unicode ordinals up to U+01FF. Same as before (ditto). > ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > error to have fewer than 4 digits after \u. Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes. > Examples: > > u'abc' -> U+0061 U+0062 U+0063 > u'\u1234' -> U+1234 > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c The last example is damaged (U+05c isn't legit). Other than that, these look the same as before. > Now how should we define ur"abc\u1234\n" ... ? If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. 
If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it). From tim_one at email.msn.com Thu Nov 18 04:21:19 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:21:19 -0500 Subject: [Python-Dev] just say no... In-Reply-To: <Pine.LNX.4.10.9911170316380.10639-100000@nebula.lyra.org> Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim> [MAL] > File objects opened in text mode will use "t#" and binary > ones use "s#". [Greg Stein] > ... > The real annoying thing would be to assume that opening a file as 'r' > means that I *meant* text mode and to start using "t#". Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode. > In actuality, I typically open files that way since I do most of my > coding on Linux. If I now have to pay attention to things and open it > as 'rb', then I'll be pissed. > > And the change in behavior and bugs that interpreting 'r' as text would > introduce? Ack! 'r' is already intepreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be. From tim_one at email.msn.com Thu Nov 18 04:23:00 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:23:00 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us> Message-ID: <000301bf3174$37b465c0$c0a0143f@tim> [Guido] > I will force nobody to use checkin privileges. That almost went without saying <wink>. > However I see that for some contributors, checkin privileges will > save me and them time. Then it's Good! Provided it doesn't hurt language stability. I agree that changing the system to mail out diffs addresses what I was worried about there. From tim_one at email.msn.com Thu Nov 18 04:31:38 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:31:38 -0500 Subject: [Python-Dev] Apache process (was: Python 1.6 status) In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us> Message-ID: <000401bf3175$6c089660$c0a0143f@tim> [Greg] > ... > Somebody proposes that a person is added to the list of people with > checkin privileges. If nobody else in the group vetoes that, then ? they're in ... [Guido] > This makes sense, but I have one concern: if somebody who isn't liked > very much (say a capable hacker who is a real troublemaker) asks for > privileges, would people veto this? It seems that a key point in Greg's description is that people don't propose *themselves* for checkin. They have to talk someone else into proposing them. That should keep Endang out of the running for a few years <wink>. After that, I care more about their code than their personalities. If the stuff they check in is good, fine; if it's not, lock 'em out for direct cause. > I'd be reluctant to go on record as veto'ing a particular person. 
Secret Ballot run off a web page -- although not so secret you can't see who voted for what <wink>. From tim_one at email.msn.com Thu Nov 18 04:37:18 1999 From: tim_one at email.msn.com (Tim Peters) Date: Wed, 17 Nov 1999 22:37:18 -0500 Subject: Weak refs (was [Python-Dev] just say no...) In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us> Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim> [Fred L. Drake, Jr.] > Guido has asked me to pursue this topic [weak refs], so I'll be > checking out available implementations and seeing if any are > adoptable or if something different is needed to be fully general > and well-integrated. Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate). From gstein at lyra.org Thu Nov 18 09:09:24 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST) Subject: [Python-Dev] just say no... In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <Pine.LNX.4.10.9911180008020.10639-100000@nebula.lyra.org> On Wed, 17 Nov 1999, Tim Peters wrote: >... > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Thu Nov 18 09:15:04 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:15:04 +0100 Subject: [Python-Dev] just say no... References: <000201bf3173$fb7f7ea0$c0a0143f@tim> Message-ID: <3833B588.1E31F01B@lemburg.com> Tim Peters wrote: > > [MAL] > > File objects opened in text mode will use "t#" and binary > > ones use "s#". > > [Greg Stein] > > ... > > The real annoying thing would be to assume that opening a file as 'r' > > means that I *meant* text mode and to start using "t#". > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > either; a lone "r" has always meant text mode. Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardeless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance. > > In actuality, I typically open files that way since I do most of my > > coding on Linux. If I now have to pay attention to things and open it > > as 'rb', then I'll be pissed. > > > > And the change in behavior and bugs that interpreting 'r' as text would > > introduce? Ack! > > 'r' is already intepreted as text mode, but so far, on Unix-like systems, > there's been no difference between text and binary modes. Introducing a > distinction will certainly cause problems. I don't know what the > compensating advantages are thought to be. I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. 
Only other buffer compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself argueing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 09:59:21 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:59:21 +0100 Subject: [Python-Dev] Python 1.6 status References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com> Message-ID: <3833BFE9.6FD118B1@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > - suggestions for new issues that maybe ought to be settled in 1.6 > > three things: imputil, imputil, imputil But please don't add the current version as default importer... its strategy is way too slow for real life apps (yes, I've tested this: imports typically take twice as long as with the builtin importer). I'd opt for an import manager which provides a useful API for import hooks to register themselves with. What we really need is not yet another complete reimplementation of what the builtin importer does, but rather a more detailed exposure of the various import aspects: finding modules and loading modules. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 09:50:36 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:50:36 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> Message-ID: <3833BDDC.7CD2CC1F@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg <mal at lemburg.com> wrote: > > > def flush(self): > > > # flush the decoding buffers. this should usually > > > # return None, unless the fact that knowing that the > > > # input stream has ended means that the state can be > > > # interpreted in a meaningful way. however, if the > > > # state indicates that there last character was not > > > # finished, this method should raise a UnicodeError > > > # exception. > > > > Could you explain for reason for having a .flush() method > > and what it should return. > > in most cases, it should either return None, or > raise a UnicodeError exception: > > >>> u = unicode("? i ?a ? e ?", "iso-latin-1") > >>> # yes, that's a valid Swedish sentence ;-) > >>> s = u.encode("utf-8") > >>> d = decoder("utf-8") > >>> d.decode(s[:-1]) > "? i ?a ? e " > >>> d.flush() > UnicodeError: last character not complete > > on the other hand, there are situations where it > might actually return a string. consider a "HTML > entity decoder" which uses the following pattern > to match a character entity: "&\w+;?" (note that > the trailing semicolon is optional). > > >>> u = unicode("? i ?a ? e ?", "iso-latin-1") > >>> s = u.encode("html-entities") > >>> d = decoder("html-entities") > >>> d.decode(s[:-1]) > "? i ?a ? 
e " > >>> d.flush() > "?" Ah, ok. So the .flush() method checks for proper string endings and then either returns the remaining input or raises an error. > > Perhaps I'm missing something, but how would you define > > stream codecs using this interface ? > > input: read chunks of data, decode, and > keep extra data in a local buffer. > > output: encode data into suitable chunks, > and write to the output stream (that's why > there's a buffersize argument to encode -- > if someone writes a 10mb unicode string to > an encoded stream, python shouldn't allocate > an extra 10-30 megabytes just to be able to > encode the darn thing...) So the stream codecs would be wrappers around the string codecs. Have you read my latest version of the Codec interface ? Wouldn't that be a reasonable approach ? Note that I have integrated your ideas into the new API -- it's basically only missing the .flush() methods, which I can add now that I know what you meant. > > > Implementing stream codecs is left as an exercise (see the zlib > > > material in the eff-bot guide for a decoder example). > > everybody should have a copy of the eff-bot guide ;-) Sure, but the format, the format... make it printed and add a CD and you would probably have a good selling book there ;-) > (but alright, I plan to post a complete utf-8 implementation > in a not too distant future). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 09:16:48 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:16:48 +0100 Subject: [Python-Dev] Some thoughts on the codecs... References: <Pine.LNX.4.10.9911170255060.10639-100000@nebula.lyra.org> Message-ID: <3833B5F0.FA4620AD@lemburg.com> Greg Stein wrote: > > On Wed, 17 Nov 1999, M.-A. Lemburg wrote: > >... > > I'd suggest grouping encodings: > > > > [encodings] > > [iso} > > [iso88591] > > [iso88592] > > [jis] > > ... > > [cyrillic] > > ... > > [misc] > > WHY?!?! > > This is taking a simple solution and making it complicated. I see no > benefit to the creating yet-another-level-of-hierarchy. Why should they be > grouped? > > Leave the modules just under "encodings" and be done with it. Nevermind, was just an idea... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 09:43:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 09:43:31 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <Pine.LNX.4.10.9911170259150.10639-100000@nebula.lyra.org> <199911171343.IAA03636@kaluha.cnri.reston.va.us> Message-ID: <3833BC33.66E134F@lemburg.com> Guido van Rossum wrote: > > > Why a factory? I've got a simple encode() function. I don't need a > > factory. "flexibility" at the cost of complexity (IMO). > > Unless there are certain cases where factories are useful. But let's > read on... > > > > action - a string stating the supported action: > > > 'encode' > > > 'decode' > > > 'stream write' > > > 'stream read' > > > > This action thing is subject to error. *if* you're wanting to go this > > route, then have: > > > > unicodec.register_encode(...) > > unicodec.register_decode(...) > > unicodec.register_stream_write(...) > > unicodec.register_stream_read(...) > > > > They are equivalent. 
Guido has also told me in the past that he dislikes > > parameters that alter semantics -- preferring different functions instead. > > Yes, indeed! Ok. > (But weren't we going to do away with the whole registry > idea in favor of an encodings package?) One way or another, the Unicode implementation will have to access a dictionary containing references to the codecs for a particular encoding. You won't get around registering these at some point... be it in a lazy way, on-the-fly or by some other means. What we could do is implement the lookup like this: 1. call encodings.lookup_<action>(encoding) and use the return value for the conversion 2. if all fails, cop out with an error Step 1. would do all the import magic and then register the found codecs in some dictionary for faster access (perhaps this could be done in a way that is directly available to the Unicode implementation, e.g. in a global internal dictionary -- the one I originally had in mind for the unicodec registry). > > Not that I'm advocating it, but register() could also take a single > > parameter: if a class, then instantiate it and call methods for each > > action; if an instance, then just call methods for each action. > > Nah, that's bad -- a class is just a factory, and once you are > allowing classes it's really good to also allowing factory functions. > > > [ and the third/original variety: a function object as the first param is > > the actual hook, and params 2 thru 4 (each are optional, or just the > > stream funcs?) are the other hook functions ] > > Fine too. They should all be optional. Ok. > > > obj = factory_function_for_<action>(errors='strict') > > > > Where does this "errors" value come from? How does a user alter that > > value? Without an ability to change this, I see no reason for a factory. > > [ and no: don't tell me it is a thread-state value :-) ] > > > > On the other hand: presuming the "errors" thing is valid, *then* I see a > > need for a factory. > > The idea is that various places that take an encoding name can also > take a codec instance. So the user can call the factory function / > class constructor. Right. The argument is reachable via: Codec = encodings.lookup_encode('utf-8') codec = Codec(errors='?') s = codec(u"abc????") s would then equal 'abc??'. -- Should I go ahead then and change the registry business to the new strategy (via the encodings package in the above sense) ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Thu Nov 18 11:57:44 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 18 Nov 1999 21:57:44 +1100 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833BC33.66E134F@lemburg.com> Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat> [Guido] > > (But weren't we going to do away with the whole registry > > idea in favor of an encodings package?) > [MAL] > One way or another, the Unicode implementation will have to > access a dictionary containing references to the codecs for > a particular encoding. You won't get around registering these > at some point... be it in a lazy way, on-the-fly or by some > other means. What is wrong with my idea of using well-known-names from the encoding module? The dict then is "encodings.<encoding-name>.__dict__". All encodings "just work" because the leverage from the Python module system. 
Unless Im missing something, there is no need for any extra registry at all. I guess it would actually resolve to 2 dict lookups, but thats OK surely? Mark. From mal at lemburg.com Thu Nov 18 10:39:30 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 10:39:30 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> Message-ID: <3833C952.C6F154B1@lemburg.com> Tim Peters wrote: > > [MAL] > > Guido and I have decided to turn \uXXXX into a standard > > escape sequence with no further magic applied. \uXXXX will > > only be expanded in u"" strings. > > Does that exclude ur"" strings? Not arguing either way, just don't know > what all this means. > > > Here's the new scheme: > > > > With the 'unicode-escape' encoding being defined as: > > > > ? all non-escape characters represent themselves as a Unicode ordinal > > (e.g. 'a' -> U+0061). > > Same as before (scream if that's wrong). > > > ? all existing defined Python escape sequences are interpreted as > > Unicode ordinals; > > Same as before (ditto). > > > note that \xXXXX can represent all Unicode ordinals, > > This means that the definition of \xXXXX has changed, then -- as you pointed > out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new > \x definition apply only in u"" strings, or in "" strings too? What is the > new \x definition? Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX). > > and \OOO (octal) can represent Unicode ordinals up to U+01FF. > > Same as before (ditto). > > > ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax > > error to have fewer than 4 digits after \u. > > Same as before (ditto). > > IOW, I don't see anything that's changed other than an unspecified new > treatment of \x escapes, and possibly that ur"" strings don't expand \u > escapes. The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan). > > Examples: > > > > u'abc' -> U+0061 U+0062 U+0063 > > u'\u1234' -> U+1234 > > u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c > > The last example is damaged (U+05c isn't legit). Other than that, these > look the same as before. Corrected; thanks. > > Now how should we define ur"abc\u1234\n" ... ? > > If strings carried an encoding tag with them, the obvious answer is that > this acts exactly like r"abc\u1234\n" acts today except gets a > "unicode-escaped" encoding tag instead of a "[whatever the default is > today]" encoding tag. > > If strings don't carry an encoding tag with them, you're in a bit of a > pickle: you'll have to convert it to a regular string or a Unicode string, > but in either case have no way to communicate that it may need further > processing; i.e., no way to distinguish it from a regular or Unicode string > produced by any other mechanism. The code I posted yesterday remains my > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > fiddling with backslashes just enough to get the \u escapes expanded, in the > same way Java's (conceptual) preprocessor does it). They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. 
scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 12:41:32 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 12:41:32 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> Message-ID: <3833E5EC.AAFE5016@lemburg.com> Mark Hammond wrote: > > [Guido] > > > (But weren't we going to do away with the whole registry > > > idea in favor of an encodings package?) > > > [MAL] > > One way or another, the Unicode implementation will have to > > access a dictionary containing references to the codecs for > > a particular encoding. You won't get around registering these > > at some point... be it in a lazy way, on-the-fly or by some > > other means. > > What is wrong with my idea of using well-known-names from the encoding > module? The dict then is "encodings.<encoding-name>.__dict__". All > encodings "just work" because the leverage from the Python module > system. Unless Im missing something, there is no need for any extra > registry at all. I guess it would actually resolve to 2 dict lookups, > but thats OK surely? The problem is that the encoding names are not Python identifiers, e.g. iso-8859-1 is allowed as identifier. This and the fact that applications may want to ship their own codecs (which do not get installed under the system wide encodings package) make the registry necessary. I don't see a problem with the registry though -- the encodings package can take care of the registration process without any user interaction. There would only have to be an API for looking up an encoding published by the encodings package for the Unicode implementation to use. The magic behind that API is left to the encodings package... BTW, nothing's wrong with your idea :-) In fact, I like it a lot because it keeps the encoding modules out of the top-level scope which is good. PS: we could probably even take the whole codec idea one step further and also allow other input/output formats to be registered, e.g. stream ciphers or pickle mechanisms. The step in that direction is not a big one: we'd only have to drop the specification of the Unicode object in the spec and replace it with an arbitrary object. Of course, this will still have to be a Unicode object for use by the Unicode implementation. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gmcm at hypernet.com Thu Nov 18 15:19:48 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 09:19:48 -0500 Subject: [Python-Dev] Python 1.6 status In-Reply-To: <3833BFE9.6FD118B1@lemburg.com> Message-ID: <1269187709-18981857@hypernet.com> Marc-Andre wrote: > Fredrik Lundh wrote: > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > three things: imputil, imputil, imputil > > But please don't add the current version as default importer... > its strategy is way too slow for real life apps (yes, I've tested > this: imports typically take twice as long as with the builtin > importer). 
I think imputil's emulation of the builtin importer is more of a demonstration than a serious implementation. As for speed, it depends on the test. > I'd opt for an import manager which provides a useful API for > import hooks to register themselves with. I think that rather than blindly chain themselves together, there should be a simple minded manager. This could let the programmer prioritize them. > What we really need > is not yet another complete reimplementation of what the > builtin importer does, but rather a more detailed exposure of > the various import aspects: finding modules and loading modules. The first clause I sort of agree with - the current implementation is a fine implementation of a filesystem directory based importer. I strongly disagree with the second clause. The current import hooks are just such a detailed exposure; and they are incomprehensible and unmanagable. I guess you want to tweak the "finding" part of the builtin import mechanism. But that's no reason to ask all importers to break themselves up into "find" and "load" pieces. It's a reason to ask that the standard importer be, in some sense, "subclassable" (ie, expose hooks, or perhaps be an extension class like thingie). - Gordon From jim at interet.com Thu Nov 18 15:39:20 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 09:39:20 -0500 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> Message-ID: <38340F98.212F61@interet.com> Gordon McMillan wrote: > > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. IMHO the current import mechanism is good for developers who must work on the library code in the directory tree, but a disaster for sysadmins who must distribute Python applications either internally to a number of machines or commercially. What we need is a standard Python library file like a Java "Jar" file. Imputil can support this as 130 lines of Python. I have also written one in C. I like the imputil approach, but if we want to add a library importer to import.c, I volunteer to write it. I don't want to just add more complicated and unmanageable hooks which people will all use different ways and just add to the confusion. It is easy to install packages by just making them into a library file and throwing it into a directory. So why aren't we doing it? Jim Ahlstrom From guido at CNRI.Reston.VA.US Thu Nov 18 16:30:28 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:30:28 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST." 
<1269187709-18981857@hypernet.com> References: <1269187709-18981857@hypernet.com> Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us> Gordon McMillan wrote: > Marc-Andre wrote: > > > Fredrik Lundh wrote: > > > > > Guido van Rossum <guido at CNRI.Reston.VA.US> wrote: > > > > - suggestions for new issues that maybe ought to be settled in 1.6 > > > > > > three things: imputil, imputil, imputil > > > > But please don't add the current version as default importer... > > its strategy is way too slow for real life apps (yes, I've tested > > this: imports typically take twice as long as with the builtin > > importer). > > I think imputil's emulation of the builtin importer is more of a > demonstration than a serious implementation. As for speed, it > depends on the test. Agreed. I like some of imputil's features, but I think the API need to be redesigned. > > I'd opt for an import manager which provides a useful API for > > import hooks to register themselves with. > > I think that rather than blindly chain themselves together, there > should be a simple minded manager. This could let the > programmer prioritize them. Indeed. (A list of importers has been suggested, to replace the list of directories currently used.) > > What we really need > > is not yet another complete reimplementation of what the > > builtin importer does, but rather a more detailed exposure of > > the various import aspects: finding modules and loading modules. > > The first clause I sort of agree with - the current > implementation is a fine implementation of a filesystem > directory based importer. > > I strongly disagree with the second clause. The current import > hooks are just such a detailed exposure; and they are > incomprehensible and unmanagable. Based on how many people have successfully written import hooks, I have to agree. :-( > I guess you want to tweak the "finding" part of the builtin > import mechanism. But that's no reason to ask all importers > to break themselves up into "find" and "load" pieces. It's a > reason to ask that the standard importer be, in some sense, > "subclassable" (ie, expose hooks, or perhaps be an extension > class like thingie). Agreed. Subclassing is a good way towards flexibility. And Jim Ahlstrom writes: > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. Unfortunately, you're right. :-( > What we need is a standard Python library file like a Java "Jar" > file. Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want to > add a library importer to import.c, I volunteer to write it. Please volunteer to design or at least review the grand architecture -- see below. > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. You're so right! > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Rhetorical question. :-) So here's a challenge: redesign the import API from scratch. Let me start with some requirements. 
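To fix an image of the kind of thing being talked about (purely illustrative, not a proposal -- the real design should fall out of the requirements below): a list of importer objects consulted in turn, much as sys.path is today.

import sys, imp

class FileSystemImporter:
    # the builtin behaviour, recast as just another importer object
    def import_module(self, name):
        try:
            file, pathname, description = imp.find_module(name)
        except ImportError:
            return None                    # not mine -- let the next importer try
        try:
            return imp.load_module(name, file, pathname, description)
        finally:
            if file:
                file.close()

def import_hook(name):
    for importer in sys.importers:         # hypothetical analogue of sys.path
        module = importer.import_module(name)
        if module is not None:
            return module
    raise ImportError("no module named " + name)

sys.importers = [FileSystemImporter()]     # a zip/jar importer would just be appended

Packages, caching and the rest are deliberately ignored here.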
Compatibility issues: --------------------- - the core API may be incompatible, as long as compatibility layers can be provided in pure Python - support for rexec functionality - support for freeze functionality - load .py/.pyc/.pyo files and shared libraries from files - support for packages - sys.path and sys.modules should still exist; sys.path might have a slightly different meaning - $PYTHONPATH and $PYTHONHOME should still be supported (I wouldn't mind a splitting up of importdl.c into several platform-specific files, one of which is chosen by the configure script; but that's a bit of a separate issue.) New features: ------------- - Integrated support for Greg Ward's distribution utilities (i.e. a module prepared by the distutil tools should install painlessly) - Good support for prospective authors of "all-in-one" packaging tool authors like Gordon McMillan's win32 installer or /F's squish. (But I *don't* require backwards compatibility for existing tools.) - Standard import from zip or jar files, in two ways: (1) an entry on sys.path can be a zip/jar file instead of a directory; its contents will be searched for modules or packages (2) a file in a directory that's on sys.path can be a zip/jar file; its contents will be considered as a package (note that this is different from (1)!) I don't particularly care about supporting all zip compression schemes; if Java gets away with only supporting gzip compression in jar files, so can we. - Easy ways to subclass or augment the import mechanism along different dimensions. For example, while none of the following features should be part of the core implementation, it should be easy to add any or all: - support for a new compression scheme to the zip importer - support for a new archive format, e.g. tar - a hook to import from URLs or other data sources (e.g. a "module server" imported in CORBA) (this needn't be supported through $PYTHONPATH though) - a hook that imports from compressed .py or .pyc/.pyo files - a hook to auto-generate .py files from other filename extensions (as currently implemented by ILU) - a cache for file locations in directories/archives, to improve startup time - a completely different source of imported modules, e.g. for an embedded system or PalmOS (which has no traditional filesystem) - Note that different kinds of hooks should (ideally, and within reason) properly combine, as follows: if I write a hook to recognize .spam files and automatically translate them into .py files, and you write a hook to support a new archive format, then if both hooks are installed together, it should be possible to find a .spam file in an archive and do the right thing, without any extra action. Right? - It should be possible to write hooks in C/C++ as well as Python - Applications embedding Python may supply their own implementations, default search path, etc., but don't have to if they want to piggyback on an existing Python installation (even though the latter is fraught with risk, it's cheaper and easier to understand). Implementation: --------------- - There must clearly be some code in C that can import certain essential modules (to solve the chicken-or-egg problem), but I don't mind if the majority of the implementation is written in Python. Using Python makes it easy to subclass. - In order to support importing from zip/jar files using compression, we'd at least need the zlib extension module and hence libz itself, which may not be available everywhere. 
- I suppose that the bootstrap is solved using a mechanism very similar to what freeze currently used (other solutions seem to be platform dependent). - I also want to still support importing *everything* from the filesystem, if only for development. (It's hard enough to deal with the fact that exceptions.py is needed during Py_Initialize(); I want to be able to hack on the import code written in Python without having to rebuild the executable all the time. Let's first complete the requirements gathering. Are these requirements reasonable? Will they make an implementation too complex? Am I missing anything? Finally, to what extent does this impact the desire for dealing differently with the Python bytecode compiler (e.g. supporting optimizers written in Python)? And does it affect the desire to implement the read-eval-print loop (the >>> prompt) in Python? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 16:37:49 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 10:37:49 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100." <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us> > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This is easily taken care of by translating each string of consecutive non-identifier-characters to an underscore, so this would import the iso_8859_1.py module. (I also noticed in an earlier post that the official name for Shift_JIS has an underscore, while most other encodings use hyphens.) > This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. But it could be enough to register a package where to look for encodings (in addition to the system package). Or there could be a registry for encoding search functions. (See the import discussion.) > I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any > user interaction. There would only have to be an API for > looking up an encoding published by the encodings package for > the Unicode implementation to use. The magic behind that API > is left to the encodings package... I think that the collection of encodings will eventually grow large enough to make it a requirement to avoid doing work proportional to the number of supported encodings at startup (or even when an encoding is referenced for the first time). Any "lazy" mechanism (of which module search is an example) will do. > BTW, nothing's wrong with your idea :-) In fact, I like it > a lot because it keeps the encoding modules out of the > top-level scope which is good. Yes. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, > e.g. stream ciphers or pickle mechanisms. The step in that > direction is not a big one: we'd only have to drop the specification > of the Unicode object in the spec and replace it with an arbitrary > object. Of course, this will still have to be a Unicode object > for use by the Unicode implementation. This is a step towards Java's architecture of stackable streams. 
But I'm always in favor of tackling what we know we need before tackling the most generalized version of the problem. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal at lemburg.com Thu Nov 18 16:52:26 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 16:52:26 +0100 Subject: [Python-Dev] Python 1.6 status References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com> Message-ID: <383420BA.EF8A6AC5@lemburg.com> [imputil and friends] "James C. Ahlstrom" wrote: > > IMHO the current import mechanism is good for developers who must > work on the library code in the directory tree, but a disaster > for sysadmins who must distribute Python applications either > internally to a number of machines or commercially. What we > need is a standard Python library file like a Java "Jar" file. > Imputil can support this as 130 lines of Python. I have also > written one in C. I like the imputil approach, but if we want > to add a library importer to import.c, I volunteer to write it. > > I don't want to just add more complicated and unmanageable hooks > which people will all use different ways and just add to the > confusion. > > It is easy to install packages by just making them into a library > file and throwing it into a directory. So why aren't we doing it? Perhaps we ought to rethink the strategy under a different light: what are the real requirement we have for Python imports ? Perhaps the outcome is only the addition of say one or two features and those can probably easily be added to the builtin system... then we can just forget about the whole import hook dilema for quite a while (AFAIK, this is how we got packages into the core -- people weren't happy with the import hook). Well, just an idea... I have other threads to follow :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake at acm.org Thu Nov 18 17:01:47 1999 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us> M.-A. Lemburg writes: > The problem is that the encoding names are not Python identifiers, > e.g. iso-8859-1 is allowed as identifier. This and > the fact that applications may want to ship their own codecs (which > do not get installed under the system wide encodings package) > make the registry necessary. This isn't a substantial problem. Try this on for size (probably not too different from what everyone is already thinking, but let's make it clear). This could be in encodings/__init__.py; I've tried to be really clear on the names. (No testing, only partially complete.) 
------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode, make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]
    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError(encoding, "unknown encoding")
    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]
    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode, make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------
> I don't see a problem with the registry though -- the encodings > package can take care of the registration process without any No problem at all; we just need to make sure the right magic is there for the "normal" case. > PS: we could probably even take the whole codec idea one step > further and also allow other input/output formats to be registered, File formats are different from text encodings, so let's keep them separate. Yes, a registry can be a good approach whenever the various things being registered are sufficiently similar semantically, but the behavior of the registry/lookup can be very different for each type of thing. Let's not over-generalize. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From fdrake at acm.org Thu Nov 18 17:02:45 1999 From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: <3833E5EC.AAFE5016@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us> Er, I should note that the sample code I just sent makes use of string methods. ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> Corporation for National Research Initiatives From mal at lemburg.com Thu Nov 18 17:23:09 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 17:23:09 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> Message-ID: <383427ED.45A01BBB@lemburg.com> Guido van Rossum wrote: > > > The problem is that the encoding names are not Python identifiers, > > e.g. iso-8859-1 is allowed as identifier. > > This is easily taken care of by translating each string of consecutive > non-identifier-characters to an underscore, so this would import the > iso_8859_1.py module. (I also noticed in an earlier post that the > official name for Shift_JIS has an underscore, while most other > encodings use hyphens.) Right. That's one way of doing it. > > This and > > the fact that applications may want to ship their own codecs (which > > do not get installed under the system wide encodings package) > > make the registry necessary. > > But it could be enough to register a package where to look for > encodings (in addition to the system package). > > Or there could be a registry for encoding search functions. (See the > import discussion.) Like a path of search functions ? Not a bad idea... I will still want the internal dict for caching purposes though. I'm not sure how often these encodings will be looked up, but even a few hundred function calls will slow down the Unicode implementation quite a bit. The implementation could proceed as follows:

    def lookup(encoding):
        codecs = _internal_dict.get(encoding, None)
        if codecs:
            return codecs
        for query in sys.encoders:
            codecs = query(encoding)
            if codecs:
                break
        else:
            raise UnicodeError, 'unknown encoding: %s' % encoding
        _internal_dict[encoding] = codecs
        return codecs

For simplicity, codecs should be a tuple (encoder, decoder, stream_writer, stream_reader) of factory functions. ...that is if we can agree on these 4 APIs :-) Here are my current versions:
-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders. """

    def __init__(self, errors='strict'):

        """ Creates a Codec instance.

            The Codec may implement different error handling schemes
            by providing the errors argument. These parameters are
            defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be
                                    a Unicode character)

        """
        self.errors = errors

    def encode(self, u, slice=None):

        """ Return the Unicode object u encoded as Python string.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is encoded.

            The method may not store state in the Codec instance.
            Use StreamCodec for codecs which have to keep state in
            order to make encoding/decoding efficient.

        """
        ...

    def decode(self, s, offset=0):

        """ Decodes data from the Python string s and returns a tuple
            (Unicode object, bytes consumed).

            If offset is given, the decoding process starts at
            s[offset]. It defaults to 0.

            The method may not store state in the Codec instance.
            Use StreamCodec for codecs which have to keep state in
            order to make encoding/decoding efficient.

        """
        ...

StreamWriter and StreamReader define the interface for stateful encoders/decoders:

class StreamWriter(Codec):

    def __init__(self, stream, errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors argument. These parameters
            are defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be
                                    a Unicode character)

        """
        self.stream = stream

    def write(self, u, slice=None):

        """ Writes the Unicode object's contents encoded to
            self.stream and returns the number of bytes written.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.

        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def flush(self):

        """ Flushes the codec buffers used for keeping state.

            Return values are not defined. Implementations are free to
            return None, raise an exception (in case there is pending
            data in the buffers which could not be decoded) or return
            any remaining data from the state buffers used.

        """
        pass

class StreamReader(Codec):

    def __init__(self, stream, errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors argument. These parameters
            are defined:

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character) - replace erroneous characters with
                                    the given character (may also be
                                    a Unicode character)

        """
        self.stream = stream

    def read(self, chunksize=0):

        """ Decodes data from the stream self.stream and returns a
            tuple (Unicode object, bytes consumed).

            chunksize indicates the approximate maximum number of
            bytes to read from the stream for decoding purposes. The
            decoder can modify this setting as appropriate. The
            default value 0 indicates to read and decode as much as
            possible. The chunksize is intended to prevent having to
            decode huge files in one step.

        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

    def flush(self):

        """ Flushes the codec buffers used for keeping state.

            Return values are not defined. Implementations are free to
            return None, raise an exception (in case there is pending
            data in the buffers which could not be decoded) or return
            any remaining data from the state buffers used.

        """

In addition to the above methods, the StreamWriter and StreamReader instances should also provide access to all other methods defined for the stream object. Stream codecs are free to combine the StreamWriter and StreamReader interfaces into one class.
-----------------------------------------------------------------------
> > I don't see a problem with the registry though -- the encodings > > package can take care of the registration process without any > > user interaction. There would only have to be an API for > > looking up an encoding published by the encodings package for > > the Unicode implementation to use. The magic behind that API > > is left to the encodings package...
> > I think that the collection of encodings will eventually grow large > enough to make it a requirement to avoid doing work proportional to > the number of supported encodings at startup (or even when an encoding > is referenced for the first time). Any "lazy" mechanism (of which > module search is an example) will do. Right. The list of search functions should provide this kind of laziness. It also provides ways to implement other strategies to look for codecs, e.g. PIL could provide such a search function for its codecs, mxCrypto for the included ciphers, etc. > > BTW, nothing's wrong with your idea :-) In fact, I like it > > a lot because it keeps the encoding modules out of the > > top-level scope which is good. > > Yes. > > > PS: we could probably even take the whole codec idea one step > > further and also allow other input/output formats to be registered, > > e.g. stream ciphers or pickle mechanisms. The step in that > > direction is not a big one: we'd only have to drop the specification > > of the Unicode object in the spec and replace it with an arbitrary > > object. Of course, this will still have to be a Unicode object > > for use by the Unicode implementation. > > This is a step towards Java's architecture of stackable streams. > > But I'm always in favor of tackling what we know we need before > tackling the most generalized version of the problem. Well, I just wanted to mention the possibility... might be something to look into next year. I find it rather thrilling to be able to create encrypted streams by just hooking together a few stream codecs...

    f = open('myfile.txt', 'w')
    CipherWriter = sys.codec('rc5-cipher')[3]
    sf = CipherWriter(f, key='xxxxxxxx')
    UTF8Writer = sys.codec('utf-8')[3]
    sfx = UTF8Writer(sf)
    sfx.write('asdfasdfasdfasdf')
    sfx.close()

Hmm, we should probably define the additional constructor arguments to be keyword arguments... writers/readers other than Unicode ones will probably need different kinds of parameters (such as the key in the above example). Ahem, ...I'm getting distracted here :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From bwarsaw at cnri.reston.va.us Thu Nov 18 17:23:41 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST) Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <14388.8997.703108.401808@weyr.cnri.reston.va.us> Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us> >>>>> "Fred" == Fred L Drake, Jr <fdrake at acm.org> writes: Fred> Er, I should note that the sample code I just sent makes Fred> use of string methods. ;) Yay! From guido at CNRI.Reston.VA.US Thu Nov 18 17:37:08 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:37:08 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100." <383427ED.45A01BBB@lemburg.com> References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us> > Like a path of search functions ? Not a bad idea... I will still > want the internal dict for caching purposes though.
I'm not sure > how often these encodings will be, but even a few hundred function > call will slow down the Unicode implementation quite a bit. Of course. (It's like sys.modules caching the results of an import). [...] > def flush(self): > > """ Flushed the codec buffers used for keeping state. > > Returns values are not defined. Implementations are free to > return None, raise an exception (in case there is pending > data in the buffers which could not be decoded) or > return any remaining data from the state buffers used. > > """ I don't know where this came from, but a flush() should work like flush() on a file. It doesn't return a value, it just sends any remaining data to the underlying stream (for output). For input it shouldn't be supported at all. The idea is that flush() should do the same to the encoder state that close() followed by a reopen() would do. Well, more or less. But if the process were to be killed right after a flush(), the data written to disk should be a complete encoding, and not have a lingering shift state. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 17:59:06 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 11:59:06 -0500 Subject: [Python-Dev] Codecs and StreamCodecs In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100." <3833BDDC.7CD2CC1F@lemburg.com> References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com> <3833BDDC.7CD2CC1F@lemburg.com> Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us> [Responding to some lingering mails] [/F] > > >>> u = unicode("? i ?a ? e ?", "iso-latin-1") > > >>> s = u.encode("html-entities") > > >>> d = decoder("html-entities") > > >>> d.decode(s[:-1]) > > "? i ?a ? e " > > >>> d.flush() > > "?" [MAL] > Ah, ok. So the .flush() method checks for proper > string endings and then either returns the remaining > input or raises an error. No, please. See my previous post on flush(). > > input: read chunks of data, decode, and > > keep extra data in a local buffer. > > > > output: encode data into suitable chunks, > > and write to the output stream (that's why > > there's a buffersize argument to encode -- > > if someone writes a 10mb unicode string to > > an encoded stream, python shouldn't allocate > > an extra 10-30 megabytes just to be able to > > encode the darn thing...) > > So the stream codecs would be wrappers around the > string codecs. No -- the other way around. Think of the stream encoder as a little FSM engine that you feed with unicode characters and which sends bytes to the backend stream. When a unicode character comes in that requires a particular shift state, and the FSM isn't in that shift state, it emits the escape sequence to enter that shift state first. It should use standard buffered writes to the output stream; i.e. one call to feed the encoder could cause several calls to write() on the output stream, or vice versa (if you fed the encoder a single character it might keep it in its own buffer). That's all up to the codec implementation. The flush() forces the FSM into the "neutral" shift state, possibly writing an escape sequence to leave the current shift state, and empties the internal buffer. The string codec CONCEPTUALLY uses the stream codec to a cStringIO object, using flush() to force the final output. 
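To make that conceptual layering concrete, here is a minimal sketch (purely illustrative; it assumes a stream-encoder factory along the lines of the draft API above, and a real implementation may well take the shortcut discussed next):

    from cStringIO import StringIO

    def encode_via_stream(u, stream_encoder_factory):
        # Feed the whole Unicode object through a stream encoder that
        # writes into an in-memory buffer; flush() forces out any
        # pending data and closing shift sequences.
        sio = StringIO()
        encoder = stream_encoder_factory(sio)
        encoder.write(u)
        encoder.flush()
        return sio.getvalue()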
However the implementation may take a shortcut. For stateless encodings the stream codec may call on the string codec, but that's all an implementation issue. For input, things are slightly different (you don't know how much encoded data you must read to give you N Unicode characters, so you may have to make a guess and hold on to some data that you read unnecessarily -- either in encoded form or in Unicode form, at the discretion of the implementation. Using seek() on the input stream is forbidden (it could be a pipe or socket). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 18:11:51 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 12:11:51 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100." <3833C952.C6F154B1@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us> > > > Now how should we define ur"abc\u1234\n" ... ? > > > > If strings carried an encoding tag with them, the obvious answer is that > > this acts exactly like r"abc\u1234\n" acts today except gets a > > "unicode-escaped" encoding tag instead of a "[whatever the default is > > today]" encoding tag. > > > > If strings don't carry an encoding tag with them, you're in a bit of a > > pickle: you'll have to convert it to a regular string or a Unicode string, > > but in either case have no way to communicate that it may need further > > processing; i.e., no way to distinguish it from a regular or Unicode string > > produced by any other mechanism. The code I posted yesterday remains my > > best answer to that unpleasant puzzle (i.e., produce a Unicode string, > > fiddling with backslashes just enough to get the \u escapes expanded, in the > > same way Java's (conceptual) preprocessor does it). > > They don't have such tags... so I guess we're in trouble ;-) > > I guess to make ur"" have a meaning at all, we'd need to go > the Java preprocessor way here, i.e. scan the string *only* > for \uXXXX sequences, decode these and convert the rest as-is > to Unicode ordinals. > > Would that be ok ? Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip at mojam.com Thu Nov 18 18:34:51 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST) Subject: [Python-Dev] just say no... In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> Message-ID: <14388.14523.158050.594595@dolphin.mojam.com> >> FYI, the next version of the proposal ... File objects opened in >> text mode will use "t#" and binary ones use "s#". Tim> Am I the only one who sees magical distinctions between text and Tim> binary mode as a Really Bad Idea? No. Tim> I wouldn't have guessed the Unix natives here would quietly Tim> acquiesce to importing a bit of Windows madness <wink>. We figured you and Guido would come to our rescue... 
;-) Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mal at lemburg.com Thu Nov 18 19:15:54 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:15:54 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> Message-ID: <3834425A.8E9C3B7E@lemburg.com> FYI, I've uploaded a new version of the proposal which includes new codec APIs, a new codec search mechanism and some minor fixes here and there. The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? Unicode objects support for %-formatting ? Design of the internal C API and the Python API for the Unicode character properties database -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 19:32:49 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:32:49 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <38344651.960878A2@lemburg.com> Guido van Rossum wrote: > > > I guess to make ur"" have a meaning at all, we'd need to go > > the Java preprocessor way here, i.e. scan the string *only* > > for \uXXXX sequences, decode these and convert the rest as-is > > to Unicode ordinals. > > > > Would that be ok ? > > Read Tim's code (posted about 40 messages ago in this list). I did, but wasn't sure whether he was argueing for going the Java way... > Like Java, it interprets \u.... when the number of backslashes is odd, > but not when it's even. So \\u.... returns exactly that, while > \\\u.... returns two backslashes and a unicode character. > > This is nice and can be done regardless of whether we are going to > interpret other \ escapes or not. So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Thu Nov 18 19:38:41 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Thu, 18 Nov 1999 19:38:41 +0100 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> Message-ID: <383447B1.1B7B594C@lemburg.com> Would this definition be fine ? """ u = ur'<raw-unicode-escape encoded Python string>' The 'raw-unicode-escape' encoding is defined as follows: ? \uXXXX sequence represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd ? all other characters represent themselves as Unicode ordinal (e.g. 
'b' -> U+0062) """ -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido at CNRI.Reston.VA.US Thu Nov 18 19:46:35 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:46:35 -0500 Subject: [Python-Dev] just say no... In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST." <14388.14523.158050.594595@dolphin.mojam.com> References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim> <14388.14523.158050.594595@dolphin.mojam.com> Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us> > >> FYI, the next version of the proposal ... File objects opened in > >> text mode will use "t#" and binary ones use "s#". > > Tim> Am I the only one who sees magical distinctions between text and > Tim> binary mode as a Really Bad Idea? > > No. > > Tim> I wouldn't have guessed the Unix natives here would quietly > Tim> acquiesce to importing a bit of Windows madness <wink>. > > We figured you and Guido would come to our rescue... ;-) Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 19:50:20 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:50:20 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100." <38344651.960878A2@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <38344651.960878A2@lemburg.com> Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us> > > Like Java, it interprets \u.... when the number of backslashes is odd, > > but not when it's even. So \\u.... returns exactly that, while > > \\\u.... returns two backslashes and a unicode character. > > > > This is nice and can be done regardless of whether we are going to > > interpret other \ escapes or not. > > So I'll take that as: this is what we want in Python too :-) I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at CNRI.Reston.VA.US Thu Nov 18 19:57:36 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 13:57:36 -0500 Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit) In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100." <383447B1.1B7B594C@lemburg.com> References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us> <383447B1.1B7B594C@lemburg.com> Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us> > Would this definition be fine ? > """ > > u = ur'<raw-unicode-escape encoded Python string>' > > The 'raw-unicode-escape' encoding is defined as follows: > > ? \uXXXX sequence represent the U+XXXX Unicode character if and > only if the number of leading backslashes is odd > > ? 
all other characters represent themselves as Unicode ordinal > (e.g. 'b' -> U+0062) > > """ Yes. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip at mojam.com Thu Nov 18 20:09:46 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST) Subject: [Python-Dev] Unicode Proposal: Version 0.7 In-Reply-To: <3834425A.8E9C3B7E@lemburg.com> References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <14388.20218.294814.234327@dolphin.mojam.com> I haven't been following this discussion closely at all, and have no previous experience with Unicode, so please pardon a couple stupid questions from the peanut gallery: 1. What does U+0061 mean (other than 'a')? That is, what is U? 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter description. Given a Unicode object with encoding e1, how do I write it to a file that is to be encoded with encoding e2? Seems like I would do something like u1 = unicode(s, encoding=e1) f = open("somefile", "wb") u2 = unicode(u1, encoding=e2) f.write(u2) Is that how it would be done? Does this question even make sense? 3. What will the impact be on programmers such as myself currently living with blinders on (that is, writing in plain old 7-bit ASCII)? Thx, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim at interet.com Thu Nov 18 20:23:53 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 14:23:53 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <38345249.4AFD91DA@interet.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Yes. > Are these > requirements reasonable? Will they make an implementation too > complex? I think you can get 90% of where you want to be with something much simpler. And the simpler implementation will be useful in the 100% solution, so it is not wasted time. How about if we just design a Python archive file format; provide code in the core (in Python or C) to import from it; provide a Python program to create archive files; and provide a Standard Directory to put archives in so they can be found quickly. For extensibility and control, we add functions to the imp module. Detailed comments follow: > Compatibility issues: > --------------------- > [list of current features...] Easily met by keeping the current C code. > > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) > > - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) These tools go well beyond just an archive file format, but hopefully a file format will help. Greg and Gordon should be able to control the format so it meets their needs. We need a standard format. 
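Just to illustrate the kind of thing I mean -- a rough sketch only, not imputil's actual interface, and the on-disk index format is deliberately left open -- an importer object for such a library file might boil down to:

    import imp, marshal, sys

    class LibraryImporter:
        def __init__(self, index):
            # 'index' maps dotted module names to (is_package, code_string);
            # how it is read from the archive file is a separate question.
            self.index = index

        def find(self, fqname):
            # "finding" is a cheap lookup with no side effects
            return self.index.get(fqname)

        def load(self, fqname):
            # "loading" creates and populates the module object
            entry = self.find(fqname)
            if entry is None:
                raise ImportError, 'no module named ' + fqname
            ispkg, code_string = entry
            code = marshal.loads(code_string)
            mod = imp.new_module(fqname)
            if ispkg:
                mod.__path__ = [fqname]
            sys.modules[fqname] = mod
            exec code in mod.__dict__
            return mod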
> - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages > > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) I don't like sys.path at all. It is currently part of the problem. I suggest that archive files MUST be put into a known directory. On Windows this is the directory of the executable, sys.executable. On Unix this $PREFIX plus version, namely "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). Other platforms can have different rules. We should also have the ability to append archive files to the executable or a shared library assuming the OS allows this (Windows and Linux do allow it). This is the first location searched, nails the archive to the interpreter, insulates us from an erroneous sys.path, and enables single-file Python programs. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. We don't need compression. The whole ./Lib is 1.2 Meg, and if we compress it to zero we save a Meg. Irrelevant. Installers provide compression anyway so when Python programs are shipped, they will be compressed then. Problems are that Python does not ship with compression, we will have to add it, we will have to support it and its current method of compression forever, and it adds complexity. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > [ List of new features including hooks...] Sigh, this proposal does not provide for this. It seems like a job for imputil. But if the file format and import code is available from the imp module, it can be used as part of the solution. > - support for a new compression scheme to the zip importer I guess compression should be easy to add if Python ships with a compression module. > - a cache for file locations in directories/archives, to improve > startup time If the Python library is available as an archive, I think startup will be greatly improved anyway. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. Yes. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. That's a good reason to omit compression. At least for now. > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). Yes, except that we need to be careful to preserve the freeze feature for users. We don't want to take it over. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. 
Yes, we need a function in imp to turn archives off: import imp imp.archiveEnable(0) > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I don't think it impacts these at all. Jim Ahlstrom From guido at CNRI.Reston.VA.US Thu Nov 18 20:55:02 1999 From: guido at CNRI.Reston.VA.US (Guido van Rossum) Date: Thu, 18 Nov 1999 14:55:02 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST." <38345249.4AFD91DA@interet.com> References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us> > I think you can get 90% of where you want to be with something > much simpler. And the simpler implementation will be useful in > the 100% solution, so it is not wasted time. Agreed, but I'm not sure that it addresses the problems that started this thread. I can't really tell, since the message starting the thread just requested imputil, without saying which parts of it were needed. A followup claimed that imputil was a fine prototype but too slow for real work. I inferred that flexibility was requested. But maybe that was projection since that was on my own list. (I'm happy with the performance and find manipulating zip or jar files clumsy, so I'm not too concerned about all the nice things you can *do* with that flexibility. :-) > How about if we just design a Python archive file format; provide > code in the core (in Python or C) to import from it; provide a > Python program to create archive files; and provide a Standard > Directory to put archives in so they can be found quickly. For > extensibility and control, we add functions to the imp module. > Detailed comments follow: > These tools go well beyond just an archive file format, but hopefully > a file format will help. Greg and Gordon should be able to control the > format so it meets their needs. We need a standard format. I think the standard format should be a subclass of zip or jar (which is itself a subclass of zip). We have already written (at CNRI, as yet unreleased) the necessary Python tools to manipulate zip archives; moreover 3rd party tools are abundantly available, both on Unix and on Windows (as well as in Java). Zip files also lend themselves to self-extracting archives and similar things, because the file index is at the end, so I think that Greg & Gordon should be happy. > I don't like sys.path at all. It is currently part of the problem. Eh? That's the first thing I hear something bad about it. Maybe that's because you live on Windows -- on Unix, search paths are ubiquitous. > I suggest that archive files MUST be put into a known directory. Why? Maybe this works on Windows; on Unix this is asking for trouble because it prevents users from augmenting the installation provided by the sysadmin. Even on newer Windows versions, users without admin perms may not be allowed to add files to that privileged directory. > On Windows this is the directory of the executable, sys.executable. > On Unix this $PREFIX plus version, namely > "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]). > Other platforms can have different rules. 
> > We should also have the ability to append archive files to the > executable or a shared library assuming the OS allows this > (Windows and Linux do allow it). This is the first location > searched, nails the archive to the interpreter, insulates us > from an erroneous sys.path, and enables single-file Python programs. OK for the executable. I'm not sure what the point is of appending an archive to the shared library? Anyway, does it matter (on Windows) if you add it to python16.dll or to python.exe? > We don't need compression. The whole ./Lib is 1.2 Meg, and if we > compress > it to zero we save a Meg. Irrelevant. Installers provide compression > anyway so when Python programs are shipped, they will be compressed > then. > > Problems are that Python does not ship with compression, we will > have to add it, we will have to support it and its current method > of compression forever, and it adds complexity. OK, OK. I think most zip tools have a way to turn off the compression. (Anyway, it's a matter of more I/O time vs. more CPU time; hardare for both is getting better faster than we can tweak the code :-) > Sigh, this proposal does not provide for this. It seems > like a job for imputil. But if the file format and import code > is available from the imp module, it can be used as part of the > solution. Well, the question is really if we want flexibility or archive files. I care more about the flexibility. If we get a clear vote for archive files, I see no problem with implementing that first. > If the Python library is available as an archive, I think > startup will be greatly improved anyway. Really? I know about all the system calls it makes, but I don't really see much of a delay -- I have a prompt in well under 0.1 second. --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein at lyra.org Thu Nov 18 23:03:55 1999 From: gstein at lyra.org (Greg Stein) Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <3833B588.1E31F01B@lemburg.com> Message-ID: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, M.-A. Lemburg wrote: > Tim Peters wrote: > > [MAL] > > > File objects opened in text mode will use "t#" and binary > > > ones use "s#". > > > > [Greg Stein] > > > ... > > > The real annoying thing would be to assume that opening a file as 'r' > > > means that I *meant* text mode and to start using "t#". > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > either; a lone "r" has always meant text mode. > > Em, I think you've got something wrong here: "t#" refers to the > parsing marker used for writing data to files opened in text mode. Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker. >... > I guess you won't notice any difference: strings define both > interfaces ("s#" and "t#") to mean the same thing. Only other > buffer compatible types may now fail to write to text files > -- which is not so bad, because it forces the programmer to > rethink what he really intended when opening the file in text > mode. It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down. > Besides, if you are writing portable scripts you should pay > close attention to "r" vs. "rb" anyway. I'm not writing portable scripts. I mentioned that once before. 
I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that repond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introduce text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Thu Nov 18 23:15:43 1999 From: skip at mojam.com (Skip Montanaro) Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST) Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> References: <3833B588.1E31F01B@lemburg.com> <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <14388.31375.296388.973848@dolphin.mojam.com> Greg> I'm not writing portable scripts. I mentioned that once before. I Greg> don't want a difference between 'r' and 'rb' on my Linux box. It Greg> was never there before, I'm lazy, and I don't want to see it added Greg> :-). ... Greg> I just don't like introduce text/binary to *nix platforms where it Greg> didn't exist before. I'll vote with Greg, Guido's cross-platform conversion not withstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other harder to identify and correct portability issues than binary file access anyway. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability. porta-nanny.py-anyone?-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From jim at interet.com Thu Nov 18 23:40:05 1999 From: jim at interet.com (James C. Ahlstrom) Date: Thu, 18 Nov 1999 17:40:05 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us> Message-ID: <38348045.BB95F783@interet.com> Guido van Rossum wrote: > I think the standard format should be a subclass of zip or jar (which > is itself a subclass of zip). We have already written (at CNRI, as > yet unreleased) the necessary Python tools to manipulate zip archives; > moreover 3rd party tools are abundantly available, both on Unix and on > Windows (as well as in Java). Zip files also lend themselves to > self-extracting archives and similar things, because the file index is > at the end, so I think that Greg & Gordon should be happy. Think about multiple packages in multiple zip files. The zip files store file directories. That means we would need a sys.zippath to search the zip files. I don't want another PYTHONPATH phenomenon. Greg Stein and I once discussed this (and Gordon I think). They argued that the directories should be flattened. That is, think of all directories which can be reached on PYTHONPATH. Throw away all initial paths. The resultant archive has *.pyc at the top level, as well as package directories only. The search path is "." in every archive file. No directory information is stored, only module names, some with dots. > > I don't like sys.path at all. It is currently part of the problem. > > Eh? 
That's the first thing I hear something bad about it. Maybe > that's because you live on Windows -- on Unix, search paths are > ubiquitous. On windows, just print sys.path. It is junk. A commercial distribution has to "just work", and it fails if a second installation (by someone else) changes PYTHONPATH to suit their app. I am trying to get to "just works", no excuses, no complications. > > I suggest that archive files MUST be put into a known directory. > > Why? Maybe this works on Windows; on Unix this is asking for trouble > because it prevents users from augmenting the installation provided by > the sysadmin. Even on newer Windows versions, users without admin > perms may not be allowed to add files to that privileged directory. It works on Windows because programs install themselves in their own subdirectories, and can put files there instead of /windows/system32. This holds true for Windows 2000 also. A Unix-style installation to /windows/system32 would (may?) require "administrator" privilege. On Unix you are right. I didn't think of that because I am the Unix sysadmin here, so I can put things where I want. The Windows solution doesn't fit with Unix, because executables go in a ./bin directory and putting library files there is a no-no. Hmmmm... This needs more thought. Anyone else have ideas?? > > We should also have the ability to append archive files to the > > executable or a shared library assuming the OS allows this > > OK for the executable. I'm not sure what the point is of appending an > archive to the shared library? Anyway, does it matter (on Windows) if > you add it to python16.dll or to python.exe? The point of using python16.dll is to append the Python library to it, and append to python.exe (or use files) for everything else. That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading to 1.7 means replacing only one file, and there is no wasted storage in multiple Lib's. I am thinking of multiple Python programs in different directories. But maybe you are right. On Windows, if python.exe can be put in /windows/system32 then it really doesn't matter. > OK, OK. I think most zip tools have a way to turn off the > compression. (Anyway, it's a matter of more I/O time vs. more CPU > time; hardare for both is getting better faster than we can tweak the > code :-) Well, if Python now has its own compression that is built in and comes with it, then that is different. Maybe compression is OK. > Well, the question is really if we want flexibility or archive files. > I care more about the flexibility. If we get a clear vote for archive > files, I see no problem with implementing that first. I don't like flexibility, I like standardization and simplicity. Flexibility just encourages users to do the wrong thing. Everyone vote please. I don't have a solid feeling about what people want, only what they don't like. > > If the Python library is available as an archive, I think > > startup will be greatly improved anyway. > > Really? I know about all the system calls it makes, but I don't > really see much of a delay -- I have a prompt in well under 0.1 > second. So do I. I guess I was just echoing someone else's complaint. JimA From mal at lemburg.com Fri Nov 19 00:28:31 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 00:28:31 +0100 Subject: [Python-Dev] file modes (was: just say no...) References: <Pine.LNX.4.10.9911181358380.10639-100000@nebula.lyra.org> Message-ID: <38348B9F.A31B09C4@lemburg.com> Greg Stein wrote: > > On Thu, 18 Nov 1999, M.-A. 
Lemburg wrote: > > Tim Peters wrote: > > > [MAL] > > > > File objects opened in text mode will use "t#" and binary > > > > ones use "s#". > > > > > > [Greg Stein] > > > > ... > > > > The real annoying thing would be to assume that opening a file as 'r' > > > > means that I *meant* text mode and to start using "t#". > > > > > > Isn't that exactly what MAL said would happen? Note that a "t" flag for > > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't > > > either; a lone "r" has always meant text mode. > > > > Em, I think you've got something wrong here: "t#" refers to the > > parsing marker used for writing data to files opened in text mode. > > Nope. We've got it right :-) > > Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to > refer to the parse marker. Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-) > >... > > I guess you won't notice any difference: strings define both > > interfaces ("s#" and "t#") to mean the same thing. Only other > > buffer compatible types may now fail to write to text files > > -- which is not so bad, because it forces the programmer to > > rethink what he really intended when opening the file in text > > mode. > > It *is* bad if it breaks my existing programs in subtle ways that are a > bitch to track down. > > > Besides, if you are writing portable scripts you should pay > > close attention to "r" vs. "rb" anyway. > > I'm not writing portable scripts. I mentioned that once before. I don't > want a difference between 'r' and 'rb' on my Linux box. It was never there > before, I'm lazy, and I don't want to see it added :-). > > Honestly, I don't know offhand of any Python types that repond to "s#" and > "t#" in different ways, such that changing file.write would end up writing > something different (and thereby breaking existing code). > > I just don't like introduce text/binary to *nix platforms where it didn't > exist before. Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#" so you wont see any change in function or find subtle errors being introduced. If you are already using the buffer feature for e.g. array which also implement "s#" but don't support "t#" for obvious reasons you'll run into trouble, but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't published yet, there are probably very few users out there who would actually run into any problems. And even if they do, its a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 00:41:32 1999 From: mal at lemburg.com (M.-A. 
Lemburg) Date: Fri, 19 Nov 1999 00:41:32 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.7 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com> Message-ID: <38348EAC.82B41A4D@lemburg.com> Skip Montanaro wrote: > > I haven't been following this discussion closely at all, and have no > previous experience with Unicode, so please pardon a couple stupid questions > from the peanut gallery: > > 1. What does U+0061 mean (other than 'a')? That is, what is U? U+XXXX means Unicode character with ordinal hex number XXXX. It is basically just another way to say, hey I want the Unicode character at position 0xXXXX in the Unicode spec. > 2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter > description. Given a Unicode object with encoding e1, how do I write > it to a file that is to be encoded with encoding e2? Seems like I > would do something like > > u1 = unicode(s, encoding=e1) > f = open("somefile", "wb") > u2 = unicode(u1, encoding=e2) > f.write(u2) > > Is that how it would be done? Does this question even make sense? The unicode() constructor converts all input to Unicode as basis for other conversions. In the above example, s would be converted to Unicode using the assumption that the bytes in s represent characters encoded using the encoding given in e1. The line with u2 would raise a TypeError, because u1 is not a string. To convert a Unicode object u1 to another encoding, you would have to call the .encode() method with the intended new encoding. The Unicode object will then take care of the conversion of its internal Unicode data into a string using the given encoding, e.g. you'd write: f.write(u1.encode(e2)) > 3. What will the impact be on programmers such as myself currently > living with blinders on (that is, writing in plain old 7-bit ASCII)? If you don't want your scripts to know about Unicode, nothing will really change. In case you do use e.g. Latin-1 characters in your scripts for strings, you are asked to include a pragma in the comment lines at the beginning of the script (so that programmers viewing your code using other encoding have a chance to figure out what you've written). Here's the text from the proposal: """ Note that you should provide some hint to the encoding you used to write your programs as pragma line in one the first few comment lines of the source file (e.g. '# source file encoding: latin-1'). If you only use 7-bit ASCII then everything is fine and no such notice is needed, but if you include Latin-1 characters not defined in ASCII, it may well be worthwhile including a hint since people in other countries will want to be able to read you source strings too. """ Other than that you can continue to use normal strings like you always have. Hope that clarifies things at least a bit, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond at skippinet.com.au Fri Nov 19 01:27:09 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 19 Nov 1999 11:27:09 +1100 Subject: [Python-Dev] file modes (was: just say no...) In-Reply-To: <38348B9F.A31B09C4@lemburg.com> Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat> [MAL] > If you are already using the buffer feature for e.g. 
array which > also implement "s#" but don't support "t#" for obvious reasons > you'll run into trouble, but then: arrays are binary data, > so changing from text mode to binary mode is well worth the > effort even if you just consider it a nuisance. Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. eg, the logic for processing "t#" could check if the text buffer is supported, and if not move back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users dont bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from simply "portable among all platforms" to "portable among all platforms and objects". Mark. From gmcm at hypernet.com Fri Nov 19 03:23:44 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Thu, 18 Nov 1999 21:23:44 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <38348045.BB95F783@interet.com> Message-ID: <1269144272-21594530@hypernet.com> [Guido] > > I think the standard format should be a subclass of zip or jar > > (which is itself a subclass of zip). We have already written > > (at CNRI, as yet unreleased) the necessary Python tools to > > manipulate zip archives; moreover 3rd party tools are > > abundantly available, both on Unix and on Windows (as well as > > in Java). Zip files also lend themselves to self-extracting > > archives and similar things, because the file index is at the > > end, so I think that Greg & Gordon should be happy. No problem (I created my own formats for relatively minor reasons). [JimA] > Think about multiple packages in multiple zip files. The zip > files store file directories. That means we would need a > sys.zippath to search the zip files. I don't want another > PYTHONPATH phenomenon. What if sys.path looked like: [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] > Greg Stein and I once discussed this (and Gordon I think). They > argued that the directories should be flattened. That is, think > of all directories which can be reached on PYTHONPATH. Throw > away all initial paths. The resultant archive has *.pyc at the > top level, as well as package directories only. The search path > is "." in every archive file. No directory information is > stored, only module names, some with dots. While I do flat archives (no dots, but that's a different story), there's no reason the archive couldn't be structured. Flat archives are definitely simpler. [JimA] > > > I don't like sys.path at all. It is currently part of the > > > problem. [Guido] > > Eh? That's the first thing I hear something bad about it. > > Maybe that's because you live on Windows -- on Unix, search > > paths are ubiquitous. > > On windows, just print sys.path. It is junk. A commercial > distribution has to "just work", and it fails if a second > installation (by someone else) changes PYTHONPATH to suit their > app. 
I am trying to get to "just works", no excuses, no > complications. Py_Initialize (); PyRun_SimpleString ("import sys; del sys.path[1:]"); Yeah, there's a hole there. Fixable if you could do a little pre- Py_Initialize twiddling. > > > I suggest that archive files MUST be put into a known > > > directory. No way. Hard code a directory? Overwrite someone else's Python "standalone"? Write to a C: partition that is deliberately sized to hold nothing but Windows? Make network installations impossible? > > Why? Maybe this works on Windows; on Unix this is asking for > > trouble because it prevents users from augmenting the > > installation provided by the sysadmin. Even on newer Windows > > versions, users without admin perms may not be allowed to add > > files to that privileged directory. > > It works on Windows because programs install themselves in their > own subdirectories, and can put files there instead of > /windows/system32. This holds true for Windows 2000 also. A > Unix-style installation to /windows/system32 would (may?) require > "administrator" privilege. There's nothing Unix-style about installing to /Windows/system32. 'Course *they* have symbolic links that actually work... > On Unix you are right. I didn't think of that because I am the > Unix sysadmin here, so I can put things where I want. The > Windows solution doesn't fit with Unix, because executables go in > a ./bin directory and putting library files there is a no-no. > Hmmmm... This needs more thought. Anyone else have ideas?? The official Windows solution is stuff in registry about app paths and such. Putting the dlls in the exe's directory is a workaround which works and is more managable than the official solution. > > > We should also have the ability to append archive files to > > > the executable or a shared library assuming the OS allows > > > this That's a handy trick on Windows, but it's got nothing to do with Python. > > Well, the question is really if we want flexibility or archive > > files. I care more about the flexibility. If we get a clear > > vote for archive files, I see no problem with implementing that > > first. > > I don't like flexibility, I like standardization and simplicity. > Flexibility just encourages users to do the wrong thing. I've noticed that the people who think there should only be one way to do things never agree on what it is. > Everyone vote please. I don't have a solid feeling about > what people want, only what they don't like. Flexibility. You can put Christian's favorite Einstein quote here too. > > > If the Python library is available as an archive, I think > > > startup will be greatly improved anyway. > > > > Really? I know about all the system calls it makes, but I > > don't really see much of a delay -- I have a prompt in well > > under 0.1 second. > > So do I. I guess I was just echoing someone else's complaint. Install some stuff. Deinstall some of it. Repeat (mixing up the order) until your registry and hard drive are shattered into tiny little fragments. It doesn't take long (there's lots of stuff a defragmenter can't touch once it's there). - Gordon From mal at lemburg.com Fri Nov 19 10:08:44 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:08:44 +0100 Subject: [Python-Dev] file modes (was: just say no...) References: <003401bf3224$d231be30$0501a8c0@bobcat> Message-ID: <3835139C.344F3EEE@lemburg.com> Mark Hammond wrote: > > [MAL] > > > If you are already using the buffer feature for e.g. 
array which > > also implement "s#" but don't support "t#" for obvious reasons > > you'll run into trouble, but then: arrays are binary data, > > so changing from text mode to binary mode is well worth the > > effort even if you just consider it a nuisance. > > Breaking existing code that works should be considered more than a > nuisance. Its an error that pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only effect platforms that don't differ between text and binary mode, e.g. Unix ones. > However, one answer would be to have "t#" _prefer_ to use the text > buffer, but not insist on it. eg, the logic for processing "t#" could > check if the text buffer is supported, and if not move back to the > blob buffer. I doubt that this is conform to what the buffer interface want's to reflect: if the getcharbuf slot is not implemented this means "I am not text". If you would write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'. > This should mean that all existing code still works, except for > objects that support both buffers to mean different things. AFAIK > there are no objects that qualify today, so it should work fine. Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this. > Unix users _will_ need to revisit their thinking about "text mode" vs > "binary mode" when writing these new objects (such as Unicode), but > IMO that is more than reasonable - Unix users dont bother qualifying > the open mode of their files, simply because it has no effect on their > files. If for certain objects or requirements there _is_ a > distinction, then new code can start to think these issues through. > "Portable File IO" will simply be extended from simply "portable among > all platforms" to "portable among all platforms and objects". Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 10:56:03 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:56:03 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us> <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us> Message-ID: <38351EB3.153FCDFC@lemburg.com> Guido van Rossum wrote: > > > Like a path of search functions ? Not a bad idea... I will still > > want the internal dict for caching purposes though. I'm not sure > > how often these encodings will be, but even a few hundred function > > call will slow down the Unicode implementation quite a bit. > > Of course. (It's like sys.modules caching the results of an import). I've fixed the "path of search functions" approach in the latest version of the spec. > [...] > > def flush(self): > > > > """ Flushed the codec buffers used for keeping state. > > > > Returns values are not defined. 
Implementations are free to > > return None, raise an exception (in case there is pending > > data in the buffers which could not be decoded) or > > return any remaining data from the state buffers used. > > > > """ > > I don't know where this came from, but a flush() should work like > flush() on a file. It came from Fredrik's proposal. > It doesn't return a value, it just sends any > remaining data to the underlying stream (for output). For input it > shouldn't be supported at all. > > The idea is that flush() should do the same to the encoder state that > close() followed by a reopen() would do. Well, more or less. But if > the process were to be killed right after a flush(), the data written > to disk should be a complete encoding, and not have a lingering shift > state. Ok. I've modified the API as follows: StreamWriter: def flush(self): """ Flushes and resets the codec buffers used for keeping state. Calling this method should ensure that the data on the output is put into a clean state, that allows appending of new fresh data without having to rescan the whole stream to recover state. """ pass StreamReader: def read(self,chunksize=0): """ Decodes data from the stream self.stream and returns a tuple (Unicode object, bytes consumed). chunksize indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value 0 indicates to read and decode as much as possible. The chunksize is intended to prevent having to decode huge files in one step. The method should use a greedy read strategy meaning that it should read as much data as is allowed within the definition of the encoding and the given chunksize, e.g. if optional encoding endings or state markers are available on the stream, these should be read too. """ ... the base class should provide a default implementation of this method using self.decode ... def reset(self): """ Resets the codec buffers used for keeping state. Note that no stream repositioning should take place. This method is primarely intended to recover from decoding errors. """ pass The .reset() method replaces the .flush() method on StreamReaders. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Fri Nov 19 10:22:48 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Fri, 19 Nov 1999 10:22:48 +0100 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <383516E8.EE66B527@lemburg.com> Guido van Rossum wrote: > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? Since you were asking: I would like functionality equivalent to my latest import patch for a slightly different lookup scheme for module import inside packages to become a core feature. 
If it becomes a core feature I promise to never again start threads about relative imports :-) Here's the summary again: """ [The patch] changes the default import mechanism to work like this: >>> import d # from directory a/b/c/ try a.b.c.d try a.b.d try a.d try d fail instead of just doing the current two-level lookup: >>> import d # from directory a/b/c/ try a.b.c.d try d fail As a result, relative imports referring to higher level packages work out of the box without any ugly underscores in the import name. Plus the whole scheme is pretty simple to explain and straightforward. """ You can find the patch attached to the message "Walking up the package hierarchy" in the python-dev mailing list archive. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Fri Nov 19 14:01:04 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST) Subject: [Python-Dev] Codecs and StreamCodecs Message-ID: <19991119130104.21726.rocketmail@ web605.yahoomail.com> --- "M.-A. Lemburg" <mal at lemburg.com> wrote: > Guido van Rossum wrote: > > I don't know where this came from, but a flush() > should work like > > flush() on a file. > > It came from Fredrik's proposal. > > > It doesn't return a value, it just sends any > > remaining data to the underlying stream (for > output). For input it > > shouldn't be supported at all. > > > > The idea is that flush() should do the same to the > encoder state that > > close() followed by a reopen() would do. Well, > more or less. But if > > the process were to be killed right after a > flush(), the data written > > to disk should be a complete encoding, and not > have a lingering shift > > state. > This could be useful in real life. For example, iso-2022-jp has a 'single-byte-mode' and a 'double-byte-mode' with shift-sequences to separate them. The rule is that each line in the text file or email message or whatever must begin and end in single-byte mode. So I would take flush() to mean 'shift back to ASCII now'. Calling flush and reopen would thus "almost" get the same data across. I'm trying to think if it would be dangerous. Do web and ftp servers often call flush() in the middle of transmitting a block of text? - Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com From fredrik at pythonware.com Fri Nov 19 14:33:50 1999 From: fredrik at pythonware.com (Fredrik Lundh) Date: Fri, 19 Nov 1999 14:33:50 +0100 Subject: [Python-Dev] Codecs and StreamCodecs References: <19991119130104.21726.rocketmail@ web605.yahoomail.com> Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com> Andy Robinson <captainrobbo at yahoo.com> wrote: > So I would take flush() to mean 'shift back to > ASCII now'. if we're still talking about my "just one codec, please" proposal, that's exactly what encoder.flush should do. while decoder.flush should raise an ex- ception if you're still in double byte mode (at least if running in 'strict' mode). > Calling flush and reopen would thus "almost" get the > same data across. > > I'm trying to think if it would be dangerous. 
Do web > and ftp servers often call flush() in the middle of > transmitting a block of text? again, if we're talking about my proposal, these flush methods are only called by the string or stream wrappers, never by the applications. see the original post for de- tails. </F> From gstein at lyra.org Fri Nov 19 14:29:50 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> On Thu, 18 Nov 1999, Guido van Rossum wrote: > Gordon McMillan wrote: >... > > I think imputil's emulation of the builtin importer is more of a > > demonstration than a serious implementation. As for speed, it > > depends on the test. > > Agreed. I like some of imputil's features, but I think the API > need to be redesigned. It what ways? It sounds like you've applied some thought. Do you have any concrete ideas yet, or "just a feeling" :-) I'm working through some changes from JimA right now, and would welcome other suggestions. I think there may be some outstanding stuff from MAL, but I'm not sure (Marc?) >... > So here's a challenge: redesign the import API from scratch. I would suggest starting with imputil and altering as necessary. I'll use that viewpoint below. > Let me start with some requirements. > > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility layers > can be provided in pure Python Which APIs are you referring to? The "imp" module? The C functions? The __import__ and reload builtins? I'm guessing some of imp, the two builtins, and only one or two C functions. > - support for rexec functionality No problem. I can think of a number of ways to do this. > - support for freeze functionality No problem. A function in "imp" must be exposed to Python to support this within the imputil framework. > - load .py/.pyc/.pyo files and shared libraries from files No problem. Again, a function is needed for platform-specific loading of shared libraries. > - support for packages No problem. Demo's in current imputil. > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning I would suggest that both retain their *exact* meaning. We introduce sys.importers -- a list of importers to check, in sequence. The first importer on that list uses sys.path to look for and load modules. The second importer loads builtins and frozen code (i.e. modules not on sys.path). Users can insert/append new importers or alter sys.path as before. sys.modules continues to record name:module mappings. > - $PYTHONPATH and $PYTHONHOME should still be supported No problem. > (I wouldn't mind a splitting up of importdl.c into several > platform-specific files, one of which is chosen by the configure > script; but that's a bit of a separate issue.) Easy enough. The standard importer can select the appropriate platform-specific module/function to perform the load. i.e. these can move to Modules/ and be split into a module-per-platform. > New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. a > module prepared by the distutil tools should install painlessly) I don't know the specific requirements/functionality that would be required here (does Greg? :-), but I can't imagine any problem with this. 
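To make the proposed sys.importers idea concrete, here is a minimal sketch of the lookup loop it implies. Everything below is illustrative only: sys.importers does not exist yet, and the import_module() method name is an assumption for the example, not part of imputil or any released API.

    import sys

    def hooked_import(name):
        # Already imported?  sys.modules keeps its usual meaning.
        try:
            return sys.modules[name]
        except KeyError:
            pass
        # Ask each installed importer, in order, for the module.
        for importer in sys.importers:
            module = importer.import_module(name)
            if module is not None:
                sys.modules[name] = module
                return module
        raise ImportError("no importer found %s" % name)

The first importer on the list would consult sys.path exactly as today, so plain filesystem imports keep working unchanged.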
> - Good support for prospective authors of "all-in-one" packaging tool > authors like Gordon McMillan's win32 installer or /F's squish. (But > I *don't* require backwards compatibility for existing tools.) Um. *No* problem. :-) > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a directory; > its contents will be searched for modules or packages While this could easily be done, I might argue against it. Old apps/modules that process sys.path might get confused. If compatibility is not an issue, then "No problem." An alternative would be an Importer instance added to sys.importers that is configured for a specific archive (in other words, don't add the zip file to sys.path, add ZipImporter(file) to sys.importers). Another alternative is an Importer that looks at a "sys.py_archives" list. Or an Importer that has a py_archives instance attribute. > (2) a file in a directory that's on sys.path can be a zip/jar file; > its contents will be considered as a package (note that this is > different from (1)!) No problem. This will slow things down, as a stat() for *.zip and/or *.jar must be done, in addition to *.py, *.pyc, and *.pyo. > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip compression > in jar files, so can we. I presume we would support whatever zlib gives us, and no more. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should be > easy to add any or all: > > - support for a new compression scheme to the zip importer Presuming ZipImporter is a class (derived from Importer), then this ability is wholly dependent upon the author of ZipImporter providing the hook. The Importer class is already designed for subclassing (and its interface is very narrow, which means delegation is also *very* easy; see imputil.FuncImporter). > - support for a new archive format, e.g. tar A cakewalk. Gordon, JimA, and myself each have archive formats. :-) > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) No problem at all. > - a hook that imports from compressed .py or .pyc/.pyo files No problem at all. > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) No problem at all. > - a cache for file locations in directories/archives, to improve > startup time No problem at all. > - a completely different source of imported modules, e.g. for an > embedded system or PalmOS (which has no traditional filesystem) No problem at all. In each of the above cases, the Importer.get_code() method just needs to grab the byte codes from the XYZ data source. That data source can be cmopressed, across a network, on-the-fly generated, or whatever. Each importer can certainly create a cache based on its concept of "location". In some cases, that would be a mapping from module name to filesystem path, or to a URL, or to a compiled-in, frozen module. 
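As a rough picture of how narrow that single hook is, a one-directory importer might look like the sketch below. The class, the get_code() argument list and the "code object or None" return convention are assumptions chosen for the example rather than a description of imputil's actual interface.

    import os, marshal

    class DirectoryImporter:
        "Sketch: map a module name to a code object found in one directory."

        def __init__(self, dir):
            self.dir = dir

        def get_code(self, modname):
            path = os.path.join(self.dir, modname + '.pyc')
            if not os.path.exists(path):
                return None             # let the next importer try
            f = open(path, 'rb')
            f.read(8)                   # skip the 1.5-era .pyc header
            code = marshal.load(f)
            f.close()
            return code

A location cache would simply be a dictionary on the instance, mapping module names to paths and filled by one os.listdir() pass over self.dir.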
> - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to recognize > .spam files and automatically translate them into .py files, and you > write a hook to support a new archive format, then if both hooks are > installed together, it should be possible to find a .spam file in an > archive and do the right thing, without any extra action. Right? Ack. Very, very difficult. The imputil scheme combines the concept of locating/loading into one step. There is only one "hook" in the imputil system. Its semantic is "map this name to a code/module object and return it; if you don't have it, then return None." Your compositing example is based on the capabilities of the find-then-load paradigm of the existing "ihooks.py". One module finds something (foo.spam) and the other module loads it (by generating a .py). All is not lost, however. I can easily envision the get_code() hook as allowing any kind of return type. If it isn't a code or module object, then another hook is called to transform it. [ actually, I'd design it similarly: a *series* of hooks would be called until somebody transforms the foo.spam into a code/module object. ] The compositing would be limited ony by the (Python-based) Importer classes. For example, my ZipImporter might expect to zip up .pyc files *only*. Obviously, you would want to alter this to support zipping any file, then use the suffic to determine what to do at unzip time. > - It should be possible to write hooks in C/C++ as well as Python Use FuncImporter to delegate to an extension module. This is one of the benefits of imputil's single/narrow interface. > - Applications embedding Python may supply their own implementations, > default search path, etc., but don't have to if they want to piggyback > on an existing Python installation (even though the latter is > fraught with risk, it's cheaper and easier to understand). An application would have full control over the contents of sys.importers. For a restricted execution app, it might install an Importer that loads files from *one* directory only which is configured from a specific Win32 Registry entry. That importer could also refuse to load shared modules. The BuiltinImporter would still be present (although the app would certainly omit all but the necessary builtins from the build). Frozen modules could be excluded. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I don't > mind if the majority of the implementation is written in Python. > Using Python makes it easy to subclass. I posited once before that the cost of import is mostly I/O rather than CPU, so using Python should not be an issue. MAL demonstrated that a good design for the Importer classes is also required. Based on this, I'm a *strong* advocate of moving as much as possible into Python (to get Python's ease-of-coding with little relative cost). The (core) C code should be able to search a path for a module and import it. It does not require dynamic loading or packages. This will be used to import exceptions.py, then imputil.py, then site.py. The platform-specific module that perform dynamic-loading must be a statically linked module (in Modules/ ... it doesn't have to be in the Python/ directory). site.py can complete the bootstrap by setting up sys.importers with the appropriate Importer instances (this is where an application can define its own policy). 
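Under that bootstrap the policy step in site.py could be only a few lines. This is a sketch of the idea, not working code: PathImporter, BuiltinImporter and install() are placeholder names for whatever the Python-level machinery would really export.

    import sys
    import imputil    # assumed importable by the minimal C bootstrap

    # Search order: the traditional directory path first, then the
    # builtin and frozen modules.  An embedding application would build
    # a different list here to enforce its own policy.
    sys.importers = [imputil.PathImporter(sys.path),
                     imputil.BuiltinImporter()]

    imputil.install()   # take over __import__; later imports use the hook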
sys.path was initially set by the import.c bootstrap code (from the compiled-in path and environment variables). Note that imputil.py would not install any hooks when it is loaded. That is up to site.py. This implies the core C code will import a total of three modules using its builtin system. After that, the imputil mechanism would be importing everything (site.py would .install() an Importer which then takes over the __import__ hook). Further note that the "import" Python statement could be simplified to use only the hook. However, this would require the core importer to inject some module names into the imputil module's namespace (since it couldn't use an import statement until a hook was installed). While this simplification is "neat", it complicates the run-time system (the import statement is broken until a hook is installed). Therefore, the core C code must also support importing builtins. "sys" and "imp" are needed by imputil to bootstrap. The core importer should not need to deal with dynamic-load modules. To support frozen apps, the core importer would need to support loading the three modules as frozen modules. The builtin/frozen importing would be exposed thru "imp" for use by imputil for future imports. imputil would load and use the (builtin) platform-specific module to do dynamic-load imports. > - In order to support importing from zip/jar files using compression, > we'd at least need the zlib extension module and hence libz itself, > which may not be available everywhere. Yes. I don't see this as a requirement, though. We wouldn't start to use these by default, would we? Or insist on zlib being present? I see this as more along the lines of "we have provided a standardized Importer to do this, *provided* you have zlib support." > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to be > platform dependent). The bootstrap that I outlined above could be done in C code. The import code would be stripped down dramatically because you'll drop package support and dynamic loading. Alternatively, you could probably do the path-scanning in Python and freeze that into the interpreter. Personally, I don't like this idea as it would not buy you much at all (it would still need to return to C for accessing a number of scanning functions and module importing funcs). > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal with > the fact that exceptions.py is needed during Py_Initialize(); > I want to be able to hack on the import code written in Python > without having to rebuild the executable all the time. My outline above does not freeze anything. Everything resides in the filesystem. The C code merely needs a path-scanning loop and functions to import .py*, builtin, and frozen types of modules. If somebody nukes their imputil.py or site.py, then they return to Python 1.4 behavior where the core interpreter uses a path for importing (i.e. no packages). They lose dynamically-loaded module support. > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'm not a fan of the compositing due to it requiring a change to semantics that I believe are very useful and very clean. However, I outlined a possible, clean solution to do that (a secondary set of hooks for transforming get_code() return values). 
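That secondary hook chain could be as simple as a list of translator functions, tried in order until one of them turns the raw data into a code object. The names below are invented for the illustration; nothing like this exists in ihooks or imputil today.

    translators = []    # each: (filename, data) -> code object or None

    def spam_translator(filename, data):
        # Hypothetical .spam handler: treat the data as Python source.
        if filename[-5:] != '.spam':
            return None
        return compile(data, filename, 'exec')

    translators.append(spam_translator)

    def transform(filename, data):
        # Called when an importer's get_code() returned something that
        # is not yet a code or module object.
        for translator in translators:
            code = translator(filename, data)
            if code is not None:
                return code
        raise ImportError("no translator accepts %s" % filename)

The archive importer never needs to know about .spam; the two hooks combine the way the requirement asks, at the price of a second pass over the data.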
The requirements are otherwise reasonable to me, as I see that they can all be readily solved (i.e. they aren't burdensome). While this email may be long, I do not believe the resulting system would be complex. From the user-visible side of things, nothing would be changed. sys.path is still present and operates as before. They *do* have new functionality they can grow into, though (sys.importers). The underlying C code is simplified, and the platform-specific dynamic-load stuff can be distributed to distinct modules, as needed (e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c). > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? If the three startup files require byte-compilation, then you could have some issues (i.e. the byte-compiler must be present). Once you hit site.py, you have a "full" environment and can easily detect and import a read-eval-print loop module (i.e. why return to Python? just start things up right there). site.py can also install new optimizers as desired, a new Python-based parser or compiler, or whatever... If Python is built without a parser or compiler (I hope that's an option!), then the three startup modules would simply be frozen into the executable. Cheers, -g -- Greg Stein, http://www.lyra.org/ From bwarsaw at cnri.reston.va.us Fri Nov 19 17:30:15 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST) Subject: [Python-Dev] CVS log messages with diffs References: <199911161700.MAA02716@eric.cnri.reston.va.us> Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us> There was a suggestion to start augmenting the checkin emails to include the diffs of the checkin. This would let you keep a current snapshot of the tree without having to do a direct `cvs update'. I think I can add this without a ton of pain. It would not be optional however, and the emails would get larger (and some checkins could be very large). There's also the question of whether to generate unified or context diffs. Personally, I find context diffs easier to read; unified diffs are smaller but not by enough to really matter. So here's an informal poll. If you don't care either way, you don't need to respond. Otherwise please just respond to me and not to the list. 1. Would you like to start receiving diffs in the checkin messages? 2. If you answer `yes' to #1 above, would you prefer unified or context diffs? -Barry From bwarsaw at cnri.reston.va.us Fri Nov 19 18:04:51 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us> We had some discussion a while back about enabling thread support by default, if the underlying OS supports it obviously. I'd like to see that happen for 1.6. IIRC, this shouldn't be too hard -- just a few tweaks of the configure script (and who knows what for those minority platforms that don't use configure :). -Barry From akuchlin at mems-exchange.org Fri Nov 19 18:07:07 1999 From: akuchlin at mems-exchange.org (Andrew M. 
Kuchling) Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us> References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us> Barry A. Warsaw writes: >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. I'd like to see That reminds me... what about the free threading patches? Perhaps they should be added to the list of issues to consider for 1.6. -- A.M. Kuchling http://starship.python.net/crew/amk/ Oh, my fingers! My arms! My legs! My everything! Argh... -- The Doctor, in "Nightmare of Eden" From petrilli at amber.org Fri Nov 19 18:23:02 1999 From: petrilli at amber.org (Christopher Petrilli) Date: Fri, 19 Nov 1999 12:23:02 -0500 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500 References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us> Message-ID: <19991119122302.B23400@trump.amber.org> Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote: > Barry A. Warsaw writes: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. I'd like to see Yes pretty please! One of the biggest problems we have in the Zope world is that for some unknown reason, most of hte Linux RPMs don't have threading on in them, so people end up having to compile it anyway... while this is a silly thing, it does create problems, and means that we deal with a lot of "dumb" problems. > That reminds me... what about the free threading patches? Perhaps > they should be added to the list of issues to consider for 1.6. My recolection was that unfortunately MOST of the time, they actually slowed down things because of the number of locks involved... Guido can no doubt shed more light onto this, but... there was a reason. Chris -- | Christopher Petrilli | petrilli at amber.org From gmcm at hypernet.com Fri Nov 19 19:22:37 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 13:22:37 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us> References: Your message of "Thu, 18 Nov 1999 09:19:48 EST." <1269187709-18981857@hypernet.com> Message-ID: <1269086690-25057991@hypernet.com> [Guido] > Compatibility issues: > --------------------- > > - the core API may be incompatible, as long as compatibility > layers can be provided in pure Python Good idea. Question: we have keyword import, __import__, imp and PyImport_*. Which of those (if any) define the "core API"? [rexec, freeze: yes] > - load .py/.pyc/.pyo files and shared libraries from files Shared libraries? Might that not involve some rather shady platform-specific magic? If it can be kept kosher, I'm all for it; but I'd say no if it involved, um, undocumented features. > support for packages Absolutely. I'll just comment that the concept of package.__path__ is also affected by the next point. > > - sys.path and sys.modules should still exist; sys.path might > have a slightly different meaning > > - $PYTHONPATH and $PYTHONHOME should still be supported If sys.path changes meaning, should not $PYTHONPATH also? 
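Gordon's earlier example of a sys.path holding importer instances suggests one answer to that question: translate the $PYTHONPATH strings into objects exactly once, at startup, and leave the environment variable itself alone. DirImporter and ZlibImporter below are just the placeholder names from that example, with the bodies stubbed out.

    class DirImporter:
        # Stub: would search one directory for modules.
        def __init__(self, path):
            self.path = path

    class ZlibImporter:
        # Stub: would read compressed modules out of one archive file.
        def __init__(self, path):
            self.path = path

    def build_importers(entries):
        # entries: the strings from $PYTHONPATH, already split on the
        # platform's path separator.
        importers = []
        for entry in entries:
            if entry[-4:] == '.pyz':
                importers.append(ZlibImporter(entry))
            else:
                importers.append(DirImporter(entry))
        return importers

    # e.g.  sys.path = build_importers(sys.path)

$PYTHONPATH stays a plain list of names on disk; only the in-process representation changes meaning.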
> New features: > ------------- > > - Integrated support for Greg Ward's distribution utilities (i.e. > a > module prepared by the distutil tools should install > painlessly) I assume that this is mostly a matter of $PYTHONPATH and other path manipulation mechanisms? > - Good support for prospective authors of "all-in-one" packaging > tool > authors like Gordon McMillan's win32 installer or /F's squish. > (But I *don't* require backwards compatibility for existing > tools.) I guess you've forgotten: I'm that *really* tall guy <wink>. > - Standard import from zip or jar files, in two ways: > > (1) an entry on sys.path can be a zip/jar file instead of a > directory; > its contents will be searched for modules or packages I don't mind this, but it depends on whether sys.path changes meaning. > (2) a file in a directory that's on sys.path can be a zip/jar > file; > its contents will be considered as a package (note that > this is different from (1)!) But it's affected by the same considerations (eg, do we start with filesystem names and wrap them in importers, or do we just start with importer instances / specifications for importer instances). > I don't particularly care about supporting all zip compression > schemes; if Java gets away with only supporting gzip > compression in jar files, so can we. I think this is a matter of what zip compression is officially blessed. I don't mind if it's none; providing / creating zipped versions for platforms that support it is nearly trivial. > - Easy ways to subclass or augment the import mechanism along > different dimensions. For example, while none of the following > features should be part of the core implementation, it should > be easy to add any or all: > > - support for a new compression scheme to the zip importer > > - support for a new archive format, e.g. tar > > - a hook to import from URLs or other data sources (e.g. a > "module server" imported in CORBA) (this needn't be supported > through $PYTHONPATH though) Which begs the question of the meaning of sys.path; and if it's still filesystem names, how do you get one of these in there? > - a hook that imports from compressed .py or .pyc/.pyo files > > - a hook to auto-generate .py files from other filename > extensions (as currently implemented by ILU) > > - a cache for file locations in directories/archives, to > improve > startup time > > - a completely different source of imported modules, e.g. for > an > embedded system or PalmOS (which has no traditional > filesystem) > > - Note that different kinds of hooks should (ideally, and within > reason) properly combine, as follows: if I write a hook to > recognize .spam files and automatically translate them into .py > files, and you write a hook to support a new archive format, > then if both hooks are installed together, it should be > possible to find a .spam file in an archive and do the right > thing, without any extra action. Right? A bit of discussion: I've got 2 kinds of archives. One can contain anything & is much like a zip (and probably should be a zip). The other contains only compressed .pyc or .pyo. The latter keys contents by logical name, not filesystem name. No extensions, and when a package is imported, the code object returned is the __init__ code object, (vs returning None and letting the import mechanism come back and ask for package.__init__). When you're building an archive, you have to go thru the .py / .pyc / .pyo / is it a package / maybe compile logic anyway. 
Why not get it all over with, so that at runtime there's no choices to be made. Which means (for this kind of archive) that including somebody's .spam in your archive isn't a matter of a hook, but a matter of adding to the archive's build smarts. > - It should be possible to write hooks in C/C++ as well as Python > > - Applications embedding Python may supply their own > implementations, > default search path, etc., but don't have to if they want to > piggyback on an existing Python installation (even though the > latter is fraught with risk, it's cheaper and easier to > understand). A way of tweaking that which will become sys.path before Py_Initialize would be *most* welcome. > Implementation: > --------------- > > - There must clearly be some code in C that can import certain > essential modules (to solve the chicken-or-egg problem), but I > don't mind if the majority of the implementation is written in > Python. Using Python makes it easy to subclass. > > - In order to support importing from zip/jar files using > compression, > we'd at least need the zlib extension module and hence libz > itself, which may not be available everywhere. > > - I suppose that the bootstrap is solved using a mechanism very > similar to what freeze currently used (other solutions seem to > be platform dependent). There are other possibilites here, but I have only half- formulated ideas at the moment. The critical part for embedding is to be able to *completely* control all path related logic. > - I also want to still support importing *everything* from the > filesystem, if only for development. (It's hard enough to deal > with the fact that exceptions.py is needed during > Py_Initialize(); I want to be able to hack on the import code > written in Python without having to rebuild the executable all > the time. > > Let's first complete the requirements gathering. Are these > requirements reasonable? Will they make an implementation too > complex? Am I missing anything? I'll summarize as follows: 1) What "sys.path" means (and how it's construction can be manipulated) is critical. 2) See 1. > Finally, to what extent does this impact the desire for dealing > differently with the Python bytecode compiler (e.g. supporting > optimizers written in Python)? And does it affect the desire to > implement the read-eval-print loop (the >>> prompt) in Python? I can assure you that code.py runs fine out of an archive :-). - Gordon From gstein at lyra.org Fri Nov 19 22:06:14 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> [ taking the liberty to CC: this back to python-dev ] On Fri, 19 Nov 1999, David Ascher wrote: > > > (2) a file in a directory that's on sys.path can be a zip/jar file; > > > its contents will be considered as a package (note that this is > > > different from (1)!) > > > > No problem. This will slow things down, as a stat() for *.zip and/or *.jar > > must be done, in addition to *.py, *.pyc, and *.pyo. > > Aside: it strikes me that for Python programs which import lots of files, > 'front-loading' the stat calls could make sense. When you first look at a > directory in sys.path, you read the entire directory in memory, and > successive imports do a stat on the directory to see if it's changed, and > if not use the in-memory data. Or am I completely off my rocker here? Not at all. 
I thought of this last night after my email. Since the Importer can easily retain state, it can hold a cache of the directory listings. If it doesn't find the file in its cached state, then it can reload the information from disk. If it finds it in the cache, but not on disk, then it can remove the item from its cache. The problem occurs when you path is [A, B], the file is in B, and you add something to A on-the-fly. The cache might direct the importer at B, missing your file. Of course, with the appropriate caveats/warnings, the system would work quite well. It really only breaks during development (which is one reason why I didn't accept some caching changes to imputil from MAL; but that was for the Importer in there; Python's new Importer could have a cache). I'm also not quite sure what the cost of reading a directory is, compared to issuing a bunch of stat() calls. Each directory read is an opendir/readdir(s)/closedir. Note that the DBM approach is kind of similar, but will amortize this cost over many processes. Cheers, -g -- Greg Stein, http://www.lyra.org/ From Jasbahr at origin.EA.com Fri Nov 19 21:59:11 1999 From: Jasbahr at origin.EA.com (Asbahr, Jason) Date: Fri, 19 Nov 1999 14:59:11 -0600 Subject: [Python-Dev] Another 1.6 wish Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> My first Python-Dev post. :-) >We had some discussion a while back about enabling thread support by >default, if the underlying OS supports it obviously. What's the consensus about Python microthreads -- a likely candidate for incorporation in 1.6 (or later)? Also, we have a couple minor convenience functions for Python in an MSDEV environment, an exposure of OutputDebugString for writing to the DevStudio log window and a means of tripping DevStudio C/C++ layer breakpoints from Python code (currently experimental). The msvcrt module seems like a likely candidate for these, would these be welcome additions? Thanks, Jason Asbahr Origin Systems, Inc. jasbahr at origin.ea.com From gstein at lyra.org Fri Nov 19 22:35:34 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us> Message-ID: <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> On Fri, 19 Nov 1999, Barry A. Warsaw wrote: > There was a suggestion to start augmenting the checkin emails to > include the diffs of the checkin. This would let you keep a current > snapshot of the tree without having to do a direct `cvs update'. I've been using diffs-in-checkin for review, rather than to keep a local snapshot updated. I guess you use the email for this (procmail truly is frightening), but I think for most people it would be for purposes of review. >...context vs unifed... > So here's an informal poll. If you don't care either way, you don't > need to respond. Otherwise please just respond to me and not to the > list. > > 1. Would you like to start receiving diffs in the checkin messages? Absolutely. > 2. If you answer `yes' to #1 above, would you prefer unified or > context diffs? Don't care. I've attached an archive of the files that I use in my CVS repository to do emailed diffs. These came from Ken Coar (an Apache guy) as an extraction from the Apache repository. Yes, they do use Perl. I'm not a Perl guy, so I probably would break things if I tried to "fix" the scripts by converting them to Python (in fact, Greg Ward helped to improve log_accum.pl for me!). 
I certainly would not be adverse to Python versions of these files, or other cleanups. I trimmed down the "avail" file, leaving a few examples. It works with cvs_acls.pl to provide per-CVS-module read/write access control. I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other small projects out of this repository. It has been working quite well. Cheers, -g -- Greg Stein, http://www.lyra.org/ -------------- next part -------------- A non-text attachment was scrubbed... Name: cvs-for-barry.tar.gz Type: application/octet-stream Size: 9668 bytes Desc: URL: <http://mail.python.org/pipermail/python-dev/attachments/19991119/45a7f916/attachment-0001.obj> From bwarsaw at cnri.reston.va.us Fri Nov 19 22:45:14 1999 From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw) Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST) Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs References: <14389.31511.706588.20840@anthem.cnri.reston.va.us> <Pine.LNX.4.10.9911191310510.10639-101000@nebula.lyra.org> Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us> >>>>> "GS" == Greg Stein <gstein at lyra.org> writes: GS> I've been using diffs-in-checkin for review, rather than to GS> keep a local snapshot updated. Interesting; I hadn't though about this use for the diffs. GS> I've attached an archive of the files that I use in my CVS GS> repository to do emailed diffs. These came from Ken Coar (an GS> Apache guy) as an extraction from the Apache repository. Yes, GS> they do use Perl. I'm not a Perl guy, so I probably would GS> break things if I tried to "fix" the scripts by converting GS> them to Python (in fact, Greg Ward helped to improve GS> log_accum.pl for me!). I certainly would not be adverse to GS> Python versions of these files, or other cleanups. Well, we all know Greg Ward's one of those subversive types, but then again it's great to have (hopefully now-loyal) defectors in our camp, just to keep us honest :) Anyway, thanks for sending the code, it'll come in handy if I get stuck. Of course, my P**l skills are so rusted I don't think even an oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I can put them to. Besides, I already have a huge kludge that gets run on each commit, and I don't think it'll be too hard to add diff generation... IF the informal vote goes that way. -Barry From gmcm at hypernet.com Fri Nov 19 22:56:20 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 16:56:20 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> Message-ID: <1269073918-25826188@hypernet.com> [David Ascher got involuntarily forwarded] > > Aside: it strikes me that for Python programs which import lots > > of files, 'front-loading' the stat calls could make sense. > > When you first look at a directory in sys.path, you read the > > entire directory in memory, and successive imports do a stat on > > the directory to see if it's changed, and if not use the > > in-memory data. Or am I completely off my rocker here? I posted something here about dircache not too long ago. Essentially, I found it completely unreliable on NT and on Linux to stat the directory. There was some test code attached. 
- Gordon From gstein at lyra.org Fri Nov 19 23:09:36 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <19991119122302.B23400@trump.amber.org> Message-ID: <Pine.LNX.4.10.9911191359370.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Christopher Petrilli wrote: > Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote: > > Barry A. Warsaw writes: > > >We had some discussion a while back about enabling thread support by > > >default, if the underlying OS supports it obviously. I'd like to see Definitely. I think you still want a --disable-threads option, but the default really ought to include them. > Yes pretty please! One of the biggest problems we have in the Zope world > is that for some unknown reason, most of hte Linux RPMs don't have threading > on in them, so people end up having to compile it anyway... while this > is a silly thing, it does create problems, and means that we deal with > a lot of "dumb" problems. Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't actually had to build my own Python(!). Man... imagine that. After almost five years of using Linux/Python, I can actually rely on the OS getting it right! :-) > > That reminds me... what about the free threading patches? Perhaps > > they should be added to the list of issues to consider for 1.6. > > My recolection was that unfortunately MOST of the time, they actually > slowed down things because of the number of locks involved... Guido > can no doubt shed more light onto this, but... there was a reason. Yes, there were problems in the first round with locks and lock contention. The main issue is that a list must always use a lock to keep itself consistent. Always. There is no way for an application to say "hey, list object! I've got a higher-level construct here that guarantees there will be no cross-thread use of this list. Ignore the locking." Another issue that can't be avoided is using atomic increment/decrement for the object refcounts. Guido has already asked me about free threading patches for 1.6. I don't know if his intent was to include them, or simply to have them available for those who need them. Certainly, this time around they will be simpler since Guido folded in some of the support stuff (e.g. PyThreadState and per-thread exceptions). There are some other supporting changes that could definitely go into the core interpreter. The slow part comes when you start to add integrity locks to list, dict, etc. That is when the question on whether to include free threading comes up. Design-wise, there is a change or two that I would probably make. Note that shoving free-threading into the standard interpreter would get more eyeballs at the thing, and that people may have great ideas for reducing the overheads. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Fri Nov 19 23:11:02 1999 From: gstein at lyra.org (Greg Stein) Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST) Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <Pine.LNX.4.10.9911191409570.10639-100000@nebula.lyra.org> On Fri, 19 Nov 1999, Asbahr, Jason wrote: > >We had some discussion a while back about enabling thread support by > >default, if the underlying OS supports it obviously. > > What's the consensus about Python microthreads -- a likely candidate > for incorporation in 1.6 (or later)? microthreads? eh? 
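The cost Christopher and Greg describe above is easy to picture at the Python level: without the interpreter-wide lock, every mutation of a shared object must take and release that object's own lock. In effect each list behaves like the wrapper below; the real patches work at the C level, and that acquire/release pair on every operation is where the single-threaded slowdown came from. The class is purely illustrative.

    import threading

    class LockedList:
        "Sketch of the per-object locking a free-threaded list needs."

        def __init__(self):
            self.lock = threading.Lock()
            self.data = []

        def append(self, item):
            self.lock.acquire()
            try:
                self.data.append(item)
            finally:
                self.lock.release()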
> Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? Sure. I don't see why not. I know that I've use OutputDebugString a bazillion times from the Python layer. The breakpoint thingy... dunno, but I don't see a reason to exclude it. Cheers, -g -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Fri Nov 19 23:11:38 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <Pine.LNX.4.10.9911191258180.10639-100000@nebula.lyra.org> Message-ID: <14389.51994.809130.22062@dolphin.mojam.com> Greg> The problem occurs when you path is [A, B], the file is in B, and Greg> you add something to A on-the-fly. The cache might direct the Greg> importer at B, missing your file. Typically your path will be relatively short (< 20 directories), right? Just stat the directories before consulting the cache. If any changed since the last time the cache was built, then invalidate the entire cache (or that portion of the cached information that is downstream from the first modified directory). It's still going to be cheaper than performing listdir for each directory in the path, and like you said, only require flushes during development or installation actions. Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip at mojam.com Fri Nov 19 23:15:14 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <14389.52210.833368.249942@dolphin.mojam.com> Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. The modtime of the directory's stat info should only change if you add or delete entries in the directory. Were you perhaps expecting changes when other operations took place, like rewriting an existing file? Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From skip at mojam.com Fri Nov 19 23:34:42 1999 From: skip at mojam.com (Skip Montanaro) Date: Fri, 19 Nov 1999 16:34:42 -0600 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269073918-25826188@hypernet.com> References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> Message-ID: <199911192234.QAA24710@dolphin.mojam.com> Gordon wrote: Gordon> I posted something here about dircache not too long ago. Gordon> Essentially, I found it completely unreliable on NT and on Linux Gordon> to stat the directory. There was some test code attached. to which I replied: Skip> The modtime of the directory's stat info should only change if you Skip> add or delete entries in the directory. 
Were you perhaps Skip> expecting changes when other operations took place, like rewriting Skip> an existing file? I took a couple minutes to write a simple script to check things. It created a file, changed its mode, then unlinked it. I was a bit surprised that deleting a file didn't appear to change the directory's mod time. Then I realized that since file times are only recorded with one-second precision, you might see no change to the directory's mtime in some circumstances. Adding a sleep to the script between directory operations resolved the apparent inconsistency. Still, as Gordon stated, you probably can't count on directory modtimes to tell you when to invalidate the cache. It's consistent, just not reliable... if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip at mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented... From mhammond at skippinet.com.au Sat Nov 20 01:04:28 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 20 Nov 1999 11:04:28 +1100 Subject: [Python-Dev] Another 1.6 wish In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com> Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat> > Also, we have a couple minor convenience functions for Python in an > MSDEV environment, an exposure of OutputDebugString for writing to > the DevStudio log window and a means of tripping DevStudio C/C++ layer > breakpoints from Python code (currently experimental). The msvcrt > module seems like a likely candidate for these, would these be > welcome additions? These are both available in the win32api module. They dont really fit in the "msvcrt" module, as they are not part of the C runtime library, but the win32 API itself. This is really a pointer to the fact that some or all of the win32api should be moved into the core - registry access is the thing people most want, but there are plenty of other useful things that people reguarly use... Guido objects to the coding style, but hopefully that wont be a big issue. IMO, the coding style isnt "bad" - it is just more an "MS" flavour than a "Python" flavour - presumably people reading the code will have some experience with Windows, so it wont look completely foreign to them. The good thing about taking it "as-is" is that it has been fairly well bashed on over a few years, so is really quite stable. The final "coding style" issue is that there are no "doc strings" - all documentation is embedded in C comments, and extracted using a tool called "autoduck" (similar to "autodoc"). However, Im sure we can arrange something there, too. Mark. From jcw at equi4.com Sat Nov 20 01:21:43 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Sat, 20 Nov 1999 01:21:43 +0100 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.WNT.4.04.9911190928100.315-100000@rigoletto.ski.org> <1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com> Message-ID: <3835E997.8A4F5BC5@equi4.com> Skip Montanaro wrote: > [dir stat cache times] > I took a couple minutes to write a simple script to check things. It > created a file, changed its mode, then unlinked it. I was a bit > surprised that deleting a file didn't appear to change the directory's > mod time. Then I realized that since file times are only recorded > with one-second Or two, on Windows with older (FAT, as opposed to VFAT) file systems. > precision, you might see no change to the directory's mtime in some > circumstances. 
Adding a sleep to the script between directory > operations resolved the apparent inconsistency. Still, as Gordon > stated, you probably can't count on directory modtimes to tell you > when to invalidate the cache. It's consistent, just not reliable... > > if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs, If the dir stat time is less than 2 seconds ago, flush - always. If the dir stat time says it hasn't been changed for at least 2 seconds then you can cache all entries and trust that any change is detected. In other words: take the *current* time into account, then it can work. I think. Maybe. Until you get into network drives and clock skew... -- Jean-Claude From gmcm at hypernet.com Sat Nov 20 04:43:32 1999 From: gmcm at hypernet.com (Gordon McMillan) Date: Fri, 19 Nov 1999 22:43:32 -0500 Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <3835E997.8A4F5BC5@equi4.com> Message-ID: <1269053086-27079185@hypernet.com> Jean-Claude wrote: > Skip Montanaro wrote: > > > [dir stat cache times] > > ... Then I realized that since > > file times are only recorded with one-second > > Or two, on Windows with older (FAT, as opposed to VFAT) file > systems. Oh lordy, it gets worse. With a time.sleep(1.0) between new files, Linux detects the change in the dir's mtime immediately. Cool. On NT, I get an average 2.0 sec delay. But sometimes it doesn't detect a delay in 100 secs (and my script quits). Then I added a stat of some file in the directory before the stat of the directory, (not the file I added). Now it acts just like Linux - no delay (on both FAT and NTFS partitions). OK... > I think. Maybe. Until you get into network drives and clock > skew... No success whatsoever in either direction across Samba. In fact the mtime of my Linux home directory as seen from NT is Jan 1, 1980. - Gordon From gstein at lyra.org Sat Nov 20 13:06:48 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST) Subject: [Python-Dev] updated imputil Message-ID: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> I've updated imputil... The main changes is that I added SysPathImporter and BuiltinImporter. I also did some restructing to help with bootstrapping the module (remove dependence on os.py). For testing a revamped Python import system, you can importing the thing and call imputil._test_revamp() to set it up. This will load normal, builtin, and frozen modules via imputil. Dynamic modules are still handled by Python, however. I ran a timing comparisons of importing all modules in /usr/lib/python1.5 (using standard and imputil-based importing). The standard mechanism can do it in about 8.8 seconds. Through imputil, it does it in about 13.0 seconds. Note that I haven't profiled/optimized any of the Importer stuff (yet). The point about dynamic modules actually discovered a basic problem that I need to resolve now. The current imputil assumes that if a particular Importer loaded the top-level module in a package, then that Importer is responsible for loading all other modules within that package. In my particular test, I tried to import "xml.parsers.pyexpat". The two package modules were handled by SysPathImporter. The pyexpat module is a dynamic load module, so it is *not* handled by the Importer -- bam. Failure. Basically, each part of "xml.parsers.pyexpat" may need to use a different Importer... 
Off to ponder, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Sat Nov 20 13:11:37 1999 From: gstein at lyra.org (Greg Stein) Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST) Subject: [Python-Dev] updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911200411060.10639-100000@nebula.lyra.org> oops... forgot: http://www.lyra.org/greg/python/imputil.py -g On Sat, 20 Nov 1999, Greg Stein wrote: > I've updated imputil... The main changes is that I added SysPathImporter > and BuiltinImporter. I also did some restructing to help with > bootstrapping the module (remove dependence on os.py). > > For testing a revamped Python import system, you can importing the thing > and call imputil._test_revamp() to set it up. This will load normal, > builtin, and frozen modules via imputil. Dynamic modules are still > handled by Python, however. > > I ran a timing comparisons of importing all modules in /usr/lib/python1.5 > (using standard and imputil-based importing). The standard mechanism can > do it in about 8.8 seconds. Through imputil, it does it in about 13.0 > seconds. Note that I haven't profiled/optimized any of the Importer stuff > (yet). > > The point about dynamic modules actually discovered a basic problem that I > need to resolve now. The current imputil assumes that if a particular > Importer loaded the top-level module in a package, then that Importer is > responsible for loading all other modules within that package. In my > particular test, I tried to import "xml.parsers.pyexpat". The two package > modules were handled by SysPathImporter. The pyexpat module is a dynamic > load module, so it is *not* handled by the Importer -- bam. Failure. > > Basically, each part of "xml.parsers.pyexpat" may need to use a different > Importer... > > Off to ponder, > -g > > -- > Greg Stein, http://www.lyra.org/ > > > _______________________________________________ > Python-Dev maillist - Python-Dev at python.org > http://www.python.org/mailman/listinfo/python-dev > -- Greg Stein, http://www.lyra.org/ From skip at mojam.com Sat Nov 20 15:16:58 1999 From: skip at mojam.com (Skip Montanaro) Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST) Subject: [Python-Dev] Import redesign [LONG] In-Reply-To: <1269053086-27079185@hypernet.com> References: <3835E997.8A4F5BC5@equi4.com> <1269053086-27079185@hypernet.com> Message-ID: <14390.44378.83128.546732@dolphin.mojam.com> Gordon> No success whatsoever in either direction across Samba. In fact Gordon> the mtime of my Linux home directory as seen from NT is Jan 1, Gordon> 1980. Ain't life grand? :-( Ah, well, it was a nice idea... S From jim at interet.com Mon Nov 22 17:43:39 1999 From: jim at interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 11:43:39 -0500 Subject: [Python-Dev] Import redesign [LONG] References: <Pine.LNX.4.10.9911190404580.10639-100000@nebula.lyra.org> Message-ID: <383972BB.C65DEB26@interet.com> Greg Stein wrote: > > I would suggest that both retain their *exact* meaning. We introduce > sys.importers -- a list of importers to check, in sequence. The first > importer on that list uses sys.path to look for and load modules. The > second importer loads builtins and frozen code (i.e. modules not on > sys.path). We should retain the current order. I think is is: first builtin, next frozen, next sys.path. I really think frozen modules should be loaded in preference to sys.path. After all, they are compiled in. > Users can insert/append new importers or alter sys.path as before. 
I agree with Greg that sys.path should remain as it is. A list of importers can add the extra functionality. Users will probably want to adjust the order of the list. > > Implementation: > > --------------- > > > > - There must clearly be some code in C that can import certain > > essential modules (to solve the chicken-or-egg problem), but I don't > > mind if the majority of the implementation is written in Python. > > Using Python makes it easy to subclass. > > I posited once before that the cost of import is mostly I/O rather than > CPU, so using Python should not be an issue. MAL demonstrated that a good > design for the Importer classes is also required. Based on this, I'm a > *strong* advocate of moving as much as possible into Python (to get > Python's ease-of-coding with little relative cost). Yes, I agree. And I think the main() should be written in Python. Lots of Python should be written in Python. > The (core) C code should be able to search a path for a module and import > it. It does not require dynamic loading or packages. This will be used to > import exceptions.py, then imputil.py, then site.py. But these can be frozen in (as you mention below). I dislike depending on sys.path to load essential modules. If they are not frozen in, then we need a command line argument to specify their path, with sys.path used otherwise. Jim Ahlstrom From jim at interet.com Mon Nov 22 18:25:46 1999 From: jim at interet.com (James C. Ahlstrom) Date: Mon, 22 Nov 1999 12:25:46 -0500 Subject: [Python-Dev] Import redesign (was: Python 1.6 status) References: <1269144272-21594530@hypernet.com> Message-ID: <38397C9A.DF6B7112@interet.com> Gordon McMillan wrote: > [JimA] > > Think about multiple packages in multiple zip files. The zip > > files store file directories. That means we would need a > > sys.zippath to search the zip files. I don't want another > > PYTHONPATH phenomenon. > > What if sys.path looked like: > [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...] Well, that changes the current meaning of sys.path. > > > > I suggest that archive files MUST be put into a known > > > > directory. > > No way. Hard code a directory? Overwrite someone else's > Python "standalone"? Write to a C: partition that is > deliberately sized to hold nothing but Windows? Make > network installations impossible? Ooops. I didn't mean a known directory you couldn't change. But I did mean a directory you shouldn't change. But you are right. The directory should be configurable. But I would still like to see a highly encouraged directory. I don't yet have a good design for this. Anyone have ideas on an official way to find library files? I think a Python library file is a Good Thing, but it is not useful if the archive can't be found. I am thinking of a busy SysAdmin with someone nagging him/her to install Python. SysAdmin doesn't want another headache. What if Python becomes popular and users want it on Unix and PC's? More work! There should be a standard way to do this that just works and is dumb-stupid-simple. This is a Python promotion issue. Yes everyone here can make sys.path work, but that is not the point. > The official Windows solution is stuff in registry about app > paths and such. Putting the dlls in the exe's directory is a > workaround which works and is more managable than the > official solution. I agree completely. 
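A "highly encouraged but configurable" location does not need much machinery; a short probe list, checked in a documented order, covers the busy-SysAdmin case. In the sketch below the environment variable name is invented purely for illustration, and libpy.pyl is the archive name suggested elsewhere in this discussion.

    import os, sys

    def find_standard_archive(name="libpy.pyl"):
        """Return the path of the standard library archive, or None if absent."""
        candidates = []
        # 1. An explicit override always wins (the configurable part).
        if os.environ.get("PYTHONARCHIVE"):
            candidates.append(os.environ["PYTHONARCHIVE"])
        # 2. Next to the executable -- the same trick that works for DLLs.
        candidates.append(os.path.join(os.path.dirname(sys.executable), name))
        # 3. The conventional per-installation library directory.
        candidates.append(os.path.join(sys.prefix, "lib", name))
        for path in candidates:
            if os.path.isfile(path):
                return path
        return None

The particular list matters less than having exactly one documented rule that the installer and the interpreter both follow.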
> > > > We should also have the ability to append archive files to > > > > the executable or a shared library assuming the OS allows > > > > this > > That's a handy trick on Windows, but it's got nothing to do > with Python. It also works on Linux. I don't know about other systems. > Flexibility. You can put Christian's favorite Einstein quote here > too. I hope we can still have ease of use with all this flexibility. As I said, we need to promote Python. Jim Ahlstrom From mal at lemburg.com Tue Nov 23 14:32:42 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Tue, 23 Nov 1999 14:32:42 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com> Message-ID: <383A977A.C20E6518@lemburg.com> FYI, I've uploaded a new version of the proposal which includes the encodings package, definition of the 'raw unicode escape' encoding (available via e.g. ur""), Unicode format strings and a new method .breaklines(). The latest version of the proposal is available at: http://starship.skyport.net/~lemburg/unicode-proposal.txt Older versions are available as: http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt Some POD (points of discussion) that are still open: ? Stream readers: What about .readline(), .readlines() ? These could be implemented using .read() as generic functions instead of requiring their implementation by all codecs. Also see Line Breaks. ? Python interface for the Unicode property database ? What other special Unicode formatting characters should be enhanced to work with Unicode input ? Currently only the following special semantics are defined: u"%s %s" % (u"abc", "abc") should return u"abc abc". Pretty quiet around here lately... -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 38 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jcw at equi4.com Tue Nov 23 16:17:36 1999 From: jcw at equi4.com (Jean-Claude Wippler) Date: Tue, 23 Nov 1999 16:17:36 +0100 Subject: [Python-Dev] New thread ideas in Perl-land Message-ID: <383AB010.DD46A1FB@equi4.com> Just got a note about a paper on a new way of dealing with threads, as presented to the Perl-Porters list. The idea is described in: http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt I have no time to dive in, comment, or even judge the relevance of this, but perhaps someone else on this list wishes to check it out. The author of this is Greg London <bartels at pixelmagic.com>. -- Jean-Claude From mhammond at skippinet.com.au Tue Nov 23 23:45:14 1999 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 24 Nov 1999 09:45:14 +1100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 In-Reply-To: <383A977A.C20E6518@lemburg.com> Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat> > Pretty quiet around here lately... My guess is that most positions and opinions have been covered. It is now probably time for less talk, and more code! It is time to start an implementation plan? Do we start with /F's Unicode implementation (which /G *smirk* seemed to approve of)? Who does what? When can we start to play with it? And a key point that seems to have been thrust in our faces at the start and hardly mentioned recently - does the proposal as it stands meet our sponsor's (HP) requirements? Mark. 
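On the "less talk, more code" note, one of the open points is small enough to prototype right away: if every codec's stream reader provides .read(), then .readline() and .readlines() can be layered on top once as generic methods instead of being reimplemented by each codec. The class below is a sketch of that idea only; the name and the details are not taken from the proposal.

    class StreamReaderBase:
        """Generic line handling layered on a codec-supplied read()."""

        def read(self, size=-1):
            # Each codec implements this: return up to 'size' decoded
            # characters, or all remaining data if size < 0.
            raise NotImplementedError

        def readline(self):
            # Slow but codec-independent: pull one decoded character at a
            # time until a line break.  A codec can override this with
            # something faster if it matters.
            chars = []
            while 1:
                c = self.read(1)
                if not c:
                    break
                chars.append(c)
                if c == u"\n":
                    break
            return u"".join(chars)

        def readlines(self):
            lines = []
            while 1:
                line = self.readline()
                if not line:
                    break
                lines.append(line)
            return lines

Whether this generic fallback is good enough, or whether line breaking really belongs inside each codec (see the Line Breaks point), is exactly the question the proposal leaves open.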
From gstein at lyra.org Wed Nov 24 01:40:44 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST) Subject: [Python-Dev] Re: updated imputil In-Reply-To: <Pine.LNX.4.10.9911200356050.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> <enable-ramble-mode> :-) On Sat, 20 Nov 1999, Greg Stein wrote: >... > The point about dynamic modules actually discovered a basic problem that I > need to resolve now. The current imputil assumes that if a particular > Importer loaded the top-level module in a package, then that Importer is > responsible for loading all other modules within that package. In my > particular test, I tried to import "xml.parsers.pyexpat". The two package > modules were handled by SysPathImporter. The pyexpat module is a dynamic > load module, so it is *not* handled by the Importer -- bam. Failure. > > Basically, each part of "xml.parsers.pyexpat" may need to use a different > Importer... I've thought about this and decided the issue is with my particular Importer, rather than the imputil design. The PathImporter traverses a set of paths and establishes a package hierarchy based on a filesystem layout. It should be able to load dynamic modules from within that filesystem area. A couple alternatives, and why I don't believe they work as well: * A separate importer to just load dynamic libraries: this would need to replicate PathImporter's mapping of Python module/package hierarchy onto the filesystem. There would also be a sequencing issue because one Importer's paths would be searched before the other's paths. Current Python import rules establishes that a module earlier in sys.path (whether a dyn-lib or not) is loaded before one later in the path. This behavior could be broken if two Importers were used. * A design whereby other types of modules can be placed into the filesystem and multiple Importers are used to load parts of the path (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This design doesn't work well because the mapping of Python module/package to the filesystem is established by PathImporter -- try to mix a "private" mapping design among Importers creates too much coupling. There is also an argument that the design is fundamentally incorrect :-). I would argue against that, however. I'm not sure what form an argument *against* imputil would be, so I'm not sure how to preempty it :-). But we can get an idea of various arguments by hypothesizing different scenarios and requireing that the imputil design satisifies them. In the above two alternatives, they were examing the use of a secondary Importer to load things out of the filesystem (and it explained why two Importers in whatever configuration is not a good thing). Let's state for argument's sake that files of some type T must be placable within the filesystem (i.e. according to the layout defined by PathImporter). We'll also say that PathImporter doesn't understand T, since the latter was designed later or is private to some app. The way to solve this is to allow PathImporter to recognize it through some configuration of the instance (e.g. self.recognized_types). A set of hooks in the PathImporter would then understand how to map files of type T to a code or module object. 
(alternatively, a generalized set of hooks at the Importer class level) Note that you could easily have a utility function that scans sys.importers for a PathImporter instance and adds the data to recognize a new type -- this would allow for simple installation of new types. Note that PathImporter inherently defines a 1:1 mapping from a module to a file. Archives (zip or jar files) cannot be recognized and handled by PathImporter. An archive defines an entirely different style of mapping between a module/package and a file in the filesystem. Of course, an Importer that uses archives can certainly look for them in sys.path. The imputil design is derived directly from the "import" statement. "Here is a module/package name, give me a module." (this is embodied in the get_code() method in Importer) The find/load design established by ihooks is very filesystem-based. In many situations, a find/load is very intertwined. If you want to take the URL case, then just examine the actual network activity -- preferably, you want a single transaction (e.g. one HTTP GET). Find/load implies two transactions. With nifty context handling between the two steps, you can get away with a single transaction. But the point is that the design requires you to get work around its inherent two-step mechanism and establish a single step. This is weird, of course, because importing is never *just* a find or a load, but always both. Well... since I've satisfied to myself that PathImporter needs to load dynamic lib modules, I'm off to code it... Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 24 02:45:29 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST) Subject: [Python-Dev] breaking out code for dynamic loading Message-ID: <Pine.LNX.4.10.9911231731000.10639-100000@nebula.lyra.org> Guido, I can't find the message, but it seems that at some point you mentioned wanting to break out importdl.c into separate files. The configure process could then select the appropriate one to use for the platform. Sounded great until I looked at importdl.c. There are a 13 variants of dynamic loading. That would imply 13 separate files/modules. I'd be happy to break these out, but are you actually interested in that many resulting modules? If so, then any suggestions for naming? (e.g. aix_dynload, win32_dynload, mac_dynload) Here are the variants: * NeXT, using FVM shlibs (USE_RLD) * NeXT, using frameworks (USE_DYLD) * dl / GNU dld (USE_DL) * SunOS, IRIX 5 shared libs (USE_SHLIB) * AIX dynamic linking (_AIX) * Win32 platform (MS_WIN32) * Win16 platform (MS_WIN16) * OS/2 dynamic linking (PYOS_OS2) * Mac CFM (USE_MAC_DYNAMIC_LOADING) * HP/UX dyn linking (hpux) * NetBSD shared libs (__NetBSD__) * FreeBSD shared libs (__FreeBSD__) * BeOS shared libs (__BEOS__) Could I suggest a new top-level directory in the Python distribution named "Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new directories for each of the above platforms and move the appropriate portion of importdl.c into there as a Python C Extension Module. (the module would still be statically linked into the interpreter!) ./configure could select the module and write a Setup.dynload, much like it does with Setup.thread. 
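To make the single-transaction point from the imputil note above concrete, here is what a URL importer looks like when find and load are one step. The method name follows imputil's get_code(), but the signature is simplified and this class is not part of imputil; it is only a sketch.

    from urllib import urlopen

    class UrlImporter:
        def __init__(self, baseurl):
            self.baseurl = baseurl

        def get_code(self, modname):
            # One transaction: a single GET fetches the source, and the same
            # step compiles it.  A find_module()/load_module() split would
            # need either two round trips or a side channel carrying the
            # fetched bytes from the 'find' step to the 'load' step.
            url = "%s/%s.py" % (self.baseurl, modname)
            try:
                source = urlopen(url).read()
            except IOError:
                return None          # not ours -- let the next importer try
            return compile(source, url, "exec")

A real importer still has to create the module object, set __file__, and deal with packages; the sketch only shows that the fetch and the compile happen in one place.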
Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein at lyra.org Wed Nov 24 03:43:50 1999 From: gstein at lyra.org (Greg Stein) Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST) Subject: [Python-Dev] another round of imputil work completed In-Reply-To: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <Pine.LNX.4.10.9911231837480.10639-100000@nebula.lyra.org> On Tue, 23 Nov 1999, Greg Stein wrote: >... > Well... since I've satisfied to myself that PathImporter needs to load > dynamic lib modules, I'm off to code it... All right. imputil.py now comes with code to emulate the builtin Python import mechanism. It loads all the same types of files, uses sys.path, and (pointed out by JimA) loads builtins before looking on the path. The only "feature" it doesn't support is using package.__path__ to look for submodules. I never liked that thing, so it isn't in there. (imputil *does* set the __path__ attribute, tho) Code is available at: http://www.lyra.org/greg/python/imputil.py Next step is to add a "standard" library/archive format. JimA and I have been tossing some stuff back and forth on this. Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal at lemburg.com Wed Nov 24 09:34:52 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 09:34:52 +0100 Subject: [Python-Dev] Unicode Proposal: Version 0.8 References: <002301bf3604$68fd8f00$0501a8c0@bobcat> Message-ID: <383BA32C.2E6F4780@lemburg.com> Mark Hammond wrote: > > > Pretty quiet around here lately... > > My guess is that most positions and opinions have been covered. It is > now probably time for less talk, and more code! Or that everybody is on holidays... like Guido. > It is time to start an implementation plan? Do we start with /F's > Unicode implementation (which /G *smirk* seemed to approve of)? Who > does what? When can we start to play with it? This depends on whether HP agrees on the current specs. If they do, there should be code by mid December, I guess. > And a key point that seems to have been thrust in our faces at the > start and hardly mentioned recently - does the proposal as it stands > meet our sponsor's (HP) requirements? Haven't heard anything from them yet (this is probably mainly due to Guido being offline). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal at lemburg.com Wed Nov 24 10:32:46 1999 From: mal at lemburg.com (M.-A. Lemburg) Date: Wed, 24 Nov 1999 10:32:46 +0100 Subject: [Python-Dev] Import Design Message-ID: <383BB0BE.BF116A28@lemburg.com> Before hooking on to some more PathBuiltinImporters ;-), I'd like to spawn a thread leading in a different direction... There has been some discussion on what we really expect of the import mechanism to be able to do. Here's a summary of what I think we need: * compatibility with the existing import mechanism * imports from library archives (e.g. .pyl or .par-files) * a modified intra package import lookup scheme (the thingy which I call "walk-me-up-Scotty" patch -- see previous posts) And for some fancy stuff: * imports from URLs (e.g. 
these could be put on the path for automatic inclusion in the import scan or be passed explicitly to __import__) * a (file based) static lookup cache to enhance lookup performance which is enabled via a command line switch (rather than being enabled per default), so that the user can decide whether to apply this optimization or not The point I want to make is: there aren't all that many features we are really looking for, so why not incorporate these into the builtin importer and only *then* start thinking about schemes for hooks, managers, etc. ?! -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 37 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From captainrobbo at yahoo.com Wed Nov 24 12:40:16 1999 From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=) Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST) Subject: [Python-Dev] Unicode Proposal: Version 0.8 Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com> --- Mark Hammond <mhammond at skippinet.com.au> wrote: > > Pretty quiet around here lately... > > My guess is that most positions and opinions have > been covered. It is > now probably time for less talk, and more code! > > It is time to start an implementation plan? Do we > start with /F's > Unicode implementation (which /G *smirk* seemed to > approve of)? Who > does what? When can we start to play with it? > > And a key point that seems to have been thrust in > our faces at the > start and hardly mentioned recently - does the > proposal as it stands > meet our sponsor's (HP) requirements? > > Mark. I had a long chat with them on Friday :-) They want it done, but nobody is actively working on it now as far as I can tell, and they are very busy. The per-thread thing was a red herring - they just want to be able to do (for example) web servers handling different encodings from a central unicode database, so per-output-stream works just fine. They will be at IPC8; I'd suggest that a round of prototyping, we insist they read it and then discuss it at IPC8, and be prepared to rework things thereafter are important. Hopefully then we'll have a plan on how to tackle the much larger (but less interesting to python-dev) job of writing and verifying all the codecs and utilities. Andy Robinson ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Thousands of Stores. Millions of Products. All in one place. Yahoo! Shopping: http://shopping.yahoo.com From jim at interet.com Wed Nov 24 15:43:57 1999 From: jim at interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 09:43:57 -0500 Subject: [Python-Dev] Re: updated imputil References: <Pine.LNX.4.10.9911231549120.10639-100000@nebula.lyra.org> Message-ID: <383BF9AD.E183FB98@interet.com> Greg Stein wrote: > * A separate importer to just load dynamic libraries: this would need to > replicate PathImporter's mapping of Python module/package hierarchy onto > the filesystem. There would also be a sequencing issue because one > Importer's paths would be searched before the other's paths. Current > Python import rules establishes that a module earlier in sys.path > (whether a dyn-lib or not) is loaded before one later in the path. This > behavior could be broken if two Importers were used. I would like to argue that on Windows, import of dynamic libraries is broken. 
If a file something.pyd is imported, then sys.path is searched to find the module. If a file something.dll is imported, the same thing happens. But Windows defines its own search order for *.dll files which Python ignores. I would suggest that this is wrong for files named *.dll, but OK for files named *.pyd. A SysAdmin should be able to install and maintain *.dll as she has been trained to do. This makes maintaining Python installations simpler and more un-surprising. I have no solution to the backward compatibilty problem. But the code is only a couple lines. A LoadLibrary() call does its own path searching. Jim Ahlstrom From jim at interet.com Wed Nov 24 16:06:17 1999 From: jim at interet.com (James C. Ahlstrom) Date: Wed, 24 Nov 1999 10:06:17 -0500 Subject: [Python-Dev] Import Design References: <383BB0BE.BF116A28@lemburg.com> Message-ID: <383BFEE9.B4FE1F19@interet.com> "M.-A. Lemburg" wrote: > The point I want to make is: there aren't all that many features > we are really looking for, so why not incorporate these into > the builtin importer and only *then* start thinking about > schemes for hooks, managers, etc. ?! Marc has made this point before, and I think it should be considered carefully. It is a lot of work to re-create the current import logic in Python and it is almost guaranteed to be slower. So why do it? I like imputil.py because it leads to very simple Python installations. I view this as a Python promotion issue. If we have a boot mechanism plus archive files, we can have few-file Python installations with package addition being just adding another file. But at least some of this code must be in C. I volunteer to write the rest of it in C if that is what people want. But it would add two hundred more lines of code to import.c. So maybe now is the time to switch to imputil, instead of waiting for later. But I am indifferent as long as I can tell a Python user to just put an archive file libpy.pyl in his Python directory and everything will Just Work. Jim Ahlstrom From bwarsaw at python.org Tue Nov 30 21:23:40 1999 From: bwarsaw at python.org (Barry Warsaw) Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST) Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us> Hello Python Developers! Thursday January 27 2000, the final day of the 8th International Python Conference is Developers' Day, where Python hackers get together to discuss and reach agreements on the outstanding issues facing Python. This is also your once-a-year chance for face-to-face interactions with Python's creator Guido van Rossum and other experienced Python developers. To make Developers' Day a success, we need you! We're looking for a few good champions to lead topic sessions. As a champion, you will choose a topic that fires you up and write a short position paper for publication on the web prior to the conference. You'll also prepare introductory material for the topic overview session, and lead a 90 minute topic breakout group. We've had great champions and topics in previous years, and many features of today's Python had their start at past Developers' Days. This is your chance to help shape the future of Python for 1.6, 2.0 and beyond. If you are interested in becoming a topic champion, you must email me by Wednesday December 15, 1999. 
For more information, please visit the IPC8 Developers' Day web page at <http://www.python.org/workshops/2000-01/devday.html> This page has more detail on schedule, suggested topics, important dates, etc. To volunteer as a champion, or to ask other questions, you can email me at bwarsaw at python.org. -Barry