--- "Barry A. Warsaw" <bwarsaw(a)cnri.reston.va.us>
wrote:
>
> I'm starting to think about devday topics. Sounds
> like an I18n
> session would be very useful. Champions?
>
I'm willing to explain what the fuss is about to
bemused onlookers and give some examples of problems
it should be able to solve - plenty of good slides and
screen shots. I'll stay well away from the C
implementation issues.
Regards,
Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com
> 2. Are there plans for an internationalization
> session at IPC8? Perhaps a
> few key players could be locked into a room for a
> couple days, to emerge
> bloodied, but with an implementation in-hand...
Excellent idea.
- Andy
> a slightly hairer design issue is what combinations
> of pattern and string the new 're' will handle.
>
> the first two are obvious:
>
> ordinary pattern, ordinary string
> unicode pattern, unicode string
>
> but what about these?
>
> ordinary pattern, unicode string
> unicode pattern, ordinary string
I think the logical thing to do would be to "promote" the ordinary pattern or
string to unicode, in a similar way to what happens if you combine ints and
floats in a single expression.
The result may be a bit surprising if your pattern is in ascii and you've
never been aware of unicode and are given such a string from somewhere else,
but then if you're only aware of integer arithmetic and are suddenly presented
with a couple of floats you'll also be pretty surprised at the result. At
least it's easily explained.
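The analogy can be made concrete with a short sketch (modern Python, which postdates this thread, is used purely for illustration):

```python
# Like-with-like matching is the uncontroversial case:
import re

assert re.match(rb"ab+", b"abbb")    # ordinary pattern, ordinary string
assert re.match("ab+", "abbb")       # unicode pattern, unicode string

# "Promotion" in the mixed cases would decode the ordinary side to
# unicode first, the way 1 + 2.5 promotes the int operand to a float:
assert 1 + 2.5 == 3.5
promoted = b"abbb".decode("ascii")   # ordinary string promoted to unicode
assert re.match("ab+", promoted)     # then matched against a unicode pattern
```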
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen(a)oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Guido has asked me to get involved in this discussion,
as I've been working practically full-time on i18n for
the last year and a half and have done quite a bit
with Python in this regard. I thought the most
helpful thing would be to describe the real-world
business problems I have been tackling so people can
understand what one might want from an encoding
toolkit. In this (long) post I have included:
1. who I am and what I want to do
2. useful sources of info
3. a real world i18n project
4. what I'd like to see in an encoding toolkit
Grab a coffee - this is a long one.
1. Who I am
--------------
Firstly, credentials. I'm a Python programmer by
night, and when I can involve it in my work which
happens perhaps 20% of the time. More relevantly, I
did a postgrad course in Japanese Studies and lived in
Japan for about two years; in 1990 when I returned, I
was speaking fairly fluently and could read a
newspaper with regular reference to a dictionary.
Since then my Japanese has atrophied badly, but it is
good enough for IT purposes. For the last year and a
half I have been internationalizing a lot of systems -
more on this below.
My main personal interest is that I am hoping to
launch a company using Python for reporting, data
cleaning and transformation. An encoding library is
sorely needed for this.
2. Sources of Knowledge
------------------------------
We should really go for world class advice on this.
Some people who could really contribute to this
discussion are:
- Ken Lunde, author of "CJKV Information Processing"
and head of Asian Type Development at Adobe.
- Jeffrey Friedl, author of "Mastering Regular
Expressions", and a long time Japan resident and
expert on things Japanese
- Maybe some of the Ruby community?
I'll list books, URLs etc. for anyone who needs them
on request.
3. A Real World Project
----------------------------
18 months ago I was offered a contract with one of the
world's largest investment management companies (which
I will nickname HugeCo), who (after many years having
analysts out there) were launching a business in Japan
to attract savers; due to recent legal changes,
Japanese people can now freely buy into mutual funds
run by foreign firms. Given the 2% they historically
get on their savings, and the 12% that US equities
have returned for most of this century, this is a
business with huge potential. I've been there for a
while now, rotating through many different IT projects.
HugeCo runs its non-US business out of the UK. The
core deal-processing business runs on IBM AS400s.
These are kind of a cross between a relational
database and a file system, and speak their own
encoding called EBCDIC. Five years ago the AS400 had limited
connectivity to everything else, so they also started
deploying Sybase databases on Unix to support some
functions. This means 'mirroring' data between the
two systems on a regular basis. IBM has always
included encoding information on the AS400 and it
converts from EBCDIC to ASCII on request with most of
the transfer tools (FTP, database queries etc.)
To make things work for Japan, everyone realised that
a double-byte representation would be needed.
Japanese has about 7000 characters in most IT-related
character sets, and there are a lot of ways to store
it. Here's a potted language lesson. (Apologies to
people who really know this field -- I am not going to
be fully pedantic or this would take forever).
Japanese includes two phonetic alphabets (each with
about 80-90 characters), the thousands of Kanji, and
English characters, often all in the same sentence.
The first attempt to display something was to
make a single-byte character set which included
ASCII, and a simplified (and very ugly) katakana
alphabet in the upper half of the code page. So you
could spell out the sounds of Japanese words using
'half width katakana'.
The basic 'character set' is Japanese Industrial Standard
0208 ("JIS"). This was defined in 1978, the first
official Asian character set to be defined by a
government. This can be thought of as a printed chart
showing the characters - it does not define their
storage on a computer. It defined a logical 94 x 94
grid, and each character has an index in this grid.
The "JIS" encoding was a way of mixing ASCII and
Japanese in text files and emails. Each Japanese
character had a double-byte value. It had 'escape
sequences' to say 'You are now entering ASCII
territory' or the opposite. A few years later Microsoft
came up with Shift-JIS, a smarter encoding.
This basically said "Look at the next byte. If below
127, it is ASCII; if between A and B, it is a
half-width
katakana; if between B and C, it is the first half of
a double-byte character and the next one is the second
half". Extended Unix Code (EUC) does similar tricks.
Both have the property that there are no control
characters, and ASCII is still ASCII. There are a few
other encodings too.
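The first-byte dispatch that Shift-JIS requires can be sketched as follows; the numeric ranges are the commonly documented Shift-JIS ones, filled in here for concreteness:

```python
def classify_sjis_byte(b):
    """Classify a lead byte in a Shift-JIS stream (standard ranges)."""
    if b < 0x80:
        return "ascii"
    if 0xA1 <= b <= 0xDF:
        return "halfwidth-katakana"
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
        return "double-byte-lead"    # the next byte is the second half
    return "invalid"

assert classify_sjis_byte(0x41) == "ascii"           # 'A'
assert classify_sjis_byte(0xB1) == "halfwidth-katakana"
assert classify_sjis_byte(0x88) == "double-byte-lead"
```

The key property mentioned above holds: bytes below 128 are plain ASCII, so ASCII files pass through unchanged.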
Unfortunately for me and HugeCo, IBM had their own
standard before the Japanese government did, and it
differs; it is most commonly called DBCS (Double-Byte
Character Set). This involves shift-in and shift-out
sequences (0x16 and 0x17, cannot remember which way
round), so you can mix single and double bytes in a
field. And we used AS400s for our core processing.
So, back to the problem. We had a FoxPro system using
ShiftJIS on the desks in Japan which we wanted to
replace in stages, and an AS400 database to replace it
with. The first stage was to hook them up so names
and addresses could be uploaded to the AS400, and data
files consisting of daily report input could be
downloaded to the PCs. The AS400 supposedly had a
library which did the conversions, but no one at IBM
knew how it worked. The people who did all the
evaluations had basically proved that 'Hello World' in
Japanese could be stored on an AS400, but never looked
at the conversion issues until mid-project. Not only
did we need a conversion filter, we had the problem
that the character sets were of different sizes. So
it was possible - indeed, likely - that some of our
ten thousand customers' names and addresses would
contain characters only on one system or the other,
and fail to
survive a round trip. (This is the absolute key issue
for me - will a given set of data survive a round trip
through various encoding conversions?)
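The round-trip question can be posed mechanically; here is a minimal sketch using the codecs of a modern Python, standing in for the AS400 and Shift-JIS converters described here:

```python
def survives_round_trip(text, encoding="shift_jis"):
    """True if text comes back unchanged from an encode/decode cycle."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

assert survives_round_trip("Hello")               # plain ASCII is safe
assert survives_round_trip("\u30c6\u30b9\u30c8")  # katakana 'tesuto'
assert not survives_round_trip("\u20ac")          # no euro sign in Shift-JIS
```

In practice you would run every record through a check like this before trusting a conversion path.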
We figured out how to get the AS400 to do the
conversions during a file transfer in one direction,
and I wrote some Python scripts to make up files with
each official character in JIS on a line; these went
up with conversion, came back binary, and I was able
to build a mapping table and 'reverse engineer' the
IBM encoding. It was straightforward in theory, "fun"
in practice. I then wrote a Python library which knew
about the AS400 and Shift-JIS encodings, and could
translate a string between them. It could also detect
corruption and warn us when it occurred. (This is
another key issue - you will often get badly encoded
data, half a kanji or a couple of random bytes, and
need to be clear on your strategy for handling it in
any library). It was slow, but it got us our gateway
in both directions, and it warned us of bad input. 360
characters in the DBCS encoding actually appear twice,
so perfect round trips are impossible, but practically
you can survive with some validation of input at both
ends. The final story was that our names and
addresses were mostly safe, but a few obscure symbols
weren't.
A big issue was that field lengths varied. An address
field 40 characters long on a PC might grow to 42 or
44 on an AS400 because of the shift characters, so the
software would truncate the address during import, and
cut a kanji in half. This resulted in a string that
was illegal DBCS, and errors in the database. To
guard against this, you need really picky input
validation. You not only ask 'is this string valid
Shift-JIS', you check it will fit on the other system
too.
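That double check ('valid here, fits there') might look like this sketch; the 2-byte shift-out/shift-in overhead per double-byte run follows the DBCS description above, and the field lengths are invented:

```python
def fits_both_systems(text, sjis_limit=40, dbcs_limit=40):
    """Check a string fits in both the Shift-JIS and the DBCS field."""
    sjis_len = len(text.encode("shift_jis"))
    runs = 0                 # number of double-byte runs needing shift chars
    prev_wide = False
    for ch in text:
        wide = len(ch.encode("shift_jis")) == 2
        if wide and not prev_wide:
            runs += 1
        prev_wide = wide
    dbcs_len = sjis_len + 2 * runs   # shift-out + shift-in per run
    return sjis_len <= sjis_limit and dbcs_len <= dbcs_limit

assert fits_both_systems("abc")
assert fits_both_systems("\u65e5" * 19)      # 38 + 2 shift bytes = 40: fits
assert not fits_both_systems("\u65e5" * 20)  # 40 + 2 shift bytes = 42: too long
```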
The next stage was to bring in our Sybase databases.
Sybase make a Unicode database, which works like the
usual one except that all your SQL code suddenly
becomes case sensitive - more (unrelated) fun when
you have 2000 tables. Internally it stores data in
UTF8, which is a 'rearrangement' of Unicode which is
much safer to store in conventional systems.
Basically, a UTF8 character is between one and three
bytes, there are no nulls or control characters, and
the ASCII characters are still the same ASCII
characters. UTF8<->Unicode involves some bit
twiddling but is one-to-one and entirely algorithmic.
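For the curious, the bit twiddling for the one-to-three-byte cases can be spelt out directly from the UTF8 bit layout:

```python
def to_utf8(cp):
    """Encode a code point up to U+FFFF by hand, per the UTF8 bit layout."""
    if cp < 0x80:                          # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    return bytes([0xE0 | cp >> 12,         # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

assert to_utf8(ord("A")) == b"A"                     # ASCII maps to itself
assert to_utf8(0x65E5) == "\u65e5".encode("utf-8")   # a kanji -> 3 bytes
```

Note that the ASCII range maps to itself, which is exactly why UTF8 is safe to pass through conventional systems.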
We had a product to 'mirror' data between AS400 and
Sybase, which promptly broke when we fed it Japanese.
The company bought a library called Unilib to do
conversions, and started rewriting the data mirror
software. This library (like many) uses Unicode as a
central point in all conversions, and offers most of
the world's encodings. We wanted to test it, and used
the Python routines to put together a regression
test. As expected, it was mostly right but had some
differences, which we were at least able to document.
We also needed to rig up a daily feed from the legacy
FoxPro database into Sybase while it was being
replaced (about six months). We took the same
library, built a DLL wrapper around it, and I
interfaced to this with DynWin, so we were able to do
the low-level string conversion in compiled code and
the high-level
control in Python. A FoxPro batch job wrote out
delimited text in shift-JIS; Python read this in, ran
it through the DLL to convert it to UTF8, wrote that
out as UTF8 delimited files, and ftp'ed them to an
'in' directory on the Unix box ready for daily import.
At this point we had a lot of fun with field widths -
Shift-JIS is much more compact than UTF8 when you have
a lot of kanji (e.g. address fields).
Another issue was half-width katakana. These were the
earliest attempt to get some form of Japanese out of a
computer, and are single-byte characters above 128 in
Shift-JIS - but are not part of the JIS0208 standard.
They look ugly and are discouraged; but when you are
entering a long address in a field of a database, and
it won't quite fit, the temptation is to go from
two-bytes-per-character to one (just hit F7 in
Windows) to save space. Unilib rejected these (as
would Java), but has optional modes to preserve them
or 'expand them out' to their full-width equivalents.
The final technical step was our reports package.
This is a 4GL using a really horrible 1980s Basic-like
language which reads in fixed-width data files and
writes out Postscript; you write programs saying 'go
to x,y' and 'print customer_name', and can build up
anything you want out of that. It's a monster to
develop in, but when done it really works -
million page jobs no problem. We had bought into this
on the promise that it supported Japanese; actually, I
think they had got the equivalent of 'Hello World' out
of it, since we had a lot of problems later.
The first stage was that the AS400 would send down
fixed width data files in EBCDIC and DBCS. We ran
these through a C++ conversion utility, again using
Unilib. We had to filter out and warn about corrupt
fields, which the conversion utility would reject.
Surviving records then went into the reports program.
It then turned out that the reports program only
supported some of the Japanese alphabets.
Specifically, it had a built in font switching system
whereby when it encountered ASCII text, it would flip
to the most recent single byte font, and when it found
a byte above 127, it would flip to a double byte font.
This is because many Chinese fonts do (or did)
not include English characters, or included really
ugly ones. This was wrong for Japanese, and made the
half-width katakana unprintable. I found out that I
could control fonts if I printed one character at a
time with a special escape sequence, so wrote my own
bit-scanning code (tough in a language without ord()
or bitwise operations) to examine a string, classify
every byte, and control the fonts the way I wanted.
So a special subroutine is used for every name or
address field. This is apparently not unusual in GUI
development (especially web browsers) - you rarely
find a complete Unicode font, so you have to switch
fonts on the fly as you print a string.
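The scanning itself is simple in Python (the 4GL made it hard); here is a hedged sketch that splits a Shift-JIS byte string into per-font runs, using the standard byte ranges:

```python
def font_runs(data):
    """Split Shift-JIS bytes into (font, bytes) runs for printing."""
    runs, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            kind, step = "ascii", 1
        elif 0xA1 <= b <= 0xDF:
            kind, step = "half-katakana", 1
        else:
            kind, step = "kanji", 2      # lead byte plus trail byte
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + data[i:i + step])
        else:
            runs.append((kind, data[i:i + step]))
        i += step
    return runs

assert font_runs(b"AB") == [("ascii", b"AB")]
assert font_runs(bytes([0x93, 0xFA, 0x41])) == [
    ("kanji", b"\x93\xfa"), ("ascii", b"A")]
```

Each run would then be emitted with the appropriate font-switching escape sequence.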
After all of this, we had a working system and knew
quite a bit about encodings. Then the curve ball
arrived: User Defined Characters!
It is not true to say that there are exactly 6879
characters in Japanese, any more than you can count the
number of languages on the Indian sub-continent or the
types of cheese in France. There are historical
variations and they evolve. Some people's names got
missed out, and others like to write a kanji in an
unusual way. Others arrived from China where they
have more complex variants of the same characters.
Despite the Japanese government's best attempts, these
people have dug their heels in and want to keep their
names the way they like them. My first reaction was
'Just Say No' - I basically said that if one of these
customers (14 out of a database of 8000) could show me
a tax form or phone bill with the correct UDC on it,
we would implement it but not otherwise (the usual
workaround is to spell their name phonetically in
katakana). But our marketing people put their foot
down.
A key factor is that Microsoft has 'extended the
standard' a few times. First of all, Microsoft and
IBM include an extra 360 characters in their code page
which are not in the JIS0208 standard. This is well
understood and most encoding toolkits know that 'Code
Page 932' is Shift-JIS plus a few extra characters.
Secondly, Shift-JIS has a User-Defined region of a
couple of thousand characters. They have lately been
taking Chinese variants of Japanese characters (which
are readable but a bit old-fashioned - I can imagine
pipe-smoking professors using these forms as an
affectation) and adding them into their standard
Windows fonts; so users are getting used to these
being available. These are not in a standard.
Thirdly, they include something called the 'Gaiji
Editor' in Japanese Win95, which lets you add new
characters to the fonts on your PC within the
user-defined region. The first step was to review all
the PCs in the Tokyo office, and get one centralized
extension font file on a server. This was also fun as
people had assigned different code points to
characters on different machines, so what looked
correct on your word processor was a black square on
mine. Effectively, each company has its own custom
encoding a bit bigger than the standard.
Clearly, none of these extensions would convert
automatically to the other platforms.
Once we actually had an agreed list of code points, we
scanned the database by eye and made sure that the
relevant people were using them. We decided that
space for 128 User-Defined Characters would be
allowed. We thought we would need a wrapper around
Unilib to intercept these values and do a special
conversion; but to our amazement it worked! Somebody
had already figured out a mapping for at least 1000
characters for all the Japanese encodings, and they did
the round trips from Shift-JIS to Unicode to DBCS and
back. So the conversion problem needed less code than
we thought. This mapping is not defined in a standard
AFAIK (certainly not for DBCS anyway).
We did, however, need some really impressive
validation. When you input a name or address on any
of the platforms, the system should say
(a) is it valid for my encoding?
(b) will it fit in the available field space in the
other platforms?
(c) if it contains user-defined characters, are they
the ones we know about, or is this a new guy who will
require updates to our fonts etc.?
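A sketch of such a validator (the agreed code point set and the AS400 field limit are invented for illustration; cp932 stands in for Shift-JIS plus the IBM/Microsoft extensions):

```python
KNOWN_UDC = {(0xF0, 0x40), (0xF0, 0x41)}   # assumed agreed code points

def validate_field(data, other_limit=40):
    """Run checks (a), (b) and (c) on a Shift-JIS byte string."""
    try:                                   # (a) valid for this encoding?
        data.decode("cp932")
    except UnicodeDecodeError:
        return ["not valid Shift-JIS/cp932"]
    problems, runs, in_wide, i = [], 0, False, 0
    while i < len(data):
        b = data[i]
        wide = b >= 0x81 and not (0xA1 <= b <= 0xDF)
        if wide and not in_wide:
            runs += 1                      # each run costs shift-out/shift-in
        in_wide = wide
        i += 2 if wide else 1
        if wide and 0xF0 <= b <= 0xF9:     # (c) user-defined lead byte?
            if (b, data[i - 1]) not in KNOWN_UDC:
                problems.append("unknown user-defined character")
    if len(data) + 2 * runs > other_limit: # (b) fits the DBCS field?
        problems.append("too long for the AS400 field")
    return problems

assert validate_field(b"hello") == []
assert validate_field(bytes([0xF0, 0x42])) == ["unknown user-defined character"]
```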
Finally, we got back to the display problems. Our
chosen range had a particular first byte. We built a
miniature font with the characters we needed starting
in the lower half of the code page. I then
generalized my name-printing routine to say 'if the
first character is XX, throw it away, and print the
subsequent character in our custom font'. This worked
beautifully - not only could we print everything, we
were using type 1 embedded fonts for the user defined
characters, so we could distill it and also capture it
for our internal document imaging systems.
So, that is roughly what is involved in building a
Japanese client reporting system that spans several
platforms.
I then moved over to the web team to work on our
online trading system for Japan, where I am now -
people will be able to open accounts and invest on the
web. The first stage was to prove it all worked.
With HTML, Java and the Web, I had high hopes, which
have mostly been fulfilled - we set an option in the
database connection to say 'this is a UTF8 database',
and Java converts it to Unicode when reading the
results, and we set another option saying 'the output
stream should be Shift-JIS' when we spew out the HTML.
There is one limitation: Java sticks to the JIS0208
standard, so the 360 extra IBM/Microsoft Kanji and our
user defined characters won't work on the web. You
cannot control the fonts on someone else's web
browser; management accepted this because we gave them
no alternative. Certain customers will need to be
warned, or asked to suggest a standard version of a
character if they want to see their name on the web.
I really hope the web actually brings character usage
in line with the standard in due course, as it will
save a fortune.
Our system is multi-language - when a customer logs
in, we want to say 'You are a Japanese customer of our
Tokyo Operation, so you see page X in language Y'.
The language strings are all kept in UTF8 in XML
files, so the same file can hold many languages. This
and the database are the real-world reasons why you
want to store stuff in UTF8. There are very few tools
to let you view UTF8, but luckily there is a free Word
Processor that lets you type Japanese and save it in
any encoding; so we can cut and paste between
Shift-JIS and UTF8 as needed.
And that's it. No climactic endings and a lot of real
world mess, just like life in IT. But hopefully this
gives you a feel for some of the practical stuff
internationalisation projects have to deal with. See
my other mail for actual suggestions.
- Andy Robinson
I got the wish list below. Anyone care to comment on how close we are
on fulfilling some or all of this?
--Guido van Rossum (home page: http://www.python.org/~guido/)
------- Forwarded Message
Date: Thu, 04 Nov 1999 20:26:54 +0700
From: "Claudio Ramón" <rmn70(a)hotmail.com>
To: guido(a)python.org
Hello,
I'm a Python user (excuse my English, I'm Spanish and...). I think it is a
very complete language and I use it to solve statistics, physics,
mathematics, chemistry and biology problems. I'm not an
experienced programmer, only a scientist with problems to solve.
The motive of this letter is to explain some needs that I have in
using Python, which I hope the next versions might address...
* GNU CC for Win32 compatibility (compilation of the Python interpreter and
the "Freeze" utility). I think MingWin32 (Mumit Khan) is a good alternative,
avoiding use of the cygwin dll.
* Add low level programming capabilities for system access and for speeding up
code fragments, avoiding the need for C/C++ or Java code. Python, I think, must
be a complete programming language in the "programming for everybody" philosophy.
* Incorporate WxWindows (wxPython) and/or Gtk+ (a win32 port now exists) GUIs
in the standard distribution. For example, wxPython provides an HTML browser,
which is very important for document presentations. And WxWindows and Gtk+ are
faster than Tk.
* Incorporate a database system in the standard library distribution, if
possible with relational and document capabilities, and with import facilities
for DBASE, Paradox and MSAccess files.
* Incorporate an XML/HTML/Math-ML editor/browser with graphics capability (if
possible with XML as the internal file format), and if possible with
Microsoft Word import/export facilities. For example, the AbiWord project could
be an alternative, but it lacks a programming language. If we could make Python
the programming language for the AbiWord project...
Thanks.
Ramón Molina.
rmn70(a)hotmail.com
______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com
------- End of Forwarded Message
Hi all...
I've updated some of the modules at http://www.lyra.org/greg/python/.
Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and
a new imputil.py. The latter will be updated again RSN with some patches
from Jim Ahlstrom.
Besides some tweaks/fixes/etc, I've also clarified the ownership and
licensing of the things. httplib and davlib are (C) Guido, licensed under
the Python license (well... anything he chooses :-). qp_xml and imputil
are still Public Domain. I also added some comments into the headers to
note where they come from (I've had a few people remark that they ran
across the module but had no idea who wrote it or where to get updated
versions :-), and I inserted a CVS Id to track the versions (yes, I put
them into CVS just now).
Note: as soon as I figure out the paperwork or whatever, I'll also be
skipping the whole "wetsign.txt" thingy and just transfer everything to
Guido. He remarked a while ago that he will finally own some code in the
Python distribution(!) despite not writing it :-)
I might encourage others to consider the same...
Cheers,
-g
--
Greg Stein, http://www.lyra.org/
James C. Ahlstrom writes:
> Guido van Rossum wrote:
> > I got the wish list below. Anyone care to comment on how close we are
> > on fulfilling some or all of this?
>
> > * GNU CC for Win32 compatibility (compilation of the Python interpreter and
> > the "Freeze" utility). I think MingWin32 (Mumit Khan) is a good alternative,
> > avoiding use of the cygwin dll.
>
> I don't know what this means.
mingw32: 'minimalist gcc for win32'. it's gcc on win32 without trying
to be unix. It links against crtdll, so for example it can generate
small executables that run on any win32 platform. It's also an
alternative to plunking down money every year to keep up with MSVC++.
I used to use mingw32 a lot, and it's even possible to set up egcs to
cross-compile to it. At one point using egcs on linux I was able to
build a stripped-down python.exe for win32...
http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/
-Sam
I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip
Since there may be legal problems with LNCS, I will disable the
link shortly (so those of you who have not received a copy and are
interested in reading it, please grab it quickly)
If prof. Saltzer agrees (and if he legally can) to put it on his web page,
I guess that the paper will show up at http://mit.edu/saltzer/
Jeremy, could you please check this with prof. Saltzer? (This version
might need some corrections due to the OCR process, despite that I've
made a significant effort to clean it up)
--
Vladimir MARANGOZOV | Vladimir.Marangozov(a)inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
I have for some time been wondering about the usefulness of this
mailing list. It seems to have produced staggeringly few results
since inception.
This is not a criticism of any individual, but of the process. It is
proof in my mind of how effective the benevolent dictator model is,
and how ineffective a language run by committee would be.
This "committee" never seems to be capable of reaching a consensus on
anything. A number of issues don't seem to provoke any responses. As
a result, many things seem to die a slow and lingering death. Often
there is lots of interesting discussion, but still precious few
results.
In the pre python-dev days, the process seemed easier - we mailed
Guido directly, and he either stated "yea" or "nay" - maybe we didn't
get the response we hoped for, but at least we got a response. Now,
we have the result that even if Guido does enter into a thread, the
noise seems to drown out any hope of getting anything done. Guido
seems to be faced with the dilemma of asserting his dictatorship in
the face of many dissenting opinions from many people he respects, or
putting it in the too hard basket. I fear the latter is the easiest
option. At the end of this mail I list some of the major threads over
the last few months, and can't see a single thread that has resulted
in a CVS checkin, and only one that has resulted in agreement. This,
to my mind at least, is proof that things are really not working.
I long for the "good old days" - take the replacement of "ni" with
built-in functionality, for example. I posit that if this was
discussed on python-dev, it would have caused a huge flood of mail,
and nothing remotely resembling a consensus. Instead, Guido simply
wrote an essay and implemented some code that he personally liked. No
debate, no discussion. Still an excellent result. Maybe not a
perfect result, but a result nonetheless.
However, Guido's time is becoming increasingly limited. So should we
consider moving to a "benevolent lieutenant" model, in conjunction
with re-ramping up the SIGs? This would provide 2 ways to get things
done:
* A new SIG. Take relative imports, for example. If we really do
need a change in this fairly fundamental area, a SIG would be
justified ("import-sig"). The responsibility of the SIG is to form a
consensus (and code that reflects it), and report back to Guido (and
the main newsgroup) with the result of this. It worked well for RE,
and allowed those of us not particularly interested to keep out of the
debate. If the SIG can not form consensus, then tough - it dies - and
should not be mourned. Presumably Guido would keep a watchful eye
over the SIG, providing direction where necessary, but in general stay
out of the day to day traffic. New SIGs seem to have stopped since
this list's creation, and it seems that issues that should be discussed
in new SIGs are now discussed here.
* Guido could delegate some of his authority to a single individual
responsible for a certain limited area - a benevolent lieutenant. We
might have a lieutenant responsible for each of several areas, who could
only exercise their authority over small, trivial changes. Eg, the "getopt
helper" thread - if a lieutenant was given authority for the "standard
library", they could simply make a yea or nay decision, and present it
to Guido. Presumably Guido trusts this person he delegated to enough
that the majority of the lieutenant's recommendations would be
accepted. Presumably there would be a small number of lieutenants,
and they would then become the new "python-dev" - say up to 5 people.
This list would then discuss high level strategies and seek direction from
each other when things get murky. This select group of people may not
(indeed, probably would not) include me, but I would have no problem
with that - I would prefer to see results achieved than have my own
ego stroked by being included in a select, but ineffective group.
In parting, I repeat this is not a direct criticism, simply an
observation of the last few months. I am on this list, so I am
definitely as guilty as anyone else - which is "not at all" - i.e., no
one is guilty; I simply see it as endemic to a committee with people
of diverse backgrounds, skills and opinions.
Any thoughts?
Long live the dictator! :-)
Mark.
Recent threads, and my take on the results:
* getopt helper?
Too much noise regarding semantic changes.
* Alternative Approach to Relative Imports
* Relative package imports
* Path hacking
* Towards a Python based import scheme
Too much noise - no one could really agree on the semantics.
Implementation thrown in the ring, and promptly forgotten.
* Corporate installations
Very young, but no result at all.
* Embedding Python when using different calling conventions
Quite young, but no result as yet, and I have no reason to believe
there will be.
* Catching "return" and "return expr" at compile time
Seemed to be blessed - yay! Don't believe I have seen a check-in yet.
* More Python command-line features
Seemed general agreement, but nothing happened?
* Tackling circular dependencies in 2.0?
Lots of noise, but no results other than "GC may be there in 2.0"
* Buffer interface in abstract.c
Determined it could break - no solution proposed. Lots of noise
regarding whether it is a good idea at all!
* mmapfile module
No result.
* Quick-and-dirty weak references
No result.
* Portable "spawn" module for core?
No result.
* Fake threads
Seemed to spawn stackless Python, but in the face of Guido being "at
best, lukewarm" about this issue, I would again have to conclude "no
result". An authoritative "no" in this area may have saved lots of
effort and heartache.
* add Expat to 1.6
No result.
* I'd like list.pop to accept an optional second argument giving a
default value
No result
* etc
No result.
[Extracted from the psa-members list...]
Gordon McMillan wrote:
>
> Chris Fama wrote,
> > And now the rub: the exact same function definition has passed
> > through byte-compilation perfectly OK many times before with no
> > problems... of course, this points rather clearly to the
> > preceding code, but it illustrates a failing in Python's syntax
> > error messages, and IMHO a fairly serious one at that, if this is
> > indeed so.
>
> My simple experiments refuse to compile a "del getattr(..)" at
> all.
Hmm, it seems to be a fairly generic error:
>>> del f(x,y)
SyntaxError: can't assign to function call
How about changing the com_assign_trailer function in Python/compile.c
to:
static void
com_assign_trailer(c, n, assigning)
	struct compiling *c;
	node *n;
	int assigning;
{
	REQ(n, trailer);
	switch (TYPE(CHILD(n, 0))) {
	case LPAR: /* '(' [exprlist] ')' */
		com_error(c, PyExc_SyntaxError,
			  assigning ? "can't assign to function call" :
			              "can't delete expression");
		break;
	case DOT: /* '.' NAME */
		com_assign_attr(c, CHILD(n, 1), assigning);
		break;
	case LSQB: /* '[' subscriptlist ']' */
		com_subscriptlist(c, CHILD(n, 1), assigning);
		break;
	default:
		com_error(c, PyExc_SystemError, "unknown trailer type");
	}
}
or something along those lines...
BTW, has anybody tried my import patch recently? I haven't heard
any criticism since posting it and wonder what made the list fall
asleep over the topic :-)
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 61 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/