
Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard. I thought the most helpful thing would be to describe the real-world business problems I have been tackling, so people can understand what one might want from an encoding toolkit. In this (long) post I have included:

1. who I am and what I want to do
2. useful sources of info
3. a real-world i18n project
4. what I'd like to see in an encoding toolkit

Grab a coffee - this is a long one.

1. Who I am
--------------
Firstly, credentials. I'm a Python programmer by night, and when I can involve it in my work, which happens perhaps 20% of the time. More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference to a dictionary. Since then my Japanese has atrophied badly, but it is good enough for IT purposes. For the last year and a half I have been internationalizing a lot of systems - more on this below. My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation. An encoding library is sorely needed for this.

2. Sources of Knowledge
------------------------------
We should really go for world-class advice on this. Some people who could really contribute to this discussion are:

- Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe.
- Jeffrey Friedl, author of "Mastering Regular Expressions", a long-time Japan resident and expert on things Japanese.
- Maybe some of the Ruby community?

I'll list books, URLs etc. for anyone who needs them on request.

3. A Real World Project
----------------------------
18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo), who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms. Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential. I've been there for a while now, rotating through many different IT projects.

HugeCo runs its non-US business out of the UK. The core deal-processing business runs on IBM AS400s. These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC. Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions. This means 'mirroring' data between the two systems on a regular basis. IBM has always included encoding information on the AS400, and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.)

To make things work for Japan, everyone realised that a double-byte representation would be needed. Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it. Here's a potted language lesson. (Apologies to people who really know this field - I am not going to be fully pedantic or this would take forever.)
Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence. The first attempt to display something was to make a single-byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page. So you could spell out the sounds of Japanese words using 'half-width katakana'.

The basic 'character set' is Japan Industrial Standard 0208 ("JIS"). This was defined in 1978, the first official Asian character set to be defined by a government. This can be thought of as a printed chart showing the characters - it does not define their storage on a computer. It defined a logical 94 x 94 grid, and each character has an index in this grid.

The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails. Each Japanese character had a double-byte value. It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite. Microsoft quickly came up with Shift-JIS, a smarter encoding. This basically said "Look at the next byte. If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half". Extended Unix Code (EUC) does similar tricks. Both have the property that there are no control characters, and ASCII is still ASCII. There are a few other encodings too.

Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set). This involves shift-out and shift-in sequences (0x0E and 0x0F), so you can mix single and double bytes in a field. And we used AS400s for our core processing.

So, back to the problem. We had a FoxPro system using Shift-JIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with. The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs. The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked. The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project.

Not only did we need a conversion filter, we had the problem that the character sets were of different sizes. So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip. (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?)

We figured out how to get the AS400 to do the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding. It was straightforward in theory, "fun" in practice. I then wrote a Python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them. It could also detect corruption and warn us when it occurred. (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library.)
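That 'look at the next byte' rule is simple enough to show. Here is a rough sketch in Python of the kind of byte classification the library did - the function name is mine, the ranges are the usual Shift-JIS / Code Page 932 ones, and trail-byte checking is left out for brevity:

    def classify_sjis(data):
        """Split a Shift-JIS byte string into (kind, bytes) chunks.

        kind is 'ascii', 'katakana' (half-width) or 'double' (two-byte char).
        Raises ValueError on truncated or illegal input.
        """
        chunks = []
        i = 0
        while i < len(data):
            b = ord(data[i:i + 1])                      # one byte, as an integer
            if b < 0x80:                                # plain ASCII
                chunks.append(('ascii', data[i:i + 1]))
                i = i + 1
            elif 0xA1 <= b <= 0xDF:                     # half-width katakana
                chunks.append(('katakana', data[i:i + 1]))
                i = i + 1
            elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # lead byte of a pair
                if i + 1 >= len(data):
                    raise ValueError('truncated double-byte character at %d' % i)
                chunks.append(('double', data[i:i + 2]))
                i = i + 2
            else:
                raise ValueError('illegal Shift-JIS byte 0x%02X at %d' % (b, i))
        return chunks

A loop like this is also where you catch the corruption: half a kanji or a stray byte simply fails to classify, and you can warn rather than pass it on.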
The library was slow, but it got us our gateway in both directions, and it warned us of bad input. 360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends. The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't.

A big issue was that field lengths varied. An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half. This resulted in a string that was illegal DBCS, and errors in the database. To guard against this, you need really picky input validation. You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too.

The next stage was to bring in our Sybase databases. Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case-sensitive - more (unrelated) fun when you have 2000 tables. Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems. Basically, a UTF8 character is between one and three bytes, the multi-byte sequences contain no null or ASCII control bytes, and the ASCII characters are still the same ASCII characters. UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic.

We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese. The company bought a library called Unilib to do conversions, and started rewriting the data mirror software. This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings. We wanted to test it, and used the Python routines to put together a regression test. As expected, it was mostly right but had some differences, which we were at least able to document.

We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months). We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin, so we were able to do the low-level string conversion in compiled code and the high-level control in Python. A FoxPro batch job wrote out delimited text in Shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, and ftp'ed them to an 'in' directory on the Unix box ready for daily import. At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields).

Another issue was half-width katakana. These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard. They look ugly and are discouraged; but when you are entering a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per-character to one (just hit F7 in Windows) to save space. Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents.
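Going back to the field-length problem: the 'is it valid, and will it fit over there?' check is only a few lines once you have decent codecs. Here is a rough sketch, assuming a Python with Unicode strings and a 'shift_jis' codec (exactly the sort of thing I want from an encoding toolkit); the growth model - one shift byte at each end of every double-byte run - is a simplification of IBM's mixed fields, and the names are made up for illustration:

    def fits_on_as400(sjis_bytes, field_width):
        """True if the Shift-JIS field is valid and fits in field_width bytes."""
        try:
            text = sjis_bytes.decode('shift_jis')       # reject corrupt input early
        except UnicodeError:
            return 0
        size = 0
        in_double = 0
        for ch in text:
            double = len(ch.encode('shift_jis')) == 2   # two bytes in Shift-JIS?
            if double != in_double:                     # entering or leaving a run
                size = size + 1                         # costs a shift-out/shift-in
                in_double = double
            if double:
                size = size + 2
            else:
                size = size + 1
        if in_double:
            size = size + 1                             # closing shift-in
        return size <= field_width

    # e.g. a 40-byte Shift-JIS address of pure kanji needs 42 bytes on the AS400.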
The final technical step was our reports package. This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out Postscript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that. It's a monster to develop in, but when done it really works - million-page jobs, no problem. We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later.

The first stage was that the AS400 would send down fixed-width data files in EBCDIC and DBCS. We ran these through a C++ conversion utility, again using Unilib. We had to filter out and warn about corrupt fields, which the conversion utility would reject. Surviving records then went into the reports program.

It then turned out that the reports program only supported some of the Japanese alphabets. Specifically, it had a built-in font-switching system: when it encountered ASCII text, it would flip to the most recently selected single-byte font, and when it found a byte above 127, it would flip to a double-byte font. This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones. This was wrong for Japanese, and made the half-width katakana unprintable. I found out that I could control fonts if I printed one character at a time with a special escape sequence, so I wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted. So a special subroutine is used for every name or address field. This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string.
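In Python terms, that special subroutine boils down to grouping a field into runs by the font each character needs. A rough sketch, again assuming Unicode strings and a Shift-JIS codec; the font names and the output commands are invented for illustration, and the real routine had to work byte by byte inside the 4GL:

    import itertools

    def font_for(ch):
        """Pick a font for one character of a field that came in as Shift-JIS."""
        if u'\uFF61' <= ch <= u'\uFF9F':
            return 'HalfWidthKana'                  # half-width katakana
        if len(ch.encode('shift_jis')) == 2:
            return 'Mincho'                         # double-byte kanji/kana
        return 'Courier'                            # plain ASCII

    def print_field(text):
        """Return pseudo print commands, one font switch per run of characters."""
        commands = []
        for font, run in itertools.groupby(text, font_for):
            commands.append('SETFONT %s' % font)
            commands.append('SHOW %s' % ''.join(run))
        return commands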
After all of this, we had a working system and knew quite a bit about encodings. Then the curve ball arrived: User Defined Characters! It is not true to say that there are exactly 6879 characters in Japanese, any more than you can count the number of languages on the Indian sub-continent or the types of cheese in France. There are historical variations and they evolve. Some people's names got missed out, and others like to write a kanji in an unusual way. Others arrived from China, where they have more complex variants of the same characters. Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them. My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it, but not otherwise (the usual workaround is to spell their name phonetically in katakana). But our marketing people put their foot down.

A key factor is that Microsoft has 'extended the standard' a few times. First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard. This is well understood, and most encoding toolkits know what 'Code Page 932' is: Shift-JIS plus a few extra characters. Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters. They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts, so users are getting used to these being available. These are not in a standard. Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region.

The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server. This was also fun, as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine. Effectively, each company has its own custom encoding, a bit bigger than the standard. Clearly, none of these extensions would convert automatically to the other platforms. Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them. We decided that space for 128 User-Defined Characters would be allowed.

We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked! Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back. So the conversion problem needed less code than we thought. This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway).

We did, however, need some really impressive validation. When you input a name or address on any of the platforms, the system should ask:
(a) is it valid for my encoding?
(b) will it fit in the available field space on the other platforms?
(c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.?
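Pulling those three checks together, here is a rough sketch in Python. I am assuming a 'cp932' codec that maps the user-defined area into Unicode's private-use range, a 'convert' callable standing in for the Unilib wrapper, and a KNOWN_UDCS table holding the agreed code points - all the names are illustrative:

    KNOWN_UDCS = set()   # the agreed 128 user-defined characters, e.g. u'\ue000'

    def validate_field(raw_sjis, convert, max_dest_bytes):
        """Return a list of problems with one name/address field (empty = ok)."""
        # (a) is it valid for my encoding?
        try:
            text = raw_sjis.decode('cp932')
        except UnicodeError:
            return ['not valid Shift-JIS / CP932']
        problems = []
        # (b) will it fit in the available field space on the other platform?
        if len(convert(raw_sjis)) > max_dest_bytes:
            problems.append('too long for the destination field')
        # (c) if it contains user-defined characters, are they ones we know about?
        for ch in text:
            if u'\ue000' <= ch <= u'\uf8ff' and ch not in KNOWN_UDCS:
                problems.append(u'unknown user-defined character %r' % ch)
        return problems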
Finally, we got back to the display problems. Our chosen range had a particular first byte. We built a miniature font with the characters we needed, starting in the lower half of the code page. I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'. This worked beautifully - not only could we print everything, we were using Type 1 embedded fonts for the user-defined characters, so we could distill it and also capture it for our internal document imaging systems.

So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms. I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web. The first stage was to prove it all worked. With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML. There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft kanji and our user-defined characters won't work on the web. You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative. Certain customers will need to be warned, or asked to suggest a standard version of a character, if they want to see their name on the web. I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune.

Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'. The language strings are all kept in UTF8 in XML files, so the same file can hold many languages. This and the database are the real-world reasons why you want to store stuff in UTF8. There are very few tools to let you view UTF8, but luckily there is a free word processor that lets you type Japanese and save it in any encoding, so we can cut and paste between Shift-JIS and UTF8 as needed.

And that's it. No climactic endings and a lot of real-world mess, just like life in IT. But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with. See my other mail for actual suggestions.

- Andy Robinson

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.