[I18n-sig] Re: [Python-Dev] Unicode debate
Andy Robinson
andy@reportlab.com
Fri, 28 Apr 2000 17:12:39 +0100
Guido> In practical applications that manipulate text, encodings creep up
Guido> all the time. I remember a talk or message by Andy Robinson about
Guido> the messiness of producing printed reports in Japanese for a large
Guido> investment firm. Most of the issues that took his time had to do
Guido> with encodings, if I recall correctly. (Andy, do you remember what
Guido> I'm talking about? Do you have a URL?)
Guido>
I attach the 'Case Study' I posted to the python-dev list
when I first joined. If anyone else can tell their own
stories, however long or short, I feel it would be a
useful addition to the present discussion.
- Andy
>To: python-dev@python.org
>Subject: [Python-Dev] Internationalisation Case Study
>From: Andy Robinson <captainrobbo@yahoo.com>
>Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST)
>
>Guido has asked me to get involved in this discussion,
>as I've been working practically full-time on i18n for
>the last year and a half and have done quite a bit
>with Python in this regard. I thought the most
>helpful thing would be to describe the real-world
>business problems I have been tackling so people can
>understand what one might want from an encoding
>toolkit. In this (long) post I have included:
>1. who I am and what I want to do
>2. useful sources of info
>3. a real world i18n project
>4. what I'd like to see in an encoding toolkit
>
>
>Grab a coffee - this is a long one.
>
>1. Who I am
>--------------
>Firstly, credentials. I'm a Python programmer by
>night, and by day whenever I can involve it in my
>work - perhaps 20% of the time. More relevantly, I
>did a postgrad course in Japanese Studies and lived in
>Japan for about two years; in 1990 when I returned, I
>was speaking fairly fluently and could read a
>newspaper with regular reference to a dictionary.
>Since then my Japanese has atrophied badly, but it is
>good enough for IT purposes. For the last year and a
>half I have been internationalizing a lot of systems -
>more on this below.
>
>My main personal interest is that I am hoping to
>launch a company using Python for reporting, data
>cleaning and transformation. An encoding library is
>sorely needed for this.
>
>2. Sources of Knowledge
>------------------------------
>We should really go for world class advice on this.
>Some people who could really contribute to this
>discussion are:
>- Ken Lunde, author of "CJKV Information Processing"
>and head of Asian Type Development at Adobe.
>- Jeffrey Friedl, author of "Mastering Regular
>Expressions", and a long time Japan resident and
>expert on things Japanese
>- Maybe some of the Ruby community?
>
>I'll list books, URLs, etc. for anyone who needs them
>on request.
>
>3. A Real World Project
>----------------------------
>18 months ago I was offered a contract with one of the
>world's largest investment management companies (which
>I will nickname HugeCo), who (after many years of having
>analysts out there) were launching a business in Japan
>to attract savers; due to recent legal changes,
>Japanese people can now freely buy into mutual funds
>run by foreign firms. Given the 2% they historically
>get on their savings, and the 12% that US equities
>have returned for most of this century, this is a
>business with huge potential. I've been there for a
>while now, rotating through many different IT projects.
>
>HugeCo runs its non-US business out of the UK. The
>core deal-processing business runs on IBM AS400s.
>These are kind of a cross between a relational
>database and a file system, and speak their own
>encoding called EBCDIC. Five years ago the AS400 had
>limited connectivity to everything else, so they also started
>deploying Sybase databases on Unix to support some
>functions. This means 'mirroring' data between the
>two systems on a regular basis. IBM has always
>included encoding information on the AS400 and it
>converts from EBCDIC to ASCII on request with most of
>the transfer tools (FTP, database queries, etc.).
>
>To make things work for Japan, everyone realised that
>a double-byte representation would be needed.
>Japanese has about 7000 characters in most IT-related
>character sets, and there are a lot of ways to store
>it. Here's a potted language lesson. (Apologies to
>people who really know this field -- I am not going to
>be fully pedantic or this would take forever).
>
>Japanese includes two phonetic alphabets (each with
>about 80-90 characters), the thousands of Kanji, and
>English characters, often all in the same sentence.
>The first attempt to display something was to
>make a single-byte character set which included
>ASCII, and a simplified (and very ugly) katakana
>alphabet in the upper half of the code page. So you
>could spell out the sounds of Japanese words using
>'half-width katakana'.
>
>The basic 'character set' is Japanese Industrial
>Standard 0208 ("JIS"). This was defined in 1978, the
>first official Asian character set to be defined by a
>government. It can be thought of as a printed chart
>showing the characters - it does not define their
>storage on a computer. It defines a logical 94 x 94
>grid, and each character has an index in this grid.
>
>The "JIS" encoding was a way of mixing ASCII and
>Japanese in text files and emails. Each Japanese
>character had a double-byte value. It had 'escape
>sequences' to say 'You are now entering ASCII
>territory' or the opposite. A few years later,
>Microsoft came up with Shift-JIS, a smarter encoding.
>This basically said "Look at the next byte. If below
>128, it is ASCII; if between A and B, it is a
>half-width katakana; if between B and C, it is the
>first half of a double-byte character and the next
>one is the second half". Extended Unix Code (EUC)
>does similar tricks.
>Both have the property that there are no control
>characters, and ASCII is still ASCII. There are a few
>other encodings too.
>
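>To make the lead-byte trick concrete, here is a rough
>Python sketch of the classification Shift-JIS implies;
>the numeric ranges are the commonly quoted ones and
>the function name is mine, purely for illustration:
>
>def classify_sjis(data):
>    """Yield (kind, chunk) pairs for a Shift-JIS byte string."""
>    i = 0
>    while i < len(data):
>        b = data[i]
>        if b < 0x80:                                  # plain ASCII
>            kind, size = 'ascii', 1
>        elif 0xA1 <= b <= 0xDF:                       # half-width katakana
>            kind, size = 'halfwidth_katakana', 1
>        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # double-byte lead
>            kind, size = 'double_byte', 2
>        else:
>            kind, size = 'invalid', 1                 # corrupt input
>        yield kind, data[i:i+size]
>        i += size
>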
>Unfortunately for me and HugeCo, IBM had their own
>standard before the Japanese government did, and it
>differs; it is most commonly called DBCS (Double-Byte
>Character Set). This involves shift-in and shift-out
>sequences (0x0E and 0x0F, cannot remember which way
>round), so you can mix single and double bytes in a
>field. And we used AS400s for our core processing.
>
>So, back to the problem. We had a FoxPro system using
>Shift-JIS on the desks in Japan which we wanted to
>replace in stages, and an AS400 database to replace it
>with. The first stage was to hook them up so names
>and addresses could be uploaded to the AS400, and data
>files consisting of daily report input could be
>downloaded to the PCs. The AS400 supposedly had a
>library which did the conversions, but no one at IBM
>knew how it worked. The people who did all the
>evaluations had basically proved that 'Hello World' in
>Japanese could be stored on an AS400, but never looked
>at the conversion issues until mid-project. Not only
>did we need a conversion filter, we had the problem
>that the character sets were of different sizes. So
>it was possible - indeed, likely - that some of our
>ten thousand customers' names and addresses would
>contain characters only on one system or the other,
>and fail to
>survive a round trip. (This is the absolute key issue
>for me - will a given set of data survive a round trip
>through various encoding conversions?)
>
>We figured out how to get the AS400 to do the
>conversions during a file transfer in one direction,
>and I wrote some Python scripts to make up files with
>each official character in JIS on a line; these went
>up with conversion, came back binary, and I was able
>to build a mapping table and 'reverse engineer' the
>IBM encoding. It was straightforward in theory, "fun"
>in practice. I then wrote a Python library which knew
>about the AS400 and Shift-JIS encodings, and could
>translate a string between them. It could also detect
>corruption and warn us when it occurred. (This is
>another key issue - you will often get badly encoded
>data, half a kanji or a couple of random bytes, and
>need to be clear on your strategy for handling it in
>any library). It was slow, but it got us our gateway
>in both directions, and it warned us of bad input. 360
>characters in the DBCS encoding actually appear twice,
>so perfect round trips are impossible, but practically
>you can survive with some validation of input at both
>ends. The final story was that our names and
>addresses were mostly safe, but a few obscure symbols
>weren't.
>
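>For what it's worth, the round-trip test itself is
>tiny once you have the two mapping tables; a sketch,
>with table names of my own invention standing in for
>the dictionaries built from the file-transfer
>experiment:
>
>def round_trip_casualties(chars, sjis_to_ibm, ibm_to_sjis):
>    """Characters that do not survive Shift-JIS -> DBCS -> Shift-JIS."""
>    casualties = []
>    for ch in chars:
>        ibm = sjis_to_ibm.get(ch)
>        if ibm is None or ibm_to_sjis.get(ibm) != ch:
>            casualties.append(ch)
>    return casualties
>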
>A big issue was that field lengths varied. An address
>field 40 characters long on a PC might grow to 42 or
>44 on an AS400 because of the shift characters, so the
>software would truncate the address during import, and
>cut a kanji in half. This resulted in a string that
>was illegal DBCS, and errors in the database. To
>guard against this, you need really picky input
>validation. You not only ask 'is this string valid
>Shift-JIS', you check it will fit on the other system
>too.
>
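>A sketch of that second check, assuming one
>shift-out/shift-in pair per run of double-byte
>characters (my reading of the DBCS rules) and re-using
>the classify_sjis() sketch above:
>
>def dbcs_length(sjis_bytes):
>    """Bytes the string will occupy on the AS400, shifts included."""
>    length, in_double = 0, False
>    for kind, chunk in classify_sjis(sjis_bytes):
>        if kind == 'double_byte' and not in_double:
>            length += 1                 # shift-out opens the run
>            in_double = True
>        elif kind != 'double_byte' and in_double:
>            length += 1                 # shift-in closes the run
>            in_double = False
>        length += len(chunk)
>    return length + (1 if in_double else 0)
>
>def fits_on_as400(sjis_bytes, field_width):
>    return dbcs_length(sjis_bytes) <= field_width
>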
>The next stage was to bring in our Sybase databases.
>Sybase make a Unicode database, which works like the
>usual one except that all your SQL code suddenly
>becomes case sensitive - more (unrelated) fun when
>you have 2000 tables. Internally it stores data in
>UTF8, a 'rearrangement' of Unicode which is much
>safer to store in conventional systems. Basically, a
>UTF8 character is between one and three bytes, no
>null or control bytes ever appear inside a multi-byte
>character, and the ASCII characters are still the
>same ASCII characters. UTF8<->Unicode involves some
>bit twiddling but is one-to-one and entirely
>algorithmic.
>
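>The bit twiddling for the one-to-three-byte cases
>looks roughly like this - a sketch for code points in
>the 16-bit range only; Python's own codecs do it for
>you, of course:
>
>def utf8_encode_char(cp):
>    """Encode one Unicode code point (below 0x10000) as UTF8 bytes."""
>    if cp < 0x80:
>        return bytes([cp])              # ASCII stays ASCII
>    if cp < 0x800:
>        return bytes([0xC0 | (cp >> 6),
>                      0x80 | (cp & 0x3F)])
>    return bytes([0xE0 | (cp >> 12),
>                  0x80 | ((cp >> 6) & 0x3F),
>                  0x80 | (cp & 0x3F)])
>
># utf8_encode_char(0x3042) == b'\xe3\x81\x82' == u'\u3042'.encode('utf-8')
>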
>We had a product to 'mirror' data between AS400 and
>Sybase, which promptly broke when we fed it Japanese.
>The company bought a library called Unilib to do
>conversions, and started rewriting the data mirror
>software. This library (like many) uses Unicode as a
>central point in all conversions, and offers most of
>the world's encodings. We wanted to test it, and used
>the Python routines to put together a regression
>test. As expected, it was mostly right but had some
>differences, which we were at least able to document.
>
>We also needed to rig up a daily feed from the legacy
>FoxPro database into Sybase while it was being
>replaced (about six months). We took the same
>library, built a DLL wrapper around it, and I
>interfaced to this with DynWin, so we were able to do
>the low-level string conversion in compiled code and
>the high-level control in Python. A FoxPro batch job
>wrote out delimited text in Shift-JIS; Python read
>this in, ran it through the DLL to convert it to UTF8,
>wrote that out as UTF8 delimited files, and ftp'ed
>them to an 'in' directory on the Unix box ready for
>daily import.
>At this point we had a lot of fun with field widths -
>Shift-JIS is much more compact than UTF8 when you have
>a lot of kanji (e.g. address fields).
>
>Another issue was half-width katakana. These were the
>earliest attempt to get some form of Japanese out of a
>computer, and are single-byte characters above 128 in
>Shift-JIS - but are not part of the JIS0208 standard.
>
>They look ugly and are discouraged; but when you are
>entering a long address in a field of a database, and
>it won't quite fit, the temptation is to go from
>two-bytes-per-character to one (just hit F7 in
>Windows) to save space. Unilib rejected these (as
>would Java), but has optional modes to preserve them
>or 'expand them out' to their full-width equivalents.
>
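>If the text is already in Unicode, the 'expand them
>out' step can be done with NFKC normalization, which
>folds half-width katakana to their full-width forms
>(along with a few other compatibility characters); a
>sketch, not what Unilib actually did:
>
>import unicodedata
>
>def expand_halfwidth(text):
>    """Fold half-width katakana (among other things) to full-width."""
>    return unicodedata.normalize('NFKC', text)
>
># expand_halfwidth(u'\uff83\uff9e\uff70\uff80') == u'\u30c7\u30fc\u30bf'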
>
>The final technical step was our reports package.
>This is a 4GL using a really horrible 1980s Basic-like
>language which reads in fixed-width data files and
>writes out Postscript; you write programs saying 'go
>to x,y' and 'print customer_name', and can build up
>anything you want out of that. It's a monster to
>develop in, but when done it really works -
>million page jobs no problem. We had bought into this
>on the promise that it supported Japanese; actually, I
>think they had got the equivalent of 'Hello World' out
>of it, since we had a lot of problems later.
>
>The first stage was that the AS400 would send down
>fixed width data files in EBCDIC and DBCS. We ran
>these through a C++ conversion utility, again using
>Unilib. We had to filter out and warn about corrupt
>fields, which the conversion utility would reject.
>Surviving records then went into the reports program.
>
>It then turned out that the reports program only
>supported some of the Japanese alphabets.
>Specifically, it had a built-in font switching system
>whereby when it encountered ASCII text, it would flip
>to the most recent single-byte font, and when it found
>a byte above 127, it would flip to a double-byte font.
>This is because many Chinese fonts do (or did)
>not include English characters, or included really
>ugly ones. This was wrong for Japanese, and made the
>half-width katakana unprintable. I found out that I
>could control fonts if I printed one character at a
>time with a special escape sequence, so I wrote my own
>bit-scanning code (tough in a language without ord()
>or bitwise operations) to examine a string, classify
>every byte, and control the fonts the way I wanted.
>So a special subroutine is used for every name or
>address field. This is apparently not unusual in GUI
>development (especially web browsers) - you rarely
>find a complete Unicode font, so you have to switch
>fonts on the fly as you print a string.
>
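>In Python the same classify-and-switch idea is only a
>few lines; a sketch re-using the classify_sjis()
>function from earlier, with made-up font names
>standing in for the real escape sequences:
>
>FONT_FOR = {'ascii': 'LatinFont', 'halfwidth_katakana': 'KanaFont',
>            'double_byte': 'KanjiFont', 'invalid': 'LatinFont'}
>
>def font_runs(sjis_bytes):
>    """Split a Shift-JIS string into (font, bytes) runs."""
>    runs = []
>    for kind, chunk in classify_sjis(sjis_bytes):
>        font = FONT_FOR[kind]
>        if runs and runs[-1][0] == font:
>            runs[-1] = (font, runs[-1][1] + chunk)
>        else:
>            runs.append((font, chunk))
>    return runs
>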
>After all of this, we had a working system and knew
>quite a bit about encodings. Then the curve ball
>arrived: User Defined Characters!
>
>It is not true to say that there are exactly 6879
>characters in Japanese, any more than one can count
>the number of languages on the Indian sub-continent
>or the types of cheese in France. There are historical
>variations and they evolve. Some people's names got
>missed out, and others like to write a kanji in an
>unusual way. Others arrived from China where they
>have more complex variants of the same characters.
>Despite the Japanese government's best attempts, these
>people have dug their heels in and want to keep their
>names the way they like them. My first reaction was
>'Just Say No' - I basically said that if one of these
>customers (14 out of a database of 8000) could show me
>a tax form or phone bill with the correct UDC on it,
>we would implement it but not otherwise (the usual
>workaround is to spell their name phonetically in
>katakana). But our marketing people put their foot
>down.
>
>A key factor is that Microsoft has 'extended the
>standard' a few times. First of all, Microsoft and
>IBM include an extra 360 characters in their code page
>which are not in the JIS0208 standard. This is well
>understood, and most encoding toolkits know that 'Code
>Page 932' is Shift-JIS plus a few extra characters.
>Secondly, Shift-JIS has a User-Defined region of a
>couple of thousand characters. They have lately been
>taking Chinese variants of Japanese characters (which
>are readable but a bit old-fashioned - I can imagine
>pipe-smoking professors using these forms as an
>affectation) and adding them into their standard
>Windows fonts; so users are getting used to these
>being available. These are not in a standard.
>Thirdly, they include something called the 'Gaiji
>Editor' in Japanese Win95, which lets you add new
>characters to the fonts on your PC within the
>user-defined region. The first step was to review all
>the PCs in the Tokyo office, and get one centralized
>extension font file on a server. This was also fun as
>people had assigned different code points to
>characters on different machines, so what looked
>correct on your word processor was a black square on
>mine. Effectively, each company has its own custom
>encoding a bit bigger than the standard.
>
>Clearly, none of these extensions would convert
>automatically to the other platforms.
>
>Once we actually had an agreed list of code points, we
>scanned the database by eye and made sure that the
>relevant people were using them. We decided that
>space for 128 User-Defined Characters would be
>allowed. We thought we would need a wrapper around
>Unilib to intercept these values and do a special
>conversion; but to our amazement it worked! Somebody
>had already figured out a mapping for at least 1000
>characters for all the Japanese encodings, and they did
>the round trips from Shift-JIS to Unicode to DBCS and
>back. So the conversion problem needed less code than
>we thought. This mapping is not defined in a standard
>AFAIK (certainly not for DBCS anyway).
>
>We did, however, need some really impressive
>validation. When you input a name or address on any
>of the platforms, the system should say
>(a) is it valid for my encoding?
>(b) will it fit in the available field space in the
>other platforms?
>(c) if it contains user-defined characters, are they
>the ones we know about, or is this a new guy who will
>require updates to our fonts etc.?
>
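>Pulling the earlier sketches together, the three
>checks come to something like this; the 0xF0-0xF9
>lead-byte test for the user-defined region and all
>the names are my assumptions for illustration, not
>our actual code:
>
>def validate_field(sjis_bytes, as400_width, known_udcs):
>    """Return a list of problems for one name or address field."""
>    problems = []
>    for kind, chunk in classify_sjis(sjis_bytes):
>        if kind == 'invalid':                                    # check (a)
>            problems.append('not valid Shift-JIS: %r' % chunk)
>        elif kind == 'double_byte' and 0xF0 <= chunk[0] <= 0xF9:
>            if chunk not in known_udcs:                          # check (c)
>                problems.append('unknown user-defined char: %r' % chunk)
>    if dbcs_length(sjis_bytes) > as400_width:                    # check (b)
>        problems.append('will not fit in the AS400 field')
>    return problems
>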
>Finally, we got back to the display problems. Our
>chosen range had a particular first byte. We built a
>miniature font with the characters we needed starting
>in the lower half of the code page. I then
>generalized my name-printing routine to say 'if the
>first character is XX, throw it away, and print the
>subsequent character in our custom font'. This worked
>beautifully - not only could we print everything, we
>were using type 1 embedded fonts for the user defined
>characters, so we could distill it and also capture it
>for our internal document imaging systems.
>
>So, that is roughly what is involved in building a
>Japanese client reporting system that spans several
>platforms.
>
>I then moved over to the web team to work on our
>online trading system for Japan, where I am now -
>people will be able to open accounts and invest on the
>web. The first stage was to prove it all worked.
>With HTML, Java and the Web, I had high hopes, which
>have mostly been fulfilled - we set an option in the
>database connection to say 'this is a UTF8 database',
>and Java converts it to Unicode when reading the
>results, and we set another option saying 'the output
>stream should be Shift-JIS' when we spew out the HTML.
>There is one limitation: Java sticks to the JIS0208
>standard, so the 360 extra IBM/Microsoft Kanji and our
>user defined characters won't work on the web. You
>cannot control the fonts on someone else's web
>browser; management accepted this because we gave them
>no alternative. Certain customers will need to be
>warned, or asked to suggest a standard version of a
>character if they want to see their name on the web.
>I really hope the web actually brings character usage
>in line with the standard in due course, as it will
>save a fortune.
>
>Our system is multi-language - when a customer logs
>in, we want to say 'You are a Japanese customer of our
>Tokyo Operation, so you see page X in language Y'.
>The language strings are all kept in UTF8 in XML
>files, so the same file can hold many languages. This
>and the database are the real-world reasons why you
>want to store stuff in UTF8. There are very few tools
>to let you view UTF8, but luckily there is a free Word
>Processor that lets you type Japanese and save it in
>any encoding; so we can cut and paste between
>Shift-JIS and UTF8 as needed.
>
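>For the curious, reading such a file back is only a
>few lines of Python; the element and attribute names
>here are invented for illustration, not our actual
>schema:
>
>import xml.etree.ElementTree as ET
>
>def load_strings(path, lang):
>    """Return {string_id: text} for one language from a UTF8 XML file."""
>    table = {}
>    for s in ET.parse(path).getroot().iter('string'):
>        for t in s.iter('text'):
>            if t.get('lang') == lang:
>                table[s.get('id')] = t.text
>    return table
>
># greetings = load_strings('strings.xml', 'ja')
>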
>And that's it. No climactic endings and a lot of real
>world mess, just like life in IT. But hopefully this
>gives you a feel for some of the practical stuff
>internationalisation projects have to deal with. See
>my other mail for actual suggestions.
>
>- Andy Robinson
>
>=====
>Andy Robinson
>Robinson Analytics Ltd.
>------------------
>My opinions are the official policy of Robinson Analytics Ltd.
>They just vary from day to day.