
Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard. I thought the most helpful thing would be to describe the real-world business problems I have been tackling, so people can understand what one might want from an encoding toolkit. In this (long) post I have included:

1. who I am and what I want to do
2. useful sources of info
3. a real-world i18n project
4. what I'd like to see in an encoding toolkit

Grab a coffee - this is a long one.

1. Who I am
--------------
Firstly, credentials. I'm a Python programmer by night, and when I can involve it in my work, which happens perhaps 20% of the time. More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference to a dictionary. Since then my Japanese has atrophied badly, but it is good enough for IT purposes. For the last year and a half I have been internationalizing a lot of systems - more on this below. My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation. An encoding library is sorely needed for this.

2. Sources of Knowledge
------------------------------
We should really go for world-class advice on this. Some people who could really contribute to this discussion are:

- Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe.
- Jeffrey Friedl, author of "Mastering Regular Expressions", a long-time Japan resident and expert on things Japanese.
- Maybe some of the Ruby community?

I'll list books, URLs etc. for anyone who needs them on request.

3. A Real World Project
----------------------------
18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo), who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms. Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential. I've been there for a while now, rotating through many different IT projects.

HugeCo runs its non-US business out of the UK. The core deal-processing business runs on IBM AS400s. These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC. Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions. This means 'mirroring' data between the two systems on a regular basis. IBM has always included encoding information on the AS400, and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.)

To make things work for Japan, everyone realised that a double-byte representation would be needed. Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it. Here's a potted language lesson. (Apologies to people who really know this field - I am not going to be fully pedantic or this would take forever.)
Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence. The first attempt to display something was to make a single-byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page. So you could spell out the sounds of Japanese words using 'half-width katakana'.

The basic 'character set' is Japan Industrial Standard 0208 ("JIS"). This was defined in 1978, the first official Asian character set to be defined by a government. This can be thought of as a printed chart showing the characters - it does not define their storage on a computer. It defined a logical 94 x 94 grid, and each character has an index in this grid.

The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails. Each Japanese character had a double-byte value. It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite. Microsoft quickly came up with Shift-JIS, a smarter encoding. This basically said "Look at the next byte. If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half". Extended Unix Code (EUC) does similar tricks. Both have the property that there are no control characters, and ASCII is still ASCII. There are a few other encodings too.

Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set). This involves shift-out and shift-in sequences (0x0E and 0x0F), so you can mix single and double bytes in a field. And we used AS400s for our core processing.

So, back to the problem. We had a FoxPro system using Shift-JIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with. The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs. The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked. The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project.

Not only did we need a conversion filter, we had the problem that the character sets were of different sizes. So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip. (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?)

We figured out how to get the AS400 to do the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding. It was straightforward in theory, "fun" in practice. I then wrote a Python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them. It could also detect corruption and warn us when it occurred. (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library.)
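That 'look at the next byte' rule is simple enough to show. Here is a rough sketch in Python of the kind of byte classification the library did - the function name is mine, the ranges are the usual Shift-JIS / Code Page 932 ones, and trail-byte checking is left out for brevity:

    def classify_sjis(data):
        """Split a Shift-JIS byte string into (kind, bytes) chunks.

        kind is 'ascii', 'katakana' (half-width) or 'double' (two-byte char).
        Raises ValueError on truncated or illegal input.
        """
        chunks = []
        i = 0
        while i < len(data):
            b = ord(data[i:i + 1])                      # one byte, as an integer
            if b < 0x80:                                # plain ASCII
                chunks.append(('ascii', data[i:i + 1]))
                i = i + 1
            elif 0xA1 <= b <= 0xDF:                     # half-width katakana
                chunks.append(('katakana', data[i:i + 1]))
                i = i + 1
            elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # lead byte of a pair
                if i + 1 >= len(data):
                    raise ValueError('truncated double-byte character at %d' % i)
                chunks.append(('double', data[i:i + 2]))
                i = i + 2
            else:
                raise ValueError('illegal Shift-JIS byte 0x%02X at %d' % (b, i))
        return chunks

A loop like this is also where you catch the corruption: half a kanji or a stray byte simply fails to classify, and you can warn rather than pass it on.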
The library was slow, but it got us our gateway in both directions, and it warned us of bad input. 360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends. The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't.

A big issue was that field lengths varied. An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half. This resulted in a string that was illegal DBCS, and errors in the database. To guard against this, you need really picky input validation. You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too.

The next stage was to bring in our Sybase databases. Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case-sensitive - more (unrelated) fun when you have 2000 tables. Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems. Basically, a UTF8 character is between one and three bytes, the multi-byte sequences contain no null or ASCII control bytes, and the ASCII characters are still the same ASCII characters. UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic.

We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese. The company bought a library called Unilib to do conversions, and started rewriting the data mirror software. This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings. We wanted to test it, and used the Python routines to put together a regression test. As expected, it was mostly right but had some differences, which we were at least able to document.

We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months). We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin, so we were able to do the low-level string conversion in compiled code and the high-level control in Python. A FoxPro batch job wrote out delimited text in Shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, and ftp'ed them to an 'in' directory on the Unix box ready for daily import. At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields).

Another issue was half-width katakana. These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard. They look ugly and are discouraged; but when you are entering a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per-character to one (just hit F7 in Windows) to save space. Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents.
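Going back to the field-length problem: the 'is it valid, and will it fit over there?' check is only a few lines once you have decent codecs. Here is a rough sketch, assuming a Python with Unicode strings and a 'shift_jis' codec (exactly the sort of thing I want from an encoding toolkit); the growth model - one shift byte at each end of every double-byte run - is a simplification of IBM's mixed fields, and the names are made up for illustration:

    def fits_on_as400(sjis_bytes, field_width):
        """True if the Shift-JIS field is valid and fits in field_width bytes."""
        try:
            text = sjis_bytes.decode('shift_jis')       # reject corrupt input early
        except UnicodeError:
            return 0
        size = 0
        in_double = 0
        for ch in text:
            double = len(ch.encode('shift_jis')) == 2   # two bytes in Shift-JIS?
            if double != in_double:                     # entering or leaving a run
                size = size + 1                         # costs a shift-out/shift-in
                in_double = double
            if double:
                size = size + 2
            else:
                size = size + 1
        if in_double:
            size = size + 1                             # closing shift-in
        return size <= field_width

    # e.g. a 40-byte Shift-JIS address of pure kanji needs 42 bytes on the AS400.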
The final technical step was our reports package. This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out Postscript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that. It's a monster to develop in, but when done it really works - million-page jobs, no problem. We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later.

The first stage was that the AS400 would send down fixed-width data files in EBCDIC and DBCS. We ran these through a C++ conversion utility, again using Unilib. We had to filter out and warn about corrupt fields, which the conversion utility would reject. Surviving records then went into the reports program.

It then turned out that the reports program only supported some of the Japanese alphabets. Specifically, it had a built-in font-switching system: when it encountered ASCII text, it would flip to the most recently selected single-byte font, and when it found a byte above 127, it would flip to a double-byte font. This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones. This was wrong for Japanese, and made the half-width katakana unprintable. I found out that I could control fonts if I printed one character at a time with a special escape sequence, so I wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted. So a special subroutine is used for every name or address field. This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string.
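In Python terms, that special subroutine boils down to grouping a field into runs by the font each character needs. A rough sketch, again assuming Unicode strings and a Shift-JIS codec; the font names and the output commands are invented for illustration, and the real routine had to work byte by byte inside the 4GL:

    import itertools

    def font_for(ch):
        """Pick a font for one character of a field that came in as Shift-JIS."""
        if u'\uFF61' <= ch <= u'\uFF9F':
            return 'HalfWidthKana'                  # half-width katakana
        if len(ch.encode('shift_jis')) == 2:
            return 'Mincho'                         # double-byte kanji/kana
        return 'Courier'                            # plain ASCII

    def print_field(text):
        """Return pseudo print commands, one font switch per run of characters."""
        commands = []
        for font, run in itertools.groupby(text, font_for):
            commands.append('SETFONT %s' % font)
            commands.append('SHOW %s' % ''.join(run))
        return commands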
After all of this, we had a working system and knew quite a bit about encodings. Then the curve ball arrived: User Defined Characters! It is not true to say that there are exactly 6879 characters in Japanese, any more than you can count the number of languages on the Indian sub-continent or the types of cheese in France. There are historical variations and they evolve. Some people's names got missed out, and others like to write a kanji in an unusual way. Others arrived from China, where they have more complex variants of the same characters. Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them. My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it, but not otherwise (the usual workaround is to spell their name phonetically in katakana). But our marketing people put their foot down.

A key factor is that Microsoft has 'extended the standard' a few times. First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard. This is well understood, and most encoding toolkits know what 'Code Page 932' is: Shift-JIS plus a few extra characters. Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters. They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts, so users are getting used to these being available. These are not in a standard. Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region.

The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server. This was also fun, as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine. Effectively, each company has its own custom encoding, a bit bigger than the standard. Clearly, none of these extensions would convert automatically to the other platforms. Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them. We decided that space for 128 User-Defined Characters would be allowed.

We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked! Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back. So the conversion problem needed less code than we thought. This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway).

We did, however, need some really impressive validation. When you input a name or address on any of the platforms, the system should ask:
(a) is it valid for my encoding?
(b) will it fit in the available field space on the other platforms?
(c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.?
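Pulling those three checks together, here is a rough sketch in Python. I am assuming a 'cp932' codec that maps the user-defined area into Unicode's private-use range, a 'convert' callable standing in for the Unilib wrapper, and a KNOWN_UDCS table holding the agreed code points - all the names are illustrative:

    KNOWN_UDCS = set()   # the agreed 128 user-defined characters, e.g. u'\ue000'

    def validate_field(raw_sjis, convert, max_dest_bytes):
        """Return a list of problems with one name/address field (empty = ok)."""
        # (a) is it valid for my encoding?
        try:
            text = raw_sjis.decode('cp932')
        except UnicodeError:
            return ['not valid Shift-JIS / CP932']
        problems = []
        # (b) will it fit in the available field space on the other platform?
        if len(convert(raw_sjis)) > max_dest_bytes:
            problems.append('too long for the destination field')
        # (c) if it contains user-defined characters, are they ones we know about?
        for ch in text:
            if u'\ue000' <= ch <= u'\uf8ff' and ch not in KNOWN_UDCS:
                problems.append(u'unknown user-defined character %r' % ch)
        return problems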
Finally, we got back to the display problems. Our chosen range had a particular first byte. We built a miniature font with the characters we needed, starting in the lower half of the code page. I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'. This worked beautifully - not only could we print everything, we were using Type 1 embedded fonts for the user-defined characters, so we could distill it and also capture it for our internal document imaging systems.

So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms. I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web. The first stage was to prove it all worked. With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML. There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft kanji and our user-defined characters won't work on the web. You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative. Certain customers will need to be warned, or asked to suggest a standard version of a character, if they want to see their name on the web. I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune.

Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'. The language strings are all kept in UTF8 in XML files, so the same file can hold many languages. This and the database are the real-world reasons why you want to store stuff in UTF8. There are very few tools to let you view UTF8, but luckily there is a free word processor that lets you type Japanese and save it in any encoding, so we can cut and paste between Shift-JIS and UTF8 as needed.

And that's it. No climactic endings and a lot of real-world mess, just like life in IT. But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with. See my other mail for actual suggestions.

- Andy Robinson

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.