Internationalization Toolkit

Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to Java's byte array - in other words, an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already. To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back.

Easy Conversions
----------------
This is modelled on Java, which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered whether conversion happens in a global function, a constructor method or whatever.

    MyUniString = ToUnicode('hello')                                 # assumes ASCII
    MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS')  # specified

The converse applies when converting back. The encoding designators should agree with Java's. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly:

1. replace offending characters with a question mark
2. try to recover intelligently (possible in some cases)
3. raise an exception

A 'Unicode' designator is needed which performs a dummy conversion.

File Opening:
---------------
It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally encoded addresses into a Unicode string, e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file.
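The conversion API and error strategies described under "Easy Conversions" could be sketched like this in modern Python terms (the names ToUnicode/FromUnicode come from the proposal above; the strategy argument and its mapping onto Python's codec error handlers are my assumptions, and strategy 2, intelligent recovery, is encoding-specific and omitted):

```python
# Hedged sketch of the proposed conversion API, not an existing one.
# Strategies map onto Python's codec error handlers.

def ToUnicode(data, encoding='ascii', strategy='raise'):
    """Convert a byte string to a Unicode string.

    strategy: 'raise'   -- raise an exception on invalid data
              'replace' -- substitute a replacement character
    """
    errors = {'raise': 'strict', 'replace': 'replace'}[strategy]
    return data.decode(encoding, errors)

def FromUnicode(text, encoding='ascii', strategy='raise'):
    """Convert a Unicode string back to bytes in the given encoding."""
    errors = {'raise': 'strict', 'replace': 'replace'}[strategy]
    return text.encode(encoding, errors)
```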
Not sure of a good API to do this. We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open'.

Doing the Conversions
----------------------------
All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, and many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good, solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb, but there are compact forms which can reduce these to a few kb and also make the data intelligible. It is obviously desirable to store stuff compactly if we can unpack it fast.

Typed Strings
----------------
When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like...
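A minimal sketch of the typed-string idea, with entirely hypothetical names (TypedString, EncodingError) and modern Python's bytes/str standing in for the proposed types:

```python
# Hypothetical sketch: a byte string that knows its encoding,
# refusing silent mixing of differently encoded data (cf. Fowler's
# Quantity pattern).  None of these names are a real API.

class EncodingError(ValueError):
    pass

class TypedString:
    def __init__(self, data, encoding):
        data.decode(encoding)          # validate on construction
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        if self.encoding != other.encoding:
            raise EncodingError('cannot mix %s and %s strings'
                                % (self.encoding, other.encoding))
        return TypedString(self.data + other.data, self.encoding)

    def convert(self, encoding):
        # explicit 'typecast': go through Unicode as the central point
        text = self.data.decode(self.encoding)
        return TypedString(text.encode(encoding), encoding)
```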
An EncodingError would be raised unless the string is compatible.

Going Deeper
----------------
The project I describe involved many more issues than just a straight conversion. I envisage an encodings package or module which power users could get at directly. We have to be able to answer the questions:

- 'Is string X a valid instance of encoding Y?'
- 'Is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can.
- 'Can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed?' This is a really useful utility.

More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another, and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data-driven. So we could have an Encoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding. So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and cp932ms'.

GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor windows are fine but console output is shown as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility.
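The "Going Deeper" questions lend themselves to small utilities; here is a sketch under the assumption that codecs for Y and Z exist (the function names are hypothetical, and real corruption analysis would need far more than this):

```python
# Hypothetical sketch of the reasoning utilities described above.

def is_valid(data, encoding):
    """Is byte string `data` a valid instance of `encoding`?"""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

def lossless(data, src, dst):
    """Can `data` go from encoding `src` to `dst` without loss?
    Returns (ok, list_of_characters_that_would_be_trashed)."""
    text = data.decode(src)
    lost = [ch for ch in text
            if ch.encode(dst, 'replace').decode(dst) != ch]
    return (not lost, lost)
```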
Verifiability and printing
-----------------------------
We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste. If it could save a page as HTML (just an encoding tag and data between <PRE> tags), then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings.

All the encoding stuff is ultimately a bit meaningless without a way to display a character. I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding-independent, and come with a whole bunch of mapping tables. So I can ask for the same Japanese font in any of about ten encodings - the font name is a combination of face name and encoding, and the font itself does the remapping. They make downloadable font packs for Acrobat 4.0 available for most languages now; these are good places to raid for building encoding databases. It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output for regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing and thus help demonstrate its own correctness (or make errors stick out like a sore thumb).

Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project.
But I believe we could define the standard for years to come. Furthermore, it would go a long way to making Python the corporate choice for data cleaning and transformation - territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.

Andy,

Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP, who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
The proposal seems reasonable to me.
(Together with a new Unicode regex engine by /F.)
This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"

[AMK]
The proposal seems reasonable to me.
Thanks. I really hope that this time we can move forward united...
It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet.
Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood.
Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes.
Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng

"AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes:
AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry

Barry A. Warsaw writes: (in relation to support for Unicode regexes)
I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_

Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is).
In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string containing 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE.
Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"

[Andrew M. Kuchling]
This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim

(a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious: ordinary pattern, ordinary string unicode pattern, unicode string but what about these? ordinary pattern, unicode string unicode pattern, ordinary string "coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F>

[Guido, on "a new Unicode regex engine by /F"]
No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim

Mark Hammond wrote:
Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though.
"U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
... Well almost... it depends on the current value of <default encoding>.
Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
Exactly the reason to avoid wchar_t.
History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now?
Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable-length format, when people seem to be doing fine with UCS-2?

Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own.

One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people could work with the values entirely in Python, PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there.
Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote: [MAL:]
Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whatever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG; was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude

Jean-Claude Wippler wrote:
UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/

[ Greg Stein]
The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>.
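Both halves of Tim's searching argument are easy to demonstrate (a sketch in modern Python; the sample characters are arbitrary choices of mine):

```python
# UTF-8 lead bytes never reappear as continuation bytes, so a plain
# byte-level search cannot start matching in the middle of a character.

utf8 = 'h\xe9llo'.encode('utf-8')   # 'h' + two bytes for é + 'llo'
assert utf8.find(b'l') == 3         # finds the real 'l', no false hit

# UCS-2 counterexample: U+4E41 is a CJK ideograph whose low byte is
# the ASCII code for 'A', so a byte-level search scores a bogus hit
# "in the middle" of a character.
ucs2 = '\u4e41'.encode('utf-16-be')  # b'\x4eA'
assert b'A' in ucs2
```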
No argument here.

Greg Stein <gstein@lyra.org> wrote:
Have you ever noticed how Python modules, packages, tools, etc, never define an import hook?
hey, didn't MAL use one in one of his mx kits? ;-)
I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
exactly. modes are evil. python is not perl. etc.
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F>

Fredrik Lundh wrote:
Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter.
But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals).
No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Marc writes:
Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants.

For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized.

I agree that having a default encoding that can be changed is a bad idea. It may make 3-line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread-specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do?

Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable.

Mark.
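The helper-function idea above (per-thread encoding kept out of the interpreter core, in thread-local state at the Python level) might look like this sketch; the names are hypothetical, not a real API:

```python
# Hypothetical sketch: a user-level per-thread encoding, so the
# core's default encoding can stay fixed.
import threading

_state = threading.local()

def set_thread_encoding(encoding):
    """Set the encoding used by this thread only."""
    _state.encoding = encoding

def thread_encode(text):
    """Encode text using this thread's encoding (default: UTF-8)."""
    return text.encode(getattr(_state, 'encoding', 'utf-8'))
```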
______________________________________________________________________

Tim Peters wrote:
See my other post on the subject...

Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings represented in Unicode.

I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write

    file.write(u.encode('utf-8'))

instead of

    file.write(u) # let's hope this goes out as UTF-8...

Using the <default encoding> as a site-dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable.

-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm European, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.
By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1
Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either or good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core.

Tim Peters wrote:
Agreed.
Right.
Ditto :-) I have nothing against telling people to take care of the problem in user space (meaning: not done by the core interpreter), and I'm pretty sure that HP will agree on this too, provided we give them the proper user-space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Mark Hammond wrote:
I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it, and if you do, you are on your own when something breaks. Think of it as an HP-specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Really - I see it as moving to a rational consensus that doesn't support the proposal in this regard. I see no heat in it at all. I'm sorry if you saw my post or any of the followups as "emotional", but I'm certainly not getting passionate about this. I don't see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because we are trying to get it right doesn't mean we are getting heated.
Hrm - I'm having serious trouble following your logic here. If I make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I won't necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP:
That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely disappointed if the fact that HP are throwing money at this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this.

Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference.

Maybe I'm missing something. Can you explain why it is good enough for everyone else to be required to assume there is no default encoding, but HP get their thread-specific global? Are their requirements greater than anyone else's? Is everyone else not as important? What would you, as a consultant, recommend to people who aren't HP, but have a similar requirement? It would seem obvious to me that HP's requirement can be met in "pure Python", thereby keeping this out of the core altogether...

Mark.

[per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon

Mark Hammond wrote:
Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one.
Sure there are some very subtle dangers in setting the default to anything other than the default ;-) For some this risk may be worth taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them.
Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
hehe... funny you mention this. Go read the Consortium docs. Last time I read them, there were no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Sure, but have you considered the option of a member simply bailing out ? HP could always stop funding Unicode integration. That wouldn't help us either...
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen.
Cheers, -g -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/

On Thu, 11 Nov 1999, Mark Hammond wrote:
Ha! I was getting ready to say exactly the same thing. Are we building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-)
Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-)
*bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/
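[Editorial sketch] Greg's "keep it out of the interpreter" suggestion above can be sketched in a few lines of plain Python: a module-global dictionary maps thread ids to default encodings, and a utility function consults it. All names here (set_default_encoding, to_unicode) are hypothetical, not part of any proposed API; modern str/bytes spellings are used for illustration.

```python
# A sketch of Greg's suggestion: per-thread default encodings live in a
# plain module-global dictionary, not in the interpreter.
import threading

_defaults = {}          # maps thread id -> default encoding name
_FALLBACK = "ascii"     # the interpreter-wide default stays fixed

def set_default_encoding(name):
    """Set the default encoding for the calling thread only."""
    _defaults[threading.get_ident()] = name

def to_unicode(raw):
    """Decode a byte string using the calling thread's default."""
    encoding = _defaults.get(threading.get_ident(), _FALLBACK)
    return raw.decode(encoding)
```

A thread that calls set_default_encoding("shift_jis") changes only its own lookups; every other thread keeps the fixed fallback, which is exactly the containment the interpreter-global version cannot offer.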

On Thu, 11 Nov 1999, Greg Stein wrote:
What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david

Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote:
-- Greg Stein, http://www.lyra.org/

[/F]
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed?
[MAL]
Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim
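[Editorial sketch] Tim's "1M new code points" figure follows directly from the surrogate mechanism: a high surrogate and a low surrogate each carry 10 bits, so pairs address 1024 x 1024 supplementary characters. A quick check of the arithmetic:

```python
# Surrogate-pair arithmetic behind the "1M new code points" figure.
high_surrogates = 0xDBFF - 0xD800 + 1   # 1024 high (lead) surrogates
low_surrogates  = 0xDFFF - 0xDC00 + 1   # 1024 low (trail) surrogates
supplementary   = high_surrogates * low_surrogates
print(supplementary)                     # -> 1048576 code points beyond the BMP
```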

Tim Peters wrote:
If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M characters at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
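[Editorial sketch] MAL's proposal amounts to a construction-time assertion: accept the data as UCS-2 and raise as soon as a character falls outside the 16-bit range. A minimal illustration in modern Python (ucs2_check is a hypothetical name, not part of the proposal):

```python
# Treat input as UCS-2: raise at construction time for any character
# outside the 16-bit range, as MAL proposes.
def ucs2_check(text):
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "character U+%06X is outside the UCS-2 range" % ord(ch))
    return text
```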

[MAL]
I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly sped up by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better.
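[Editorial sketch] Tim's finger idea leans on one property of UTF-8: continuation bytes all match the bit pattern 10xxxxxx, so the first byte of each character is recognizable when scanning in either direction. A minimal sketch (Utf8Finger is a hypothetical name, not any proposed API):

```python
# Cache the last (char_index, byte_offset) pair so mostly-sequential
# indexing into a UTF-8 buffer is constant-time, scanning forward or
# backward from the finger as needed.
def is_lead(byte):
    return (byte & 0xC0) != 0x80   # anything but a continuation byte

class Utf8Finger:
    def __init__(self, data):
        self.data = data           # a UTF-8 encoded bytes object
        self.char, self.offset = 0, 0

    def byte_offset(self, index):
        """Byte offset of character `index`, scanning from the finger."""
        char, off = self.char, self.offset
        step = 1 if index >= char else -1
        while char != index:
            off += step
            while 0 < off < len(self.data) and not is_lead(self.data[off]):
                off += step        # skip continuation bytes
            char += step
        self.char, self.offset = char, off
        return off
```

For a buffer like "aüb€c" encoded as UTF-8, successive calls byte_offset(0), (1), (2), ... each only advance a character or two from the cached position, which is the common sequential-indexing case Tim describes.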

Tim Peters wrote:
Here are some arguments for using the proposed UTF-16 strategy instead:

· all characters have the same length; indexing is fast
· conversion APIs to platform-dependent wchar_t implementations are fast because they can either simply copy the content or extend the 2 bytes to 4
· UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages
· from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
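[Editorial sketch] The size trade-off in MAL's bullets is easy to check directly: "u with two dots" costs 2 bytes in UTF-8 (same as UTF-16), while plain ASCII doubles in size under UTF-16.

```python
# Byte sizes behind the UTF-8 vs UTF-16 arguments above.
u_umlaut = "\u00fc"                        # LATIN SMALL LETTER U WITH DIAERESIS
print(len(u_umlaut.encode("utf-8")))       # 2 bytes -- the "compound" case
print(len("hello".encode("utf-8")))        # 5 bytes -- identical to ASCII
print(len("hello".encode("utf-16-be")))    # 10 bytes -- twice the ASCII size
```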

<rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...)
RE engines etc. can then directly work with this buffer.
sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu).
(hey, I'm not a microsofter. but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F>

Fredrik Lundh wrote:
Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh wrote:
If we don't add that support, lots of existing APIs won't accept Unicode objects instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
[I've waited quite some time for you to chime in on this one ;-)] Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer: First, we have a general design question here: should old code become Unicode compatible or not. As I recall, the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come up with a reasonable approach for integrating classical string based APIs with the new type. Since UTF-8 is a standard (some would probably prefer UTF-7,5, e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose. However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is, up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using an 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text. So if we are to do the integration, we'll have to choose argument parser markers that allow for multi-byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is arguable. Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would lose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes. Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is.
Defining "s#" to return UTF-8 data not only makes "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16). If you are still with me, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). The other options of adding new magic parser markers or switching to more generic ones all have one downside: you need to change existing code, which is in conflict with the idea we started out with. So, again, the question is: do we want this magical integration or not ? Note that this is a design question, not one of memory consumption... -- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though:

    PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach:

    PyArg_ParseTuple(args,"O",&obj);
    unicode = PyUnicode_FromObject(obj);
    ...
    Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new reference if obj is a Unicode object or create a new Unicode object by interpreting str(obj) as a UTF-8 encoded string.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL]
Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:
I think you left out t#.
I've never heard of this idea before -- or am I taking it too literally? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide.
Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week; hopefully when I see the details, it'll be clear that they don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.)
Sure, but where in current Python are there such requirements?
I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#. Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes. Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation. Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best): - utf-8 - latin-1 - ascii - shift-jis - lower byte of unicode ordinal - some user- or os-specified multibyte encoding As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8. 
If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.)
This is a red herring given my explanation above.
This is for C code. Quite likely it *does* know what the internal data format is!
That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely.
(and t#, which is more relevant here)
Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding.
Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( )
No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue.
This might work. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined.
Fair enough :-)
It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at...
I know it's too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementation's point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something...
This definition should then be changed to "text string without null bytes" dropping the 8-bit reference.
Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal).
I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons.
The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.)
Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below).
C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length.
Done :-)
Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view.
t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing.
The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.)
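[Editorial sketch] Guido's encryption-engine example can be made concrete with a toy byte-stream engine: it neither knows nor cares whether its input bytes are a GIF, an MP3, or some encoded text, which is exactly why it must receive the raw bytes (s#), not a translated encoding (t#). The XOR "cipher" and the name xor_engine are purely illustrative.

```python
# A toy byte-stream "encryption engine": operates on raw bytes only,
# oblivious to what the bytes represent.
def xor_engine(data, key=0x5A):
    return bytes(b ^ key for b in data)

payload = "\u65e5\u672c".encode("utf-8")     # could just as well be GIF data
assert xor_engine(xor_engine(payload)) == payload   # round-trips exactly
```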
Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? (As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode.
If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes and check for a BOM and then decide to choose bewteen 'utf-16-be' and 'utf-16-le'.)
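[Editorial sketch] The BOM check Guido suggests is a two-byte peek followed by a choice between the two byte orders. A minimal sketch in modern Python (detect_utf16 is a hypothetical helper, not a proposed API):

```python
# Read the first 2 bytes, check for a BOM, then choose between
# 'utf-16-be' and 'utf-16-le' as Guido suggests.
def detect_utf16(data):
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")   # no BOM: assume big-endian
```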
Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means).
Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings.
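[Editorial sketch] The strict/non-strict split Guido describes maps directly onto an error-handling argument to the conversion, shown here with the modern spelling (the 1999 proposal's API may differ):

```python
# Strict vs. replace behavior for untranslatable characters.
text = "caf\u00e9"                        # 'é' is not encodable as ASCII
try:
    text.encode("ascii")                  # the strict version raises
except UnicodeEncodeError:
    pass
print(text.encode("ascii", "replace"))    # the non-strict version: b'caf?'
```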
Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits.
This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this.
Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway).
--Guido van Rossum (home page: http://www.python.org/~guido/)
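[Editorial sketch] Guido's byte-count vs. character-count point above is easy to see directly: allocation needs the byte length, and multibyte-aware code can recover the character count from the data whenever it wants it.

```python
# Byte count (what s#/t# would report) vs. character count.
s = "\u65e5\u672c\u8a9e"              # three characters ("Japanese")
utf8 = s.encode("utf-8")
print(len(utf8))                      # 9 bytes -- needed for allocation
print(len(utf8.decode("utf-8")))      # 3 characters -- recoverable on demand
```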

Guido van Rossum wrote:
Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if being passed Unicode objects: this probably won't get us the "magical" Unicode integration we envision, since "t#" usage is not very widespread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods...
Right, that's the idea (there is a note on this in the Standard Codec section of the proposal).
Ok.
Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal.
What for ? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation because both encodings are in fact "internal" w/r to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed.
Well... yes ;-)
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8-bits long. i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters.
Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, followed by the raw data. Done. It's even portable.
Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format).
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ]
I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation.
How about: "because I'm the application developer, and I say that I want the raw bytes in the file."
The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is: open('file', 'w').write(u) If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is: open('file', 'w').write(encode(u, 'utf-8')) This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-) Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic". Cheers, -g -- Greg Stein, http://www.lyra.org/
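[Editorial sketch] Greg's two cases above, with the byte-vs-text split spelled out in modern str/bytes terms (encode() here is the string method, not the free function Greg sketches, and the temp-file path is illustrative):

```python
# "I'm the application developer, and I say I want the raw bytes": write
# an explicitly chosen encoding, read the raw bytes back, no conversion
# getting in the way.
import os
import tempfile

u = "h\u00e9llo"
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "wb") as f:
    f.write(u.encode("utf-8"))        # explicit: I *want* UTF-8 on disk
with open(path, "rb") as f:
    round_tripped = f.read().decode("utf-8")
print(round_tripped == u)             # True
```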

Greg Stein wrote:
FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Good.
I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-)
You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just do LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/)

[MAL]
FYI, the next version of the proposal ... File objects opened in text mode will use "t#" and binary ones use "s#".
Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>.

On Wed, 17 Nov 1999, Tim Peters wrote:
It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/

[MAL]
File objects opened in text mode will use "t#" and binary ones use "s#".
[Greg Stein]
Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode.
'r' is already interpreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be.

On Wed, 17 Nov 1999, Tim Peters wrote:
Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/

Tim Peters wrote:
Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance.
I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. Only other buffer-compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself arguing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker.
It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down.
Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway.
I'm not writing portable scripts. I mentioned that once before. I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that respond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introducing text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg> I'm not writing portable scripts. I mentioned that once before. I
Greg> don't want a difference between 'r' and 'rb' on my Linux box. It
Greg> was never there before, I'm lazy, and I don't want to see it added
Greg> :-).
...
Greg> I just don't like introducing text/binary to *nix platforms where
Greg> it didn't exist before.

I'll vote with Greg, Guido's cross-platform conversion notwithstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other, harder to identify and correct, portability issues than binary file access. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability.

porta-nanny.py-anyone?-ly y'rs,

Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Greg Stein wrote:
Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-)
Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#", so you won't see any change in function or find subtle errors being introduced. If you are already using the buffer feature, e.g. for arrays, which implement "s#" but don't support "t#" for obvious reasons, you'll run into trouble -- but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't been published yet, there are probably very few users out there who would actually run into any problems. And even if they do, it's a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. E.g. the logic for processing "t#" could check if the text buffer is supported, and if not, fall back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users don't bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from "portable among all platforms" to "portable among all platforms and objects". Mark.

Mark Hammond wrote:
It's an error that's pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only affect platforms that don't differentiate between text and binary mode, e.g. Unix ones.
I doubt that this conforms to what the buffer interface wants to reflect: if the getcharbuf slot is not implemented, this means "I am not text". If you write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'.
Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this.
Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

>> FYI, the next version of the proposal ... File objects opened in
>> text mode will use "t#" and binary ones use "s#".

Tim> Am I the only one who sees magical distinctions between text and
Tim> binary mode as a Really Bad Idea?

No.

Tim> I wouldn't have guessed the Unix natives here would quietly
Tim> acquiesce to importing a bit of Windows madness <wink>.

We figured you and Guido would come to our rescue... ;-)

Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/)

Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite some places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length is that they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters.
Here I'm with you, man!
Greg Stein, http://www.lyra.org/
--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Stein writes:
[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

M.-A. Lemburg writes:
Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there, it's an "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Fred L. Drake, Jr. <fdrake@acm.org> wrote:
    from unicode import *

    def getname():
        # hidden in some database engine, or so...
        return unicode("Linköping", "iso-8859-1")

    ...

    name = getname()

    # emulate automatic conversion to utf-8
    name = str(name)

    # print it in uppercase, in the usual way
    import string
    print string.upper(name)

    ## LINKöPING

I don't know, but I think that I think that it perhaps should raise an exception instead... </F>

"Fred L. Drake, Jr." wrote:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode-aware.
If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization!
This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s").
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing script Unicode aware.
Ok, so I haven't read closely enough.
Right.
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal.
I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider: s = fp.read() u = unicode(s, 'utf-8') u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Tim Peters writes:
Yet another use for a weak reference <0.5 wink>.
Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-)
Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred]
This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. Buy him lunch tomorrow and get him excited <wink>.

Tim Peters writes:
Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr.]
Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate).
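For readers following this subthread: the facility being argued for here is what eventually shipped as the standard `weakref` module. A minimal illustration of the concept under discussion (modern Python, shown only to make it concrete; not part of the 1999 proposals):

```python
import weakref

class Node:
    """Toy object; weak references need a type that supports them."""
    pass

n = Node()
r = weakref.ref(n)      # does not add a "strong" reference to n
assert r() is n         # referent alive: calling the ref yields it
del n                   # drop the last strong reference
assert r() is None      # the weak ref did not keep the object alive
```

This matches Tim's point that the cost model matters: CPython's eventual design pays per-type (a slot in the type object and an extra pointer only in weakly-referenceable instances), not one pointer in every object.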

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
Bull! You can easily provide "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-)
I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, 13 Nov 1999, Guido van Rossum wrote:
Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, Tim Peters wrote:
No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...
Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time.
The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasable and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ): """ Q: How about using UCS-4 interfaces in my APIs? Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments. Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels. Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc. Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string. 
That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """
All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
See above.
"""
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligible, e.g. we could define the internal format to be UTF-16 and raise an exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-).

But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I can't make time for a close review now. Just one thing that hit my eye early:

Python should provide a built-in constructor for Unicode strings which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])

    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence:

    octet:          hexdigit hexdigit
    unicodecode:    octet octet
    unicode_escape: "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open.

BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.

international-in-spite-of-himself-ly y'rs - tim
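Tim's closing questions got straightforward answers in the design that eventually shipped, shown here with modern spellings for illustration (`unichr` was the Python 1.6/2.x name for what is `chr` below):

```python
# ord() accepts a one-character Unicode string and returns its code point;
# chr() (historically unichr()) is its inverse.
assert ord(u"\u20ac") == 0x20AC           # EURO SIGN
assert chr(0x20AC) == u"\u20ac"           # the inverse of ord()
assert chr(ord(u"\u1234")) == u"\u1234"   # round trip
```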

Tim Peters wrote:
unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...)
good idea. and for some reason, patches for this are included in the unicode distribution (see the attached str2utf.c).
I vote for 'outlaw'. </F>

    /* A small code snippet that translates \uxxxx syntax to UTF-8 text.
       To be cut and pasted into Python/compile.c */

    /* Written by Fredrik Lundh, January 1999. */

    /* Documentation (for the language reference):

       \uxxxx -- Unicode character with hexadecimal value xxxx. The
       character is stored using UTF-8 encoding, which means that this
       sequence can result in up to three encoded characters.

       Note that the 'u' must be followed by four hexadecimal digits. If
       fewer digits are given, the sequence is left in the resulting
       string exactly as given. If more digits are given, only the first
       four are translated to Unicode, and the remaining digits are left
       in the resulting string. */

    #define Py_CHARMASK(ch) ch

    void convert(const char *s, char *p)
    {
        while (*s) {
            if (*s != '\\') {
                *p++ = *s++;
                continue;
            }
            s++;
            switch (*s++) {

            /* ---------------------------------------------------------- */
            /* copy this section to the appropriate place in compile.c... */

            case 'u':
                /* \uxxxx => UTF-8 encoded unicode character */
                if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                    isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                    /* fetch hexadecimal character value */
                    unsigned int n, ch = 0;
                    for (n = 0; n < 4; n++) {
                        int c = Py_CHARMASK(*s);
                        s++;
                        ch = (ch << 4) & ~0xF;
                        if (isdigit(c))
                            ch += c - '0';
                        else if (islower(c))
                            ch += 10 + c - 'a';
                        else
                            ch += 10 + c - 'A';
                    }
                    /* store as UTF-8 */
                    if (ch < 0x80)
                        *p++ = (char) ch;
                    else {
                        if (ch < 0x800) {
                            *p++ = 0xc0 | (ch >> 6);
                            *p++ = 0x80 | (ch & 0x3f);
                        } else {
                            *p++ = 0xe0 | (ch >> 12);
                            *p++ = 0x80 | ((ch >> 6) & 0x3f);
                            *p++ = 0x80 | (ch & 0x3f);
                        }
                    }
                    break;
                } else
                    goto bogus;

            /* ---------------------------------------------------------- */

            default:
            bogus:
                *p++ = '\\';
                *p++ = s[-1];
                break;
            }
        }
        *p++ = '\0';
    }

    main()
    {
        int i;
        unsigned char buffer[100];
        convert("Link\\u00f6ping", buffer);
        for (i = 0; buffer[i]; i++)
            if (buffer[i] < 0x20 || buffer[i] >= 0x80)
                printf("\\%03o", buffer[i]);
            else
                printf("%c", buffer[i]);
    }

[/F, dripping with code]
Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards]
The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).

Tim Peters wrote:
I second that.
No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates).
Good points. The conversion goes as follows:

- for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as a Unicode ordinal;
- for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead.
Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-)
BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).
Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard: u = unicode('...string with UTF-8 encoded characters...','utf-8') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Perfect! [about "raw" Unicode strings]
Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventually become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever.
That's what I figured, and thanks for the confirmation.

Tim Peters wrote:
Thanks :-)
Agreed... note that you could also write your own codec for just this reason and then use:

    u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

    u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.
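MAL's suggested helper can be sketched with the codec name Python ended up shipping, `raw_unicode_escape`; the `ur` function itself is hypothetical, shown only to make the idea concrete:

```python
def ur(s):
    # Hypothetical helper in the spirit of MAL's suggestion: decode only
    # the \uXXXX escapes, leaving every other backslash sequence alone.
    # 'raw_unicode_escape' is the codec name Python eventually adopted.
    return s.encode("latin-1").decode("raw_unicode_escape")

assert ur(r"...\u4545...") == u"...\u4545..."
assert ur(r"\n") == u"\\n"   # other escapes pass through untouched
```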
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on raw Unicode strings]
Well, not quite. In general you need to pass raw strings:

    u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
                ^
    u = ur(r'...\u4545...\...\...')
           ^

else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion).
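Tim's \b example, made concrete (illustrative, using the standard re module):

```python
import re

# Without the raw prefix, Python's literal processing consumes the
# backslash before the regexp engine ever sees it.
assert len("\b") == 1    # one character: backspace (U+0008)
assert len(r"\b") == 2   # two characters: backslash + 'b'

# As a pattern, r"\b" is a word-boundary assertion...
assert re.search(r"\bword\b", "a word here") is not None
# ...while "\b" is a literal backspace character, which never matches here.
assert re.search("\b", "a word here") is None
```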

Tim Peters <tim_one@email.msn.com> wrote:
(\b is supposed to be a word boundary assertion).
in some places, that is. </F> Main Entry: reg·u·lar Pronunciation: 're-gy&-l&r, 're-g(&-)l&r 1 : belonging to a religious order 2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ... 3 a : ORDERLY, METHODICAL <regular habits> ... 4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ...

Tim Peters wrote:
Right. Here is a sample implementation of what I had in mind:

    """ Demo for 'unicode-escape' encoding. """

    import struct, string, re

    pack_format = '>H'

    def convert_string(s):
        l = map(None, s)
        for i in range(len(l)):
            l[i] = struct.pack(pack_format, ord(l[i]))
        return l

    u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

    def unicode_unescape(s):
        l = []
        start = 0
        while start < len(s):
            m = u_escape.search(s, start)
            if not m:
                l[len(l):] = convert_string(s[start:])
                break
            m_start, m_end = m.span()
            if m_start > start:
                l[len(l):] = convert_string(s[start:m_start])
            hexcode = m.group(1)
            #print hexcode, start, m_start
            if len(hexcode) != 4:
                raise SyntaxError, 'illegal \\uXXXX sequence: \\u%s' % hexcode
            ordinal = string.atoi(hexcode, 16)
            l.append(struct.pack(pack_format, ordinal))
            start = m_end
        #print l
        return string.join(l, '')

    def hexstr(s, sep=''):
        return string.join(map(lambda x, hex=hex, ord=ord: '%02x' % ord(x), s), sep)

-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>:

    def unicode_unescape(s):
        from string import atoi
        import array
        i, n = 0, len(s)
        result = array.array('H')  # unsigned short, native order
        while i < n:
            ch = s[i]
            i = i+1
            if ch != "\\":
                result.append(ord(ch))
                continue
            if i == n:
                raise ValueError("string ends with lone backslash")
            ch = s[i]
            i = i+1
            if ch != "u":
                result.append(ord("\\"))
                result.append(ord(ch))
                continue
            hexchars = s[i:i+4]
            if len(hexchars) != 4:
                raise ValueError("\\u escape at end not followed by "
                                 "at least 4 characters")
            i = i+4
            for ch in hexchars:
                if ch not in "0123456789abcdefABCDEF":
                    raise ValueError("\\u" + hexchars + " contains "
                                     "non-hex characters")
            result.append(atoi(hexchars, 16))
        # print result
        return result.tostring()

Tim Peters wrote:
Right...
Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme, with the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061).
· all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.

Examples:

    u'abc'          -> U+0061 U+0062 U+0063
    u'\u1234'       -> U+1234
    u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

Now how should we define ur"abc\u1234\n" ... ?

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 44 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
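For reference, this scheme survives essentially unchanged into modern Python, where the u prefix on string literals is optional; a quick check (note the final escape decodes to the newline, U+000A):

```python
# Each literal decodes to a sequence of Unicode ordinals per the scheme above.
examples = {
    'abc': [0x61, 0x62, 0x63],
    '\u1234': [0x1234],
    'abc\u1234\n': [0x61, 0x62, 0x63, 0x1234, 0x000A],
}
for literal, ordinals in examples.items():
    assert [ord(c) for c in literal] == ordinals
print('all examples check out')
```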

[MAL]
Does that exclude ur"" strings? Not arguing either way, just don't know what all this means.
Same as before (scream if that's wrong).
· all existing defined Python escape sequences are interpreted as Unicode ordinals;
Same as before (ditto).
note that \xXXXX can represent all Unicode ordinals,
This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition?
and \OOO (octal) can represent Unicode ordinals up to U+01FF.
Same as before (ditto).
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.
Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes.
The last example is damaged (U+05c isn't legit). Other than that, these look the same as before.
Now how should we define ur"abc\u1234\n" ... ?
If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it).

Tim Peters wrote:
Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX).
The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan).
Corrected; thanks.
They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I did, but wasn't sure whether he was arguing for going the Java way...
So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/)

Would this definition be fine ?

    """
    u = ur'<raw-unicode-escape encoded Python string>'

    The 'raw-unicode-escape' encoding is defined as follows:

    · \uXXXX sequences represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd

    · all other characters represent themselves as Unicode ordinals (e.g. 'b' -> U+0062)
    """

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Yes. --Guido van Rossum (home page: http://www.python.org/~guido/)
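The rule just agreed on fits in a few lines of code; this is a hypothetical modern-Python rendering of the odd/even backslash counting (the function name is illustrative, not the actual codec):

```python
def raw_unicode_escape(s):
    """Sketch of the 'raw-unicode-escape' rule discussed above:
    \\uXXXX is expanded only when preceded by an odd number of
    backslashes; everything else passes through unchanged."""
    out = []
    i, n = 0, len(s)
    while i < n:
        if s[i] != '\\':
            out.append(s[i])
            i += 1
            continue
        # Count the run of consecutive backslashes.
        j = i
        while j < n and s[j] == '\\':
            j += 1
        nslash = j - i
        if nslash % 2 == 1 and s[j:j+1] == 'u':
            hexdigits = s[j+1:j+5]
            if len(hexdigits) != 4:
                raise ValueError(r'truncated \uXXXX escape')
            # Keep the paired backslashes, expand the final \u escape.
            out.append('\\' * (nslash - 1))
            out.append(chr(int(hexdigits, 16)))
            i = j + 5
        else:
            out.append('\\' * nslash)
            i = j
    return ''.join(out)
```

So `\u1234` becomes one character, `\\u1234` stays exactly as written, and `\\\u1234` becomes two backslashes plus a character, matching Guido's description.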

Tim Peters wrote:
It would be more consistent to use the Unicode ordinal (instead of interpreting the number as a UTF8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file or the Unicode book for that matter.
See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW.
BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.
Good points. How about

    uniord(u[:1]) --> Unicode ordinal number (32-bit)

    unichr(i) --> Unicode object for character i (provided it is 32-bit); ValueError otherwise

They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study). I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on Unicode chr() and ord()]
Indeed I do not!
OTOH, ord() could probably be extended to also work on Unicode objects.
I think it should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string.

Tim Peters wrote:
Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
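As a footnote, this is exactly how it settled in the long run: ord() became polymorphic and chr() grew to cover the full Unicode range (unichr survived only in Python 2). A quick modern-Python check:

```python
# ord() accepts any 1-character string, whatever the ordinal...
assert ord('A') == 65
assert ord('\u03c0') == 0x03C0      # Greek pi, per MAL's example
# ...and chr() is its inverse across the whole Unicode range.
assert chr(0x1F600) == '\U0001F600'
assert ord(chr(0x10FFFF)) == 0x10FFFF
print('ord/chr round-trip ok')
```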

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F>

Fredrik Lundh wrote:
Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

you should read the unicode.h file a bit more carefully:

    ...
    /* Unicode declarations.  Tweak these to match your platform */

    /* set this flag if the platform has "wchar.h", "wctype.h" and the
       wchar_t type is a 16-bit unsigned type */
    #define HAVE_USABLE_WCHAR_H

    #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

(this uses wchar_t, and also iswspace and friends)

    ...
    #else

    /* Use if you have a standard ANSI compiler, without wchar_t support.
       If a short is not 16 bits on your platform, you have to fix the
       typedef below, or the module initialization code will complain. */

(this maps iswspace to isspace, for 8-bit characters).

    #endif
    ...

the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own unicode database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F>

Fredrik Lundh wrote:
Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16.
(iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive).
It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh writes:
I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes:

    Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object.

keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F>

Fredrik Lundh wrote:
It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

"SM" == Skip Montanaro <skip@mojam.com> writes:
SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry

Andy,

Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
The proposal seems reasonable to me.
(Together with a new Unicode regex engine by /F.)
This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"

[AMK]
The proposal seems reasonable to me.
Thanks. I really hope that this time we can move forward united...
It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet.
Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood.
Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes.
Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng

"AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes:
AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry

Barry A. Warsaw writes: (in relation to support for Unicode regexes)
I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_

Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is).
In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string containing 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE.
Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"

[Andrew M. Kuchling]
This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim

(a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious:

    ordinary pattern, ordinary string
    unicode pattern, unicode string

but what about these?

    ordinary pattern, unicode string
    unicode pattern, ordinary string

"coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F>
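For what it's worth, modern Python's re eventually answered this question by refusing to coerce at all; a small check of the behavior that shipped:

```python
import re

# Matching like with like works for both string flavors...
assert re.search(r'\w+', 'abc').group() == 'abc'
assert re.search(rb'\w+', b'abc').group() == b'abc'

# ...but mixing a text pattern with a bytes string (or vice versa)
# raises TypeError rather than recompiling the pattern on demand.
try:
    re.search(r'\w+', b'abc')
except TypeError:
    print('mixed pattern/string rejected')
```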

[Guido, on "a new Unicode regex engine by /F"]
No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim

Mark Hammond wrote:
Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though.
"U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
... Well almost... it depends on the current value of <default encoding>.
Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed. Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either... I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
Exactly the reason to avoid wchar_t.
History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now?
Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable length format, when people seem to be doing fine with UCS-2? Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own. One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people can work with the values entirely in Python, PyUnicodeType-be-damned. Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; that the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice-versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there.
Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote: [MAL:]
Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whatever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG, was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude

Jean-Claude Wippler wrote:
UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
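The CJK cost MAL mentions can be measured directly; a small illustration (using modern codecs for convenience):

```python
ascii_text = 'hello'
cjk_text = '\u65e5\u672c\u8a9e'   # "Japanese" written in kanji

# UTF-8: 1 byte per ASCII character, but 3 bytes per CJK character...
assert len(ascii_text.encode('utf-8')) == 5
assert len(cjk_text.encode('utf-8')) == 9
# ...while a fixed 2-byte format charges every character 2 bytes flat.
assert len(cjk_text.encode('utf-16-be')) == 6
print('utf-8 vs 2-byte:', len(cjk_text.encode('utf-8')),
      len(cjk_text.encode('utf-16-be')))
```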

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/

[ Greg Stein]
The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>.
No argument here.
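Tim's self-synchronization point can be demonstrated concretely; a sketch contrasting a naive byte-level search over UTF-8 with one over a 2-byte encoding:

```python
# UTF-8 lead bytes never appear in continuation position, so a plain
# byte search cannot start matching in the middle of a character.
text = 'héllo wörld'.encode('utf-8')
assert b'w\xc3\xb6rld' in text        # finds 'wörld' safely

# In UCS-2/UTF-16, nothing distinguishes a "first" byte: the bytes of
# 'A' in big-endian UCS-2 (00 41) show up here straddling two adjacent
# characters, producing a false hit.
ucs2 = '\u4100\u4142'.encode('utf-16-be')   # bytes: 41 00 41 42
assert b'\x00\x41' in ucs2                  # hit starting mid-character
```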

Greg Stein <gstein@lyra.org> wrote:
Have you ever noticed how Python modules, packages, tools, etc, never define an import hook?
hey, didn't MAL use one in one of his mx kits? ;-)
I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
exactly. modes are evil. python is not perl. etc.
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F>

Fredrik Lundh wrote:
Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter.
But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals).
No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Marc writes:
Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants. For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized. I agree that having a default encoding that can be changed is a bad idea. It may make 3 line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do? Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable. Mark.

Tim Peters wrote:
See my other post on the subject... Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings represented in Unicode. I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write file.write(u.encode('utf-8')) instead of file.write(u) # let's hope this goes out as UTF-8... Using the <default encoding> as a site-dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm European, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.
By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1
Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core.

Tim Peters wrote:
Agreed.
Right.
Ditto :-) I have nothing against telling people to take care of the problem in user space (meaning: not done by the core interpreter) and I'm pretty sure that HP will agree on this too, provided we give them the proper user space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Mark Hammond wrote:
I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it and if you do you are on your own when something breaks. Think of it as a HP specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Really - I see it as moving to a rational consensus that doesn't support the proposal in this regard. I see no heat in it at all. I'm sorry if you saw my post or any of the followups as "emotional", but I'm certainly not getting passionate about this. I don't see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because we are trying to get it right doesn't mean we are getting heated.
Hrm - I'm having serious trouble following your logic here. If I make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I won't necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP:
That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely disappointed if the fact that HP are throwing money at this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this. Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference. Maybe I'm missing something. Can you explain why it is good enough for everyone else to be required to assume there is no default encoding, but HP get their thread-specific global? Are their requirements greater than anyone else's? Is everyone else not as important? What would you, as a consultant, recommend to people who aren't HP, but have a similar requirement? It would seem obvious to me that HP's requirement can be met in "pure Python", thereby keeping this out of the core altogether... Mark.

[per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon

Mark Hammond wrote:
Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one.
Sure there are some very subtle dangers in setting the default to anything other than the default ;-) For some this risk may be worth taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them.
Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
hehe... funny you mention this. Go read the Consortium docs. Last time that I read them, there are no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Sure, but have you considered the option of a member simply bailing out ? HP could always stop funding Unicode integration. That wouldn't help us either...
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen.
Cheers, -g -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/

On Thu, 11 Nov 1999, Mark Hammond wrote:
Ha! I was getting ready to say exactly the same thing. Are we building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-)
Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-)
*bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/
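Greg's user-space alternative can be sketched in a few lines of modern Python (the helper names are invented for illustration, and `threading.get_ident` is today's spelling of the thread-id lookup, not the 1999 API):

```python
# Keep per-thread default encodings in an ordinary module-global
# dictionary, entirely outside the interpreter core.
import threading

_thread_encodings = {}  # thread id -> default encoding name

def set_thread_encoding(name):
    _thread_encodings[threading.get_ident()] = name

def encode_default(u):
    # Fall back to a fixed default when the thread never set one.
    enc = _thread_encodings.get(threading.get_ident(), "utf-8")
    return u.encode(enc)
```

Any "customer" wanting thread-specific behavior calls `set_thread_encoding` once per thread; everyone else never sees a mode.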

On Thu, 11 Nov 1999, Greg Stein wrote:
What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david

Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote:
-- Greg Stein, http://www.lyra.org/

[/F]
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed?
[MAL]
Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim
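For reference, the arithmetic behind Tim's "1M new code points" figure (this check is mine, not from the thread): UTF-16 surrogate pairs combine one of 1024 high surrogates with one of 1024 low surrogates.

```python
# Surrogate ranges defined by the UTF-16 scheme.
high = 0xDBFF - 0xD800 + 1   # 1024 high (leading) surrogates
low  = 0xDFFF - 0xDC00 + 1   # 1024 low (trailing) surrogates
assert high * low == 1_048_576  # ~1M supplementary code points
```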

Tim Peters wrote:
If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M characters at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
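A minimal sketch (mine, not MAL's actual patch) of the construction-time check he proposes: accept anything in the UCS-2 range, raise for anything that would need a surrogate pair.

```python
# Reject characters outside the Basic Multilingual Plane at
# construction time, per the "UTF-16 as if it were UCS-2" strategy.
def check_ucs2(text):
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character outside the UCS-2 range: %r" % ch)
    return text
```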

[MAL]
I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better.
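Tim's "finger" can be sketched as a toy class in modern Python (this illustration is mine; the class and method names are invented). It caches the byte offset of the most recently indexed character, so sequential indexing is amortized constant-time, and it scans backward as well as forward because UTF-8 start bytes are self-identifying.

```python
class Utf8Finger:
    """Index a UTF-8 buffer by character, with an index+offset finger."""

    def __init__(self, data: bytes):
        self.data = data
        self.char_pos = 0   # character index the finger points at
        self.byte_pos = 0   # byte offset of that character

    def _is_start(self, b):
        # Continuation bytes look like 10xxxxxx; everything else starts
        # a character, so scanning works in either direction.
        return (b & 0xC0) != 0x80

    def char_at(self, index):
        # Walk the finger one character at a time toward `index`.
        while self.char_pos < index:
            self.byte_pos += 1
            while (self.byte_pos < len(self.data)
                   and not self._is_start(self.data[self.byte_pos])):
                self.byte_pos += 1
            self.char_pos += 1
        while self.char_pos > index:
            self.byte_pos -= 1
            while not self._is_start(self.data[self.byte_pos]):
                self.byte_pos -= 1
            self.char_pos -= 1
        end = self.byte_pos + 1
        while end < len(self.data) and not self._is_start(self.data[end]):
            end += 1
        return self.data[self.byte_pos:end].decode("utf-8")
```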

Tim Peters wrote:
Here are some arguments for using the proposed UTF-16 strategy instead:

- all characters have the same length; indexing is fast
- conversion APIs to platform-dependent wchar_t implementations are fast because they can either simply copy the content or widen the 2 bytes to 4 bytes
- UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages
- from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
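The space trade-off MAL cites is easy to verify in today's Python (this check is not from the thread): UTF-8 is half the size of UTF-16 for ASCII, and a Latin-1 character like ü costs 2 bytes in both.

```python
# Encoded sizes of ASCII vs. Latin-1 sample text.
s_ascii = "hello"
s_latin = "über"   # one non-ASCII Latin-1 character

assert len(s_ascii.encode("utf-8")) == 5       # 1 byte per char
assert len(s_ascii.encode("utf-16-be")) == 10  # always 2 bytes per char
assert len(s_latin.encode("utf-8")) == 5       # "ü" costs 2 bytes
assert len(s_latin.encode("utf-16-be")) == 8
```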

<rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...)
RE engines etc. can then directly work with this buffer.
sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu).
(hey, I'm not a microsofter. but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F>

Fredrik Lundh wrote:
Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh wrote:
If we don't add that support, lots of existing APIs won't accept Unicode objects instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:

First, we have a general design question here: should old code become Unicode compatible or not. As I recall, the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come up with a reasonable approach for integrating classical string based APIs with the new type.

Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose.

However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is, up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using an 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text.

So if we are to do the integration, we'll have to choose argument parser markers that allow for multi-byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is arguable.

Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would lose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes.

Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is.
Defining "s#" to return UTF-8 data does not only make "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). The other options of adding new magic parser markers or switching to a more generic one all have one downside: you need to change existing code, which is in conflict with the idea we started out with.

So, again, the question is: do we want this magical integration or not ? Note that this is a design question, not one of memory consumption...

-- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new reference if obj is a Unicode object or create a new Unicode object by interpreting str(obj) as a UTF-8 encoded string.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
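MAL's embedded-NULL point is easy to demonstrate in today's Python (the example is mine, not from the thread): UTF-16 puts a zero byte in every Latin-1 character, so handing such a buffer to NUL-terminated C string code would truncate it, while UTF-8 only produces a zero byte for U+0000 itself.

```python
# UTF-16 embeds NUL bytes in ordinary text; UTF-8 does not.
assert b"\x00" in "A".encode("utf-16-be")        # b'\x00A'
assert b"\x00" not in "Ahoj, naïve".encode("utf-8")
assert "\x00".encode("utf-8") == b"\x00"         # only NUL maps to NUL
```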

I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL]
Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:
I think you left out t#.
I've never heard of this idea before -- or am I taking it too literal? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide.
Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week; hopefully when I see the details, it'll be clear that they don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.)
Sure, but where in current Python are there such requirements?
I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes.

Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation.

Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8.
If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.)
This is a red herring given my explanation above.
This is for C code. Quite likely it *does* know what the internal data format is!
That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely.
(and t#, which is more relevant here)
Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding.
Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( )
No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue.
This might work. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined.
Fair enough :-)
It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at...
I know it's too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementation's point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something...
This definition should then be changed to "text string without null bytes" dropping the 8-bit reference.
Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal).
I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons.
The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.)
Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below).
C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length.
Done :-)
Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view.
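The ASCII-subset property Guido relies on can be spot-checked in today's Python (example mine, not from the thread): ASCII text encodes to the very same bytes in UTF-8 and in Shift-JIS, so 8-bit clean code passes it through untouched.

```python
# ASCII is a byte-for-byte subset of these multibyte encodings.
ascii_text = "plain ASCII text"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")
assert ascii_text.encode("shift-jis") == ascii_text.encode("ascii")
```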
t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing.
The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.)
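Guido's encryption-engine example can be sketched with a stand-in XOR "cipher" (illustrative only, mine, and not a real cipher): a byte-oriented transformation must see the string's raw bytes and hand back the same bytes, whatever representation they happen to use.

```python
# A byte transformation that neither knows nor cares what the bytes
# mean -- GIF data, MP3 data, or a raw Unicode buffer.
def xor_cipher(data: bytes, key: int = 0x5A) -> bytes:
    return bytes(b ^ key for b in data)

raw = "Grüße".encode("utf-16-le")      # stand-in for a native buffer
assert xor_cipher(xor_cipher(raw)) == raw  # round-trips any bytes
```

Feed it the raw (s#-style) bytes and you get the same representation back out; feed it a UTF-8 translation and you have silently encrypted something else.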
Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? (As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode.
If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes, check for a BOM, and then choose between 'utf-16-be' and 'utf-16-le'.)
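The BOM check Guido suggests can be sketched as follows (the `sniff_utf16` helper is invented for the example; codec names are the ones that eventually shipped):

```python
# Peek at the first two bytes of a UTF-16 file, look for a byte-order
# mark, and pick the matching endian-specific codec name.
def sniff_utf16(raw: bytes) -> str:
    if raw[:2] == b"\xff\xfe":
        return "utf-16-le"
    if raw[:2] == b"\xfe\xff":
        return "utf-16-be"
    raise ValueError("no BOM found; encoding must be given explicitly")

data = "abc".encode("utf-16")           # writes a native-order BOM first
codec = sniff_utf16(data)
assert data[2:].decode(codec) == "abc"  # decode the payload after the BOM
```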
Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means).
Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings.
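The strategies Guido lists map onto what later became codec error handlers; the sketch below uses today's names ('replace', 'strict'), and reading them as strategies 1 and 3 is my interpretation. Strategy 2 (intelligent recovery) is exactly what the extensible handler set leaves room for.

```python
# 1. Replace offending characters ('?' when encoding to a byte
#    encoding, U+FFFD when decoding):
assert "Link\u00f6ping".encode("ascii", "replace") == b"Link?ping"
assert b"Link\xffping".decode("ascii", "replace") == "Link\ufffdping"

# 3. Raise an exception (the strict flavor):
failed = False
try:
    b"Link\xffping".decode("ascii", "strict")
except UnicodeDecodeError:
    failed = True
assert failed
```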
Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits.
This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this.
Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway).
--Guido van Rossum (home page: http://www.python.org/~guido/)
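Guido's memory-allocation point above can be checked in one small example: for buffer sizing the *byte* length is what matters, and multibyte-aware code can always recount characters.

```python
s = "Link\u00f6ping"                # 9 characters
encoded = s.encode("utf-8")         # 'ö' takes two bytes in UTF-8

assert len(s) == 9                  # character count
assert len(encoded) == 10           # bytes to allocate
assert len(encoded.decode("utf-8")) == 9   # characters, recomputed
```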

Guido van Rossum wrote:
Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if passed Unicode objects: this probably won't get us the "magical" Unicode integration we envision, since "t#" usage is not very widespread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods...
Right, that's the idea (there is a note on this in the Standard Codec section of the proposal).
Ok.
Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal.
What for? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation, because both encodings are in fact "internal" with respect to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed.
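MAL's two claims here are easy to check: both encodings are lossless, and UTF-8 is more compact for ASCII data (one byte per character versus two).

```python
# Compactness for ASCII data:
assert len("hello".encode("utf-8")) == 5
assert len("hello".encode("utf-16-le")) == 10

# Both are lossless, so either works for round-tripping through a
# binary file:
u = "Link\u00f6ping"
for codec in ("utf-8", "utf-16-le"):
    assert u.encode(codec).decode(codec) == u
```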
Well... yes ;-)
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8 bits long, i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters.
Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, followed by the raw data. Done. It's even portable.
Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format).
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ]
I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation.
How about: "because I'm the application developer, and I say that I want the raw bytes in the file."
The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is:

    open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is:

    open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic".

Cheers, -g -- Greg Stein, http://www.lyra.org/
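Greg's preferred shape can be sketched with an in-memory "file" (the standalone `encode` function is the proposal's hypothetical API, emulated here with the string method): the Unicode object reaches the binary stream only after an explicit encode, so nothing happens behind the programmer's back.

```python
import io

def encode(u: str, name: str) -> bytes:
    # Stands in for the proposed global encode() function.
    return u.encode(name)

f = io.BytesIO()                         # plays the role of open('file', 'w')
f.write(encode("Link\u00f6ping", "utf-8"))
assert f.getvalue() == b"Link\xc3\xb6ping"   # exactly the UTF-8 bytes asked for
```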

Greg Stein wrote:
FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Good.
I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-)
You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just do LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/)
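The reason Guido's observation holds: UTF-8 never uses byte 0x0A inside a multibyte sequence (lead and continuation bytes are all >= 0x80), so LF -> CRLF translation cannot corrupt the encoding. A quick check:

```python
sample = "Link\u00f6ping \u20ac\nsecond line\n"
encoded = sample.encode("utf-8")

# Every 0x0A byte in the encoding is a real newline character:
assert encoded.count(b"\n") == sample.count("\n")

# Simulated text-mode translation round-trips cleanly:
translated = encoded.replace(b"\n", b"\r\n")
assert translated.replace(b"\r\n", b"\n").decode("utf-8") == sample
```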

[MAL]
FYI, the next version of the proposal ... File objects opened in text mode will use "t#" and binary ones use "s#".
Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>.

On Wed, 17 Nov 1999, Tim Peters wrote:
It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/

[MAL]
File objects opened in text mode will use "t#" and binary ones use "s#".
[Greg Stein]
Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode.
'r' is already interpreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be.

On Wed, 17 Nov 1999, Tim Peters wrote:
Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/

Tim Peters wrote:
Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance.
I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. Only other buffer compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself arguing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker.
It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down.
Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway.
I'm not writing portable scripts. I mentioned that once before. I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that respond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introducing text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg> I'm not writing portable scripts. I mentioned that once before. I Greg> don't want a difference between 'r' and 'rb' on my Linux box. It Greg> was never there before, I'm lazy, and I don't want to see it added Greg> :-). ... Greg> I just don't like introducing text/binary to *nix platforms where Greg> it didn't exist before. I'll vote with Greg, Guido's cross-platform conversion notwithstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other harder to identify and correct portability issues than binary file access anyway. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability. porta-nanny.py-anyone?-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Greg Stein wrote:
Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-)
Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#", so you won't see any change in function or find subtle errors being introduced. If you are already using the buffer feature for e.g. arrays, which also implement "s#" but don't support "t#" for obvious reasons, you'll run into trouble, but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't been published yet, there are probably very few users out there who would actually run into any problems. And even if they do, it's a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. eg, the logic for processing "t#" could check if the text buffer is supported, and if not move back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users don't bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from simply "portable among all platforms" to "portable among all platforms and objects". Mark.

Mark Hammond wrote:
It's an error that's pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only affect platforms that don't differentiate between text and binary mode, e.g. Unix ones.
I doubt that this conforms to what the buffer interface wants to reflect: if the getcharbuf slot is not implemented, this means "I am not text". If you would write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'.
Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this.
Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

>> FYI, the next version of the proposal ... File objects opened in >> text mode will use "t#" and binary ones use "s#". Tim> Am I the only one who sees magical distinctions between text and Tim> binary mode as a Really Bad Idea? No. Tim> I wouldn't have guessed the Unix natives here would quietly Tim> acquiesce to importing a bit of Windows madness <wink>. We figured you and Guido would come to our rescue... ;-) Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/)
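The wrapper Guido describes can be sketched as below (the name `uopen` is invented for the example; the real codecs.open that later shipped does essentially this): when an encoding is in play, 'b' is appended to the mode and a codec owns the bytes.

```python
import codecs, os, tempfile

def uopen(filename, mode="r", encoding=None):
    # open(filename, mode, encoding): add 'b' if needed, as Guido suggests.
    if encoding is None:
        return open(filename, mode)
    if "b" not in mode:
        mode += "b"            # force binary; the codec handles the bytes
    return codecs.open(filename, mode, encoding=encoding)

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with uopen(path, "w", "utf-16") as f:      # 'w' silently becomes 'wb'
    f.write("Link\u00f6ping")
with uopen(path, "r", "utf-16") as f:
    assert f.read() == "Link\u00f6ping"
```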

Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite a few places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length is that they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters.
Here I'm with you, man!
Greg Stein, http://www.lyra.org/
--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Stein writes:
[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

M.-A. Lemburg writes:
Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there it's an "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Fred L. Drake, Jr. <fdrake@acm.org> wrote:
    from unicode import *

    def getname():
        # hidden in some database engine, or so...
        return unicode("Linköping", "iso-8859-1")

    ...

    name = getname()

    # emulate automatic conversion to utf-8
    name = str(name)

    # print it in uppercase, in the usual way
    import string
    print string.upper(name)

    ## LINKöPING

I don't know, but I think that I think that it perhaps should raise an exception instead... </F>

"Fred L. Drake, Jr." wrote:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode aware.
If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization!
This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s").
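A Python-level sketch of the <defencbuf> behavior MAL describes (the class and method names are invented for illustration): the default-encoding buffer is built on first request and then kept for the lifetime of the object, so pointers handed out via "s" stay valid.

```python
class UniStr:
    # Toy model of a Unicode object with a lazily-filled UTF-8 buffer.
    def __init__(self, value: str):
        self._value = value
        self._defenc = None        # the cached default-encoding buffer

    def defenc(self) -> bytes:
        if self._defenc is None:   # filled only upon the first request
            self._defenc = self._value.encode("utf-8")
        return self._defenc

u = UniStr("Link\u00f6ping")
assert u._defenc is None           # nothing computed at construction
first = u.defenc()
assert first == b"Link\xc3\xb6ping"
assert u.defenc() is first         # same buffer until the object dies
```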
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode aware.
Ok, so I haven't read closely enough.
Right.
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal.
I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider:

    s = fp.read()
    u = unicode(s, 'utf-8')

u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Tim Peters writes:
Yet another use for a weak reference <0.5 wink>.
Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-)
Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred]
This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. Buy him lunch tomorrow and get him excited <wink>.

Tim Peters writes:
Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr.]
Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate).

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
Bull! You can easily provide "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-)
I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, 13 Nov 1999, Guido van Rossum wrote:
Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, Tim Peters wrote:
No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...
Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time.
The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges, which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasible and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):

"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels.

Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc.

Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string.
That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """
All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
See above.
"""
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligible, e.g. we could define the internal format to be UTF-16 and raise an exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-). But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
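The UCS-2/UTF-16 arithmetic behind MAL's quote can be checked in modern Python (hedged: surrogate pairs were not yet in use when this was written, but this is how UTF-16 ended up working): a BMP character is one 16-bit code unit, a character beyond U+FFFF needs a surrogate pair.

```python
# EURO SIGN (U+20AC) is in the BMP: one UTF-16 code unit, 2 bytes.
assert len("\u20ac".encode("utf-16-le")) == 2

# MUSICAL SYMBOL G CLEF (U+1D11E) is beyond the BMP: a surrogate
# pair, 4 bytes.
assert len("\U0001d11e".encode("utf-16-le")) == 4

# The pair encodes the scalar value per the UTF-16 rule:
hi, lo = divmod(0x1D11E - 0x10000, 0x400)
assert (0xD800 + hi, 0xDC00 + lo) == (0xD834, 0xDD1E)
```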

I can't make time for a close review now. Just one thing that hit my eye early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence:

    octet:          hexdigit hexdigit
    unicodecode:    octet octet
    unicode_escape: "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open.

BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.

international-in-spite-of-himself-ly y'rs - tim
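For what it's worth, Tim's ord() questions got answered the way he hoped (hedged: this shows the behavior that eventually settled in, not anything decided at the time of this message): ord() of a one-character Unicode string gives the code point, and its inverse exists.

```python
# ord() maps a one-character string to its Unicode ordinal...
assert ord("\u20ac") == 0x20AC

# ...and chr() (unichr() in Python 2) is its inverse:
assert chr(0x20AC) == "\u20ac"

# The Java-style \u escape denotes exactly one character:
assert "\u0061" == "a"
```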

Tim Peters wrote:
unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...)
good idea. and for some reason, patches for this are included in the unicode distribution (see the attached str2utf.c).
I vote for 'outlaw'. </F>

/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

   \uxxxx -- Unicode character with hexadecimal value xxxx.  The
   character is stored using UTF-8 encoding, which means that this
   sequence can result in up to three encoded characters.

   Note that the 'u' must be followed by four hexadecimal digits.  If
   fewer digits are given, the sequence is left in the resulting string
   exactly as given.  If more digits are given, only the first four are
   translated to Unicode, and the remaining digits are left in the
   resulting string. */

#define Py_CHARMASK(ch) ch

void convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:
        bogus:
            *p++ = '\\';
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

main()
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", buffer);

    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);
}
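The bit-twiddling in /F's snippet can be sanity-checked against Python's own UTF-8 codec. The following is a modern sketch of the same packing rules (BMP ordinals only, as in the C code above), not part of the original patch:

```python
# The same UTF-8 packing rules as the C snippet above (up to U+FFFF),
# checked against Python's built-in UTF-8 codec.
def utf8_bytes(ch):
    if ch < 0x80:                     # 1-byte form: 0xxxxxxx
        return bytes([ch])
    if ch < 0x800:                    # 2-byte form: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (ch >> 6), 0x80 | (ch & 0x3F)])
    # 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (ch >> 12),
                  0x80 | ((ch >> 6) & 0x3F),
                  0x80 | (ch & 0x3F)])

assert utf8_bytes(0x00F6) == '\u00f6'.encode('utf-8')  # the 'ö' in "Link\u00f6ping"
assert utf8_bytes(0x1234) == '\u1234'.encode('utf-8')
```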

[/F, dripping with code]
Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards]
The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).

Tim Peters wrote:
I second that.
No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates).
Good points. The conversion goes as follows:

· for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as Unicode ordinal

· for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead
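For what it's worth, these two rules are exactly what modern Python ended up doing, so they can be checked directly today (a hedged aside, not the 1999 implementation):

```python
# Rule 1: a single-character escape such as octal \341 gives the Unicode
# character with that ordinal (0xE1 -> U+00E1).
assert '\341' == '\u00e1'

# Rule 2: \uXXXX inserts exactly one character with ordinal 0xXXXX.
assert len('\u1234') == 1 and ord('\u1234') == 0x1234
```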
Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-)
BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).
Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 50 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[MAL]
Perfect! [about "raw" Unicode strings]
Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventually become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever.
That's what I figured, and thanks for the confirmation.

Tim Peters wrote:
Thanks :-)
Agreed... note that you could also write your own codec for just this reason and then use:

u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on raw Unicode strings]
Well, not quite. In general you need to pass raw strings:

    u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
                ^
    u = ur(r'...\u4545...\...\...')
           ^

else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion).
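Tim's \b pitfall is easy to demonstrate. This sketch uses the modern re module, which behaves the same way as the regexp engines under discussion here:

```python
import re

# In a non-raw literal, \b is a backspace character (ordinal 8) before
# the regexp engine ever sees it; only the raw form reaches re as \b.
assert len('\b') == 1 and ord('\b') == 8
assert len(r'\b') == 2

# The word-boundary assertion therefore needs the raw spelling:
assert re.search(r'\bword\b', 'a word here') is not None
assert re.search('\bword\b', 'a word here') is None   # backspaces never match
```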

Tim Peters <tim_one@email.msn.com> wrote:
(\b is supposed to be a word boundary assertion).
in some places, that is. </F>

Main Entry: reg·u·lar
Pronunciation: 're-gy&-l&r, 're-g(&-)l&r
1 : belonging to a religious order
2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ...
3 a : ORDERLY, METHODICAL <regular habits> ...
4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ...

Tim Peters wrote:
Right. Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):
    l = map(None,s)
    for i in range(len(l)):
        l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):
    l = []
    start = 0
    while start < len(s):
        m = u_escape.search(s,start)
        if not m:
            l[len(l):] = convert_string(s[start:])
            break
        m_start,m_end = m.span()
        if m_start > start:
            l[len(l):] = convert_string(s[start:m_start])
        hexcode = m.group(1)
        #print hexcode,start,m_start
        if len(hexcode) != 4:
            raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
        ordinal = string.atoi(hexcode,16)
        l.append(struct.pack(pack_format,ordinal))
        start = m_end
    #print l
    return string.join(l,'')

def hexstr(s,sep=''):
    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[MAL]
It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>:

def unicode_unescape(s):
    from string import atoi
    import array
    i, n = 0, len(s)
    result = array.array('H')  # unsigned short, native order
    while i < n:
        ch = s[i]
        i = i+1
        if ch != "\\":
            result.append(ord(ch))
            continue
        if i == n:
            raise ValueError("string ends with lone backslash")
        ch = s[i]
        i = i+1
        if ch != "u":
            result.append(ord("\\"))
            result.append(ord(ch))
            continue
        hexchars = s[i:i+4]
        if len(hexchars) != 4:
            raise ValueError("\\u escape at end not followed by "
                             "at least 4 characters")
        i = i+4
        for ch in hexchars:
            if ch not in "01234567890abcdefABCDEF":
                raise ValueError("\\u" + hexchars + " contains "
                                 "non-hex characters")
        result.append(atoi(hexchars, 16))
    # print result
    return result.tostring()

Tim Peters wrote:
Right...
Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme:

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

Now how should we define ur"abc\u1234\n" ... ?

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 44 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
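The 'unicode-escape' encoding sketched here survives in modern Python under the same name, so the examples can be checked directly (a hedged aside using the descendant codec, not the 1999 draft; the final ordinal comes out as U+000A, per Tim's correction downthread):

```python
# u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
s = b'abc\\u1234\\n'.decode('unicode_escape')
assert [hex(ord(c)) for c in s] == ['0x61', '0x62', '0x63', '0x1234', '0xa']
```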

[MAL]
Does that exclude ur"" strings? Not arguing either way, just don't know what all this means.
Same as before (scream if that's wrong).
· all existing defined Python escape sequences are interpreted as Unicode ordinals;
Same as before (ditto).
note that \xXXXX can represent all Unicode ordinals,
This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition?
and \OOO (octal) can represent Unicode ordinals up to U+01FF.
Same as before (ditto).
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.
Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes.
The last example is damaged (U+05c isn't legit). Other than that, these look the same as before.
Now how should we define ur"abc\u1234\n" ... ?
If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it).

Tim Peters wrote:
Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX).
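For the record, this corner of the design later changed again: modern Python settled on \x taking exactly two hex digits everywhere, with any further characters left as literal text (a hedged modern footnote, not what was decided in this thread):

```python
# \xAB consumes exactly two hex digits; 'CDq' stays literal text.
assert '\xABCDq' == '\xab' + 'CDq'

# In str literals \xAB names a Unicode ordinal, same as \u00AB.
assert '\xe9' == '\u00e9' and ord('\xe9') == 0xE9
```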
The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan).
Corrected; thanks.
They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/)
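Guido's even/odd rule survives in the 'raw_unicode_escape' codec that modern Python still ships, so it can be observed directly (a hedged check against the descendant codec, not Tim's original function):

```python
# Odd number of leading backslashes: \u0041 is expanded to 'A'.
assert b'\\u0041'.decode('raw_unicode_escape') == 'A'

# Even number: the backslashes and the 'u0041' come through literally.
assert b'\\\\u0041'.decode('raw_unicode_escape') == '\\\\u0041'
```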

Guido van Rossum wrote:
I did, but wasn't sure whether he was arguing for going the Java way...
So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/)

Would this definition be fine ?

"""
u = ur'<raw-unicode-escape encoded Python string>'

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinals (e.g. 'b' -> U+0062)
"""

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Yes. --Guido van Rossum (home page: http://www.python.org/~guido/)

Tim Peters wrote:
It would be more consistent to use the Unicode ordinal (instead of interpreting the number as a UTF-8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file, or the Unicode book for that matter.
See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW.
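The UCS2-vs-UTF16 distinction MAL alludes to is exactly the surrogate mechanism: UTF-16 can reach ordinals above U+FFFF, plain UCS-2 cannot. A small sketch of the standard arithmetic (textbook UTF-16 math, not code from this thread):

```python
# UTF-16 encodes ordinals above U+FFFF as a high/low surrogate pair.
def surrogate_pair(ch):
    assert 0x10000 <= ch <= 0x10FFFF
    ch -= 0x10000                      # 20 bits remain
    return 0xD800 + (ch >> 10), 0xDC00 + (ch & 0x3FF)

# U+10400 (DESERET CAPITAL LETTER LONG I) -> D801 DC00
assert surrogate_pair(0x10400) == (0xD801, 0xDC00)
```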
BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.
Good points. How about

    uniord(u[:1]) --> Unicode ordinal number (32-bit)
    unichr(i)     --> Unicode object for character i (provided it is 32-bit);
                      ValueError otherwise

They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study).

I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on Unicode chr() and ord()]
Indeed I do not!
OTOH, ord() could probably be extended to also work on Unicode objects.
I think it should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string.
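This is in fact how it turned out: Python 3 extended chr() and ord() to the full Unicode range and dropped unichr() entirely (a modern check, offered as a footnote to the thread):

```python
# ord() and chr() are inverses over the whole Unicode range.
assert ord('\u20ac') == 0x20AC and chr(0x20AC) == '\u20ac'   # EURO SIGN
assert chr(ord('\U0001F40D')) == '\U0001F40D'   # works above U+FFFF too
```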

Tim Peters wrote:
Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F>

Fredrik Lundh wrote:
Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

you should read the unicode.h file a bit more carefully:

...
/* Unicode declarations.  Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
   wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

(this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
   If a short is not 16 bits on your platform, you have to fix the
   typedef below, or the module initialization code will complain. */

(this maps iswspace to isspace, for 8-bit characters).

#endif
...

the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own unicode database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F>

Fredrik Lundh wrote:
Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16.
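MAL's claim that "only endianness gets in the way, but that's easy to handle" can be sketched with the struct module (a hedged illustration of the 16-bit code-unit idea, not the actual implementation under discussion):

```python
import struct, sys

# A 16-bit code unit for U+20AC in the two explicit byte orders...
assert struct.pack('>H', 0x20AC) == b'\x20\xac'   # big-endian
assert struct.pack('<H', 0x20AC) == b'\xac\x20'   # little-endian

# ...and in native order, which is what an internal buffer would use;
# only at the I/O boundary does the difference need handling.
native = struct.pack('=H', 0x20AC)
assert native == (b'\x20\xac' if sys.byteorder == 'big' else b'\xac\x20')
```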
(iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive).
It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh writes:
I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/
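The Byte Order Marks Greg mentions are just U+FEFF serialized in each byte order; modern Python exposes them as constants in the codecs module (a hedged modern aside, not part of the 1999 design):

```python
import codecs

# U+FEFF in each byte order; a reader uses these to detect endianness.
assert codecs.BOM_UTF16_BE == b'\xfe\xff'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'

# The generic utf-16 codec writes the native-order BOM automatically.
assert 'x'.encode('utf-16').startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```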

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object. keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environ- ment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F>

Fredrik Lundh wrote:
It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

"SM" == Skip Montanaro <skip@mojam.com> writes:
SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry
participants (15)

- Andrew M. Kuchling
- Andy Robinson
- Barry A. Warsaw
- David Ascher
- Fred L. Drake, Jr.
- Fredrik Lundh
- Gordon McMillan
- Greg Stein
- Guido van Rossum
- Jean-Claude Wippler
- Ka-Ping Yee
- M.-A. Lemburg
- Mark Hammond
- Skip Montanaro
- Tim Peters