[I18n-sig] Japanese commentary on the Pre-PEP (1 of 4)

Brian Takashi Hooper brian@tomigaya.shibuya.tokyo.jp
Tue, 20 Feb 2001 19:16:07 +0900


Hi there, this is Brian Hooper in Tokyo.

The proposed character model thread seems to have simmered down so I
don't know how interested people will be in this, but I gathered a few
comments about the Pre-PEP from the Japanese Python mailing list, and
translated the responses - I think there were some very good points
brought up, and I'd like to add the messages I received (with the
permission of their authors) to the discussion.  

I've got four messages to post; I'm not such a fast translator so I'll
post the two I have now, and the other two as I finish them.

Here is Atsuo Ishimoto's post - Ishimoto-san wrote and contributed the
CP 932 codec.

---

On Sun, 11 Feb 2001 20:18:51 +0900
Brian Takashi Hooper <brian@tomigaya.shibuya.tokyo.jp> wrote:

> Hi there,
> 
> What does everyone think of the Proposed Character Model?

I'm opposed to it in its present form.  Putting aside for the moment
any criticisms of Unicode itself, building extension modules for Python
would become more difficult and problematic (as Suzuki also pointed out).

For example, given:

PyObject *simple(PyObject *o, PyObject *args)
{
	char *filename;
	FILE *f;

	if (!PyArg_ParseTuple(args, "s", &filename))
		return NULL;
	f = fopen(filename, "w");
	if (!f) {
		PyErr_SetFromErrno(PyExc_IOError);
		return NULL;
	}
	fprintf(f, "spam");
	fclose(f);
	Py_INCREF(Py_None);
	return Py_None;
}

from Python you can write:

sample.simple("日本語ファイル名")

and it will work as-is on almost any platform and in almost any language
environment.  It works because in the present implementation of CPython, the
extension module treats the input string as plain data, passing it along to
the underlying OS or library without interpreting its contents.
 
However, consider the same extension module in the case where all character
sequences are handled by Python internally as Unicode.  PyArg_ParseTuple()
has no way of automatically knowing how to change Unicode characters with an
ordinal value greater than 0xff into the encoding currently supported on the
platform.  In this case, sample.simple("日本語ファイル名") results in an error.
At present, most of Python's extension modules can be used without having to
explicitly add CJK support - however, if this PEP is implemented, then most of
these modules will become unusable in their present form.

So, is there any solution for this?

Well, we could take care, when writing our Python scripts, to use only strings
that PyArg_ParseTuple() can handle without error.  There are two ways to do
this:

a. Use byte strings

Instead of using a character string, we could call our function as

sample.simple(b"日本語ファイル名")

and everything then works fine.  However, if we always have to use byte
strings when interacting with extension libraries, then we haven't achieved
any real improvement in terms of internationalization, and there's not much
point in implementing the PEP in that case...

b. Use an 8-bit character encoding such as ISO-8859-1

Suppose we pretend the string is ISO-8859-1 rather than Shift-JIS or EUC-JP
when creating it.  Since the value of ord() for each character in the string
is then always <= 255, PyArg_ParseTuple() will have no problem with it, but in
having to treat legacy-encoded data as a different encoding, we haven't really
made it easier to write programs which handle CJK data, or improved the
situation for i18n either.

It could be argued that Unicode strings could be used everywhere else, and
a. or b. above only when calling legacy code through extension modules such
as simple() above.  In that case, however, the programmer has to be aware of
whether the function they are calling is implemented in legacy C code or not,
which isn't really an improvement on the current state of things.  Moreover,
because converting to Unicode loses the information about the original string
encoding, automatically converting back to that original encoding (for
example, in order to make the distinction between libraries that do and do
not support Unicode) becomes impossible.
Use of a default encoding is discouraged in the PEP, but this is one example 
of why it may be necessary.

So, returning to the extension module example above, we've seen that managing the
problem on the Python script side is difficult.  Another approach might be
to change our extension module to support Unicode:

PyObject *simple(PyObject *o, PyObject *args)
{
	Py_UNICODE *filename;
	if (!PyArg_ParseTuple(args, "u", &filename))
		return NULL;
	FILE *f = ... :-P

If the platform being used provides a Unicode-aware version of fopen(), then
there's no problem; but if not, it's necessary to first convert the Unicode
string to an encoding which _is_ supported on the platform:

PyObject *simple(PyObject *o, PyObject *args)
{
	Py_UNICODE *filename;
	char native_filename[MAX_FILE];

	if (!PyArg_ParseTuple(args, "u", &filename))
		return NULL;

#ifdef SJIS
	/* convert to SJIS */
#else
	/* convert to EUC */
#endif

	FILE *f = fopen(....)

I don't think anyone really wants to write code like this.
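
For what it's worth, the conversion could be written somewhat less painfully
with the "es" converter that PyArg_ParseTuple() already provides, which
encodes the argument at parse time.  This is only a rough sketch - the
hard-coded "shift_jis" below is just an example, and the module still has to
know which encoding the platform actually expects:

PyObject *simple(PyObject *o, PyObject *args)
{
	char *filename;	/* allocated by PyArg_ParseTuple(); free with PyMem_Free() */
	FILE *f;

	/* "es" encodes the string argument into the named codec */
	if (!PyArg_ParseTuple(args, "es", "shift_jis", &filename))
		return NULL;
	f = fopen(filename, "w");
	if (!f) {
		PyErr_SetFromErrno(PyExc_IOError);
		PyMem_Free(filename);
		return NULL;
	}
	fprintf(f, "spam");
	fclose(f);
	PyMem_Free(filename);
	Py_INCREF(Py_None);
	return Py_None;
}

Even in this form, every extension has to be modified, and choosing the right
encoding remains the extension author's problem.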

Besides adding complexity, it is also hard to ignore the additional processing
cost of having to convert incoming Unicode arguments.  Furthermore, this kind
of support isn't likely to be provided by European or American programmers,
since ISO-8859-1 coincides with the first 256 code points of Unicode, making
such explicit conversion unnecessary for applications which only use Latin-1
or ASCII.  (So non-American/European programmers will have to add support
themselves for the libraries they want to use.)

One of Python's strong points is that it makes it easy to wrap and use
existing C libraries - however, the great majority of these C libraries are
still not Unicode-compliant.  In that case, it becomes necessary to
add Unicode->native encoding support for all such C modules one-by-one, as
described above.  It's difficult to see what would be good about that.
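
To make the "one-by-one" cost concrete, here is a rough sketch, against the
Python 2-era C API, of the kind of helper each such wrapper would need.  The
names as_native() and legacy_library_call(), and the "shift_jis" default, are
only placeholders:

extern void legacy_library_call(const char *path);  /* stands in for any char*-based C API */

/* Encode a Python string object into the platform's legacy encoding so that
   a char*-based C library can use it.  Returns a new reference to a byte
   string, or NULL with an exception set. */
static PyObject *as_native(PyObject *obj, const char *encoding)
{
	if (PyUnicode_Check(obj))
		return PyUnicode_AsEncodedString(obj, encoding, "strict");
	if (PyString_Check(obj)) {	/* already a byte string - pass it through */
		Py_INCREF(obj);
		return obj;
	}
	PyErr_SetString(PyExc_TypeError, "string expected");
	return NULL;
}

/* ...and every wrapped function then repeats this dance: */
PyObject *wrapped_call(PyObject *o, PyObject *args)
{
	PyObject *arg, *native;

	if (!PyArg_ParseTuple(args, "O", &arg))
		return NULL;
	native = as_native(arg, "shift_jis");	/* or "euc-jp", or ... */
	if (!native)
		return NULL;
	legacy_library_call(PyString_AsString(native));
	Py_DECREF(native);

	Py_INCREF(Py_None);
	return Py_None;
}

Each call site pays for the conversion itself and for the extra reference
management, multiplied across every module that wraps a char*-based library.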

Some might react to the above by insisting, "These are just transitional
problems which will soon be solved.  If we restrict things to just a few
main platforms, then it won't become a big problem."  This position, however,
is flawed.  For example, on Windows 95, to say nothing of UNIX-based OS's,
Unicode support is only partial, and there is no Unicode version of fopen().
Considering the huge number of systems without Unicode support currently in
use around the world, we cannot ignore the importance of continuing to
support them.

In conclusion, if Python strings are made to hold only character data as
proposed in the pre-PEP, using extension modules with non-European languages
becomes much more difficult, and explicit encoding support has to be added in
many cases.  Python's current string implementation has important implications
for its use as a glue language in non-internationalized environments.

-Atsuo Ishimoto

The Japanese (original) version of this opinion is available at
http://www.gembook.org/moin/moin.cgi/OpinionForPepPythonCharacterModel
Comments / feedback appreciated.

P.S. I wonder what Tcl does with this?