[I18n-sig] Re: gettext in the standard library

François Pinard pinard@iro.umontreal.ca
04 Sep 2000 17:26:42 -0400

[Martin von Loewis]

> > > In Python 2, unicode strings are a separate type from byte strings.
> > > The catalog objects will have two methods, one for retrieving a byte
> > > string, as it appears in the mo file, and one for retrieving a unicode
> > > string.  It is then the application developer's choice whether his
> > > application can deal with Unicode messages on output or not.
> > 
> > You are merely re-stating that there is a special API for Unicode, here.
> > I got this already! :-).  My question is about why it is necessary.

> Which part do you deem unnecessary?  The part returning a byte string,
> or the part returning a Unicode string?

Any part in which one has to make a distinction between both types of
strings.  Let's have the translator function returning a string.  It is
not important to know which kind of string.  Python takes care of what
needs care, anyway.  It should be fairly transparent to the programmer,
and our API should be just as transparent.  Shouldn't it?

> So you are proposing that an application cannot tell in advance what
> the return type of _ will be? In some application, writing

> header = '\x01\x01'
> body   = _('warning')
> message = header + body

Perfect.  No problem.  Python will do something proper, whatever the type
of string which `body' receives...
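
The coercion being relied upon here can be sketched as follows.  Present-day Python 3 no longer mixes the two types implicitly, so the helper `py2_concat` below is a hypothetical emulation of the Python 2 rule (a byte string meeting a Unicode string is decoded as ASCII first), written only for illustration.

```python
def py2_concat(left, right):
    """Hypothetical emulation of Python 2's str + unicode coercion."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        # Two byte strings: plain byte concatenation, no decoding involved.
        return left + right
    # Mixed operands: any byte string is decoded with the ASCII codec.
    if isinstance(left, bytes):
        left = left.decode("ascii")
    if isinstance(right, bytes):
        right = right.decode("ascii")
    return left + right


header = b"\x01\x01"
body = "avertissement"      # what a Unicode-returning _('warning') might give
message = py2_concat(header, body)
```

Under this rule the result is a Unicode string whenever either operand is one.  A byte string holding bytes above 0x7F would raise an error instead, so the transparency holds as long as byte strings stay within ASCII.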

> I think it was decided not to include the JIS something tables in the
> Python 2 distribution, because they are too large to include.

Then, working with JIS translations would require that Japanese users
fetch the JIS tables from other sources.  A script written for JIS will
need such tables, wherever they come from.

It would be nicer if Python was offering them, but...  Hmph! :-)

> In Python 2.0, developers should be aware at all times whether they
> operate on Unicode strings or on byte strings.  Python will try to do the
> right thing if there is a clear right thing, and try to raise exceptions
> whenever it is not so clear what the right thing would be.

I thought that every effort was made (at least for 1.6a1 and 1.6a2) so that
developers would just _not_ need to be aware of the type of strings.  Is 2.0
different?  Or did I wholly miss the issue?  It would make me sad...

If I missed the issue, you may dismiss many things among what I wrote,
as we are then not reasoning on the same grounds.  If elegance has already
been lost from the start, surely, there is no need for me to keep trying to
preserve it, and I'm a mere kibitzer :-(.  Tell me before I make a fool
of myself...  Oh!  It is too late already? :-)

> > If something else is needed on output, I thought the intent was to
> > override UTF-8 as an output encoding, yet still use Unicode internally,
> > instead of any MBCS, taking advantage of all the magic Python 2.0 will
> > have in that respect.

> Maybe it's a terminology issue: I consider UTF-8 as a MBCS (multi-byte
> character set); UTF-8 strings are byte strings, not Unicode strings.

I thought that, by using some 8-bit API instead of some Unicode API for
translation matters, you were intending to handle MBCS directly, all over,
instead of relying on Unicode strings.

> > Otherwise, you have to make your Python script aware of those coding
> > a lot more, internationalisation becomes much more intrusive in your
> > sources, while we wanted it to be as light weight as possible.

> I simply want to give users a choice.  If they chose to "let's try
> Unicode", they have the choice.  If they find it all works, well.
> Otherwise, they can go for byte strings, with a different set of
> limitations.

Shouldn't we just have confidence that Python works?  I would rather
see programmers just using strings and then, playing interactively, or
looking at their output, have a slight and momentary astonishment, saying:
"Hey, things apparently turned Unicode at some point", be satisfied by
the results anyway, and not bother much more about the issue.

If we put unusual exceptions aside (like "English" translation, or
Netherlands), users' experience could be that things just happen to work
in ASCII when no translation is requested, and just happen to use Unicode

> > > Also, how would goal language determine whether Unicode is a better
> > > representation for messages than some MBCS?

> I did not really ask for an opinion, I asked for an algorithm:

> def mbcs_p(parameters):
>   your code here

If we get Unicode out of the translating routine, there should not be much
more needed, except maybe a final encoding of the output stream.  This,
I feel we did not discuss enough yet (how to connect the translation
function to the output stream encoding, as transparently as possible).
But once again, maybe I missed so much of the whole point about Unicode
and Python, that none of my remarks hold.

François Pinard   http://www.iro.umontreal.ca/~pinard