[Python-Dev] just say no...

M.-A. Lemburg mal@lemburg.com
Sat, 13 Nov 1999 10:37:35 +0100


Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> >...
> > > why?  I don't understand why "s" and "s#" have
> > > to deal with encoding issues at all...
> > >
> > > > unless, of course, you want to give up Unicode object support
> > > > for all APIs using these parsers.
> > >
> > > hmm.  maybe that's exactly what I want...
> >
> > If we don't add that support, lots of existing APIs won't
> > accept Unicode objects instead of strings. While it could be
> > argued that automatic conversion to UTF-8 is not transparent
> > enough for the user, the other solution of using str(u)
> > everywhere would probably make writing Unicode-aware code a
> > rather clumsy task and introduce other pitfalls, since str(obj)
> > calls PyObject_Str() which also works on integers, floats,
> > etc.
> 
> No no no...
> 
> "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
> supposed to return the raw bytes.

[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#"
and the extra buffer:

First, we have a general design question here: should old code
become Unicode compatible or not? As I recall, the original
idea behind the Unicode integration was to follow Perl's lead
and have scripts become Unicode aware by simply adding a
'use utf8;'.

If this is still the case, then we'll have to come up with a
reasonable approach for integrating the classical string-based
APIs with the new type.

Since UTF-8 is a standard (some, e.g. the Latin-1 folks, would
probably prefer UTF-7,5) which has some very nice features (see
http://czyborra.com/utf/ ) and which is a true extension of
ASCII, this encoding seems best suited for the purpose.

However, one should not forget that UTF-8 is in fact a
variable-length encoding of Unicode characters: up to 3 bytes
can form a *single* character. This is obviously not compatible
with definitions that explicitly state the data to be in an
8-bit single-character encoding; indexing in UTF-8, for
example, doesn't work like it does in Latin-1 text.
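
To make the indexing problem concrete, here is a minimal
standalone C snippet (an illustration only, not part of any
proposed API); it shows byte offsets and character offsets
drifting apart as soon as a multi-byte character appears:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "h" followed by e-acute in UTF-8: 'h' is one byte,
       U+00E9 becomes the two bytes 0xC3 0xA9. */
    const char *utf8 = "h\xC3\xA9";

    printf("bytes: %lu\n", (unsigned long)strlen(utf8)); /* 3, not 2 */

    /* Indexing by byte lands in the middle of a character: */
    printf("byte[1]: 0x%02X\n", (unsigned char)utf8[1]); /* 0xC3 */
    return 0;
}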

So if we are to do the integration, we'll have to choose
argument parser markers that allow for multi-byte characters.
"t#" does not fall into this category, "s#" certainly does,
and "s" is arguable.

Also note that we have to watch out for embedded NULL bytes.
UTF-16 has NULL bytes for every character from the Latin-1
range. If "s" were to give back a pointer to the internal
buffer, which is encoded in UTF-16, you would lose data at the
first such NULL byte. UTF-8 doesn't have this problem, since
only the NULL character maps to a (single) NULL byte.
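
Again as a small standalone illustration (not Python API
code): the UTF-16 encoding of plain "AB" already contains
NULL bytes, so any NUL-terminated C string API truncates it
right away:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "AB" in UTF-16 big-endian: every Latin-1 character
       carries a leading NULL byte. */
    const char utf16be[] = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };

    /* strlen() stops at the very first byte, silently
       truncating the data to an empty string. */
    printf("strlen() sees %lu bytes\n",
           (unsigned long)strlen(utf16be));
    return 0;
}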

Now Greg would chime in with the buffer interface and
argue that it should make the underlying internal
format accessible. This is a bad idea, IMHO, since you
shouldn't really have to know what the internal data format
is.

Defining "s#" to return UTF-8 data does not only
make "s" and "s#" return the same data format (which should
always be the case, IMO), but also hides the internal
format from the user and gives him a reliable cross-platform
data representation of Unicode data (note that UTF-8 doesn't
have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#"
do: they return pointers into data areas which have to
be kept alive until the corresponding object dies.

The only way to support this feature is by allocating a buffer
for just this purpose (on the fly, and only when needed, to
prevent excessive memory load). The other options, adding new
magic parser markers or switching to a more generic one, all
have the same downside: you need to change existing code, which
conflicts with the idea we started out with.
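
For illustration, here is roughly how such an on-the-fly
buffer could work. The struct layout and the names
(utf8_cache, encode_utf8, unicode_object) are made up for this
sketch; they are not taken from any actual implementation:

typedef struct {
    /* ... object header and internal (UTF-16) data ... */
    char *utf8_cache;   /* NULL until first "s"/"s#" access */
} unicode_object;

/* Stand-in for whatever encoder the implementation provides;
   hypothetical. */
extern char *encode_utf8(unicode_object *self);

static const char *as_utf8(unicode_object *self)
{
    /* Allocate the UTF-8 buffer lazily, only when some API
       actually asks for it via "s"/"s#". */
    if (self->utf8_cache == NULL)
        self->utf8_cache = encode_utf8(self);

    /* The buffer is owned by the object and freed in its
       deallocator, so the pointer stays valid until the
       object dies -- exactly the lifetime "s" and "s#"
       promise. */
    return self->utf8_cache;
}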

So, again, the question is: do we want this magical
integration or not? Note that this is a design question, not
one of memory consumption...

--

Ok, the above covered the Unicode -> String conversion. Mark
mentioned that he wanted the other direction to work in the
same fashion as well, i.e. automatic String -> Unicode
conversion.

This could be handled in the same way, by interpreting the
string as UTF-8 encoded Unicode... but we run into the same
problem: where to put the data without generating new
intermediate objects. Since only newly written code will use
this feature, there is a way out, though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8, there's nothing more to do;
if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new reference if obj
is a Unicode object, or create a new Unicode object by
interpreting str(obj) as a UTF-8 encoded string.
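
In other words, something along these lines (a rough sketch of
the described semantics against the string/Unicode C API of
that era; the function name and body are mine, not the actual
implementation):

static PyObject *unicode_from_object(PyObject *obj)
{
    PyObject *str, *unicode;

    if (PyUnicode_Check(obj)) {
        Py_INCREF(obj);        /* just hand out a new reference */
        return obj;
    }

    str = PyObject_Str(obj);   /* str(obj); works on ints, floats, ... */
    if (str == NULL)
        return NULL;

    /* Interpret the string's bytes as UTF-8 encoded Unicode. */
    unicode = PyUnicode_DecodeUTF8(PyString_AS_STRING(str),
                                   PyString_GET_SIZE(str),
                                   "strict");
    Py_DECREF(str);
    return unicode;
}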

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/