[Python-Dev] PEP 393 close to pronouncement

M.-A. Lemburg mal at egenix.com
Wed Sep 28 18:44:23 CEST 2011


Guido van Rossum wrote:
> Given the feedback so far, I am happy to pronounce PEP 393 as
> accepted. Martin, congratulations! Go ahead and mark it as Accepted.
> (But please do fix up the small nits that Victor reported in his
> earlier message.)

I've been working on feedback for the last few days, but I guess it's
too late. Here goes anyway...

I've only read the PEP and not followed the discussion due to lack of
time, so if any of this is no longer valid, that's probably because
the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject
does not support resizing, most decoders will have to use
PyUnicodeObject and thus not benefit from the memory footprint
advantages of e.g. PyASCIIObject.
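
To illustrate the pattern I mean, here is a minimal decoder sketch
(my own illustration, not code from the PEP or the stdlib): it
over-allocates a Py_UNICODE buffer for the worst case and shrinks it
afterwards with PyUnicode_Resize(). The helper name and the
"ignore non-ASCII bytes" behaviour are made up for the example.

#include "Python.h"

/* Sketch only: decode ASCII bytes, silently dropping anything else. */
static PyObject *
decode_ascii_ignore_sketch(const char *input, Py_ssize_t size)
{
    PyObject *unicode;
    Py_UNICODE *p;
    Py_ssize_t i, outpos = 0;

    /* Worst case: one Py_UNICODE per input byte. */
    unicode = PyUnicode_FromUnicode(NULL, size);
    if (unicode == NULL)
        return NULL;

    p = PyUnicode_AS_UNICODE(unicode);
    for (i = 0; i < size; i++) {
        unsigned char c = (unsigned char)input[i];
        if (c < 0x80)
            p[outpos++] = (Py_UNICODE)c;
    }

    /* Shrink to the actual output length -- the resizing step that a
       non-resizable compact object cannot offer. */
    if (PyUnicode_Resize(&unicode, outpos) < 0) {
        Py_XDECREF(unicode);
        return NULL;
    }
    return unicode;
}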


Data structure
--------------

The data structure description in the PEP appears to be wrong:

PyASCIIObject has a wchar_t *wstr pointer - I guess this should
be a char *str pointer; otherwise, where's the memory footprint
advantage (especially on Linux, where sizeof(wchar_t) == 4)?

I also don't see a reason to limit the UCS1 storage version
to ASCII. Accordingly, the object should be called PyLatin1Object
or PyUCS1Object.

Here's the version from the PEP:

"""
typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;
"""

Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
code will cause problems on systems where wchar_t is a
signed type.

Python assumes that Py_UNICODE is unsigned and thus doesn't
check for negative values or take these into account when
doing range checks or code point arithmetic.

On platforms where wchar_t is signed, it is safer to
typedef Py_UNICODE to an unsigned integer type of the same
size as wchar_t.

Accordingly, and to prevent further breakage, Py_UNICODE
should not be deprecated and should continue to be used
instead of wchar_t throughout the code.
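
As a small illustration of the signedness issue (my example, not code
from the PEP), consider a surrogate range check on a platform where a
16-bit unit with the high bit set gets sign-extended into a signed
wchar_t:

#include <stdio.h>
#include <wchar.h>

/* Correct for unsigned values; with a sign-extended 0xD800 the value
   is negative and the lower-bound test fails. */
static int
is_high_surrogate(wchar_t ch)
{
    return ch >= 0xD800 && ch <= 0xDBFF;
}

int
main(void)
{
    short raw = (short)0xD800;  /* 16-bit unit with the high bit set */
    wchar_t ch = raw;           /* sign-extended if wchar_t is signed */
    printf("%d\n", is_high_surrogate(ch));  /* 0 where wchar_t is signed,
                                               1 where it is unsigned */
    return 0;
}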


Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) ==
length of the Py_UNICODE buffer. The PEP suggests that length
should not take surrogates into account on UCS2 platforms
such as Windows. This causes len(obj) to no longer match
len(wstr).

As a result, Py_UNICODE access to the Unicode objects breaks
when surrogate code points are present in the Unicode object
on UCS2 platforms.

The PEP also does not explain how lone surrogates will be
handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over
the data, checking for surrogate code points. A simple memcpy()
is no longer enough.
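
For illustration, this is the kind of extra pass I have in mind,
assuming a UTF-16 wstr buffer (a sketch based on my reading of the
PEP, not code taken from it):

#include <stddef.h>

/* Count code points in a UTF-16 buffer: a high/low surrogate pair is
   one code point but two units, so the unit count alone is not enough. */
static size_t
count_code_points_utf16(const unsigned short *buf, size_t units)
{
    size_t i = 0, count = 0;

    while (i < units) {
        if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF &&
            i + 1 < units &&
            buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF)
            i += 2;          /* surrogate pair: one non-BMP code point */
        else
            i += 1;          /* BMP code point or lone surrogate */
        count++;
    }
    return count;
}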

I suggest dropping the idea of having len(obj) not count
wstr surrogate code points, in order to maintain backwards
compatibility and allow working with lone surrogates.

Note that the whole surrogate debate does not have much to
do with this PEP, since it's mainly about memory footprint
savings. I'd also urge a reality check with respect
to surrogates and non-BMP code points: in practice you only
very rarely see any non-BMP code points in your data. Making
all Python users pay for the needs of a tiny fraction is
not really fair. Remember: practicality beats purity.


API
---

Victor already described the needed changes.


Performance
-----------

The PEP only lists a few low-level benchmarks as the basis for
assessing the performance decrease. I'm missing more adequate
real-life tests, e.g. using an application framework such as Django
(to the extent this is possible with Python 3) or a server
like the Radicale calendar server (which is available for Python 3).

I'd also like to see a performance comparison which specifically
uses the existing Unicode APIs to create and work with Unicode
objects. Most extensions will use this way of working with the
Unicode API, either because they want to support Python 2 and 3,
or because the effort it takes to port to the new APIs is
too high. The PEP makes some statements that this is slower,
but doesn't quantify those statements.
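
To make concrete what I mean by "existing Unicode APIs", here is a
minimal sketch of the legacy Py_UNICODE-based path such a benchmark
would exercise (my example; the comments reflect my understanding of
the PEP, not measured behaviour):

#include "Python.h"

static PyObject *
build_via_legacy_api(void)
{
    Py_UNICODE buf[4] = {'t', 'e', 's', 't'};
    PyObject *obj;
    Py_UNICODE *p;

    /* Under PEP 393 this presumably needs an extra scan/conversion to
       pick a storage kind; previously it was essentially one
       allocation plus a memcpy(). */
    obj = PyUnicode_FromUnicode(buf, 4);
    if (obj == NULL)
        return NULL;

    /* Reading the data back through the old accessor may in turn
       require materializing a separate wchar_t representation. */
    p = PyUnicode_AS_UNICODE(obj);
    (void)p;

    return obj;
}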


Memory savings
--------------

The table only lists string sizes up to 8 code points. The
memory savings for these are really only significant for ASCII
strings on 64-bit platforms, if you use the default UCS2
Python build as the basis.

For larger strings, I expect the savings to be more significant.
OTOH, a single non-BMP code point in such a string would cause
the savings to drop significantly again.
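
Rough arithmetic to illustrate the scaling argument, counting only the
character payload and ignoring the roughly constant object headers
(my own back-of-the-envelope figures, not numbers from the PEP):

#include <stdio.h>

int
main(void)
{
    size_t n = 1000;  /* code points in the string */

    printf("narrow (UCS2) build:     %zu bytes\n", 2 * n);
    printf("PEP 393, Latin-1 text:   %zu bytes\n", 1 * n);
    /* One non-BMP code point forces the whole string into UCS4. */
    printf("PEP 393, one non-BMP ch: %zu bytes\n", 4 * n);
    return 0;
}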


Complexity
----------

In order to benefit from the new API, any code that has to
deal with low-level Py_UNICODE access to the Unicode objects
will have to be adapted.

For best performance, each algorithm will have to be implemented
for all three storage types.

Not doing so will result in a slow-down, if I read the PEP
correctly. It's difficult to say of what scale, since that
information is not given in the PEP, but the added loop over
the complete data array to determine the maximum
code point value suggests that it is significant.
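
As an illustration of what "implemented for all three storage types"
means in practice, here is a sketch of a maximum-code-point scan using
the PEP's access macros (my example, not code from the PEP). The
generic PyUnicode_READ() variant hides a per-character switch on the
kind; avoiding that overhead means writing three specialized loops,
one each over Py_UCS1, Py_UCS2 and Py_UCS4 data.

#include "Python.h"

static Py_UCS4
max_code_point(PyObject *str)
{
    int kind;
    void *data;
    Py_ssize_t i, len;
    Py_UCS4 maxch = 0;

    if (PyUnicode_READY(str) < 0)
        return (Py_UCS4)-1;   /* conversion to canonical form failed */

    kind = PyUnicode_KIND(str);
    data = PyUnicode_DATA(str);
    len  = PyUnicode_GET_LENGTH(str);

    /* Generic loop: PyUnicode_READ() dispatches on the kind for every
       single character. */
    for (i = 0; i < len; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (ch > maxch)
            maxch = ch;
    }
    return maxch;
}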


Summary
-------

I am not convinced that the memory savings are big enough
to warrant the performance penalty and added complexity
suggested by the PEP.

In times where even smartphones come with multiple GB of RAM,
performance is more important than memory savings.

In practice, using a UCS2 build of Python usually is a good
compromise between memory savings, performance and standards
compatibility. For the few cases where you have to deal with
UCS4 code points, we have already made good progress in making
these much easier to handle. IMHO, Python should be
optimized for UCS2 usage, not the rare cases of UCS4 usage
you find in practice.

I do see the advantage for large strings, though.


My personal conclusion
----------------------

Given that I've been working on and maintaining the Python Unicode
implementation actively or by providing assistance for almost
12 years now, I've also thought about whether it's still worth
the effort.

My interests have shifted somewhat into other directions and
I feel that helping Python reach world domination in other ways
makes me happier than fighting over Unicode standards, implementations,
special cases that aren't special enough, and all those other
nitty-gritty details that cause long discussions :-)

So I feel that the PEP 393 change is a good time to draw a line
and leave Unicode maintenance to Ezio, Victor, Martin, and
all the others that have helped over the years. I know it's
in good hands. So here it is:

----------------------------------------------------------------

Hey, that was easy :-)

PS: I'll stick around a bit more for the platform module, pybench
and whatever else comes along where you might be interested in my
input.

Thanks and cheers,
-- 
Marc-Andre Lemburg
eGenix.com


