[Python-Dev] PEP 393 close to pronouncement

"Martin v. Löwis" martin at v.loewis.de
Wed Sep 28 19:47:22 CEST 2011


> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.

No, codecs have been rewritten to not use resizing.
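To illustrate the pattern the rewritten decoders follow - this is
only a sketch built on the new PyUnicode_New/PyUnicode_WRITE API,
not the actual codec code, and error handling is minimal:

static PyObject *
decode_latin1_sketch(const unsigned char *buf, Py_ssize_t size)
{
    PyObject *result;
    Py_UCS4 maxchar = 0;
    Py_ssize_t i;
    int kind;
    void *data;

    /* First pass: find the widest character so the object can be
       allocated once, directly in its final representation. */
    for (i = 0; i < size; i++)
        if (buf[i] > maxchar)
            maxchar = buf[i];

    result = PyUnicode_New(size, maxchar);
    if (result == NULL)
        return NULL;

    /* Second pass: fill the object in place - no resizing needed. */
    kind = PyUnicode_KIND(result);
    data = PyUnicode_DATA(result);
    for (i = 0; i < size; i++)
        PyUnicode_WRITE(kind, data, i, buf[i]);
    return result;
}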

> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

That's the Py_UNICODE representation for backwards compatibility.
It's normally NULL.
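Roughly speaking (a sketch of the behavior, not the exact code
paths), the buffer only gets materialized when the legacy API is
actually called:

    /* A freshly created compact object has wstr == NULL. */
    PyObject *s = PyUnicode_FromString("hello");

    /* Only the deprecated Py_UNICODE API allocates and fills the
       wstr buffer, on demand. */
    Py_UNICODE *w = PyUnicode_AsUnicode(s);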

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

No, in the ASCII case, the UTF-8 length can be shared with the
regular string length - not so for Latin-1 characters above 127.
For example, "abc" is three code points and three UTF-8 bytes, so
the ASCII data can double as the UTF-8 representation, whereas
Latin-1 0xE9 ('é') is one code point but two UTF-8 bytes (C3 A9),
which forces a separate UTF-8 buffer and length.

> Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
> code will cause problems on some systems where wchar_t is a
> signed type.
> 
> Python assumes that Py_UNICODE is unsigned and thus doesn't
> check for negative values or take these into account when
> doing range checks or code point arithmetic.
> 
> On such platform where wchar_t is signed, it is safer to
> typedef Py_UNICODE to unsigned wchar_t.

No. Py_UNICODE values *must* be in the range 0..17*2**16-1
(i.e. at most U+10FFFF). Values above that range are just as bad
as negative values, so having Py_UNICODE unsigned doesn't
improve anything.
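Put differently, the validity check you need is the same either
way (a sketch):

    /* Whether the variable is signed or unsigned, anything outside
       0..0x10FFFF has to be rejected; casting to an unsigned type
       first lets one comparison catch both too-large and negative
       values. */
    if ((Py_UCS4)ch > 0x10FFFF) {
        PyErr_SetString(PyExc_ValueError, "character out of range");
        return NULL;
    }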

> Py_UNICODE access to the objects assumes that len(obj) ==
> length of the Py_UNICODE buffer. The PEP suggests that length
> should not take surrogates into account on UCS2 platforms
> such as Windows. This causes len(obj) to not match len(wstr).

Correct.

> As a result, Py_UNICODE access to the Unicode objects breaks
> when surrogate code points are present in the Unicode object
> on UCS2 platforms.

Incorrect. What specifically do you think would break?

> The PEP also does not explain how lone surrogates will be
> handled with respect to the length information.

Just as any other code point. Python does not special-case
surrogate code points anymore.

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points. A simple memcpy()
> is no longer enough.

No, it won't. The length of the Unicode object is stored in
the length field.
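PyUnicode_GET_LENGTH is O(1); stripped of its debugging
assertions it is essentially just a field access:

    #define PyUnicode_GET_LENGTH(op) \
        (((PyASCIIObject *)(op))->length)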

> I suggest to drop the idea of having len(obj) not count
> wstr surrogate code points to maintain backwards compatibility
> and allow for working with lone surrogates.

Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE
returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH
returns the true length of the Unicode object.
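For example, for a string s containing the single non-BMP
character U+1D11E (MUSICAL SYMBOL G CLEF), on a platform with a
16-bit wchar_t such as Windows:

    PyUnicode_GET_LENGTH(s)   /* == 1: one code point              */
    PyUnicode_GET_SIZE(s)     /* == 2: two Py_UNICODE units, i.e. a
                                 surrogate pair in the wstr buffer */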

> Note that the whole surrogate debate does not have much to
> do with this PEP, since it's mainly about memory footprint
> savings. I'd also urge to do a reality check with respect
> to surrogates and non-BMP code points: in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

That's the whole point of the PEP. You only pay for what
you actually need, and in most cases, it's ASCII.

> For best performance, each algorithm will have to be implemented
> for all three storage types.

This will be a trade-off. I think most developers will be happy
with a single version covering all three cases, especially as it's
much more maintainable.
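For what it's worth, the single version can be written once
against PyUnicode_READ and works for all three kinds; a rough
sketch (a kind-specialized variant would replace the macro with
direct indexing of a Py_UCS1/Py_UCS2/Py_UCS4 array):

static Py_ssize_t
count_char(PyObject *s, Py_UCS4 target)
{
    int kind = PyUnicode_KIND(s);
    void *data = PyUnicode_DATA(s);
    Py_ssize_t len = PyUnicode_GET_LENGTH(s);
    Py_ssize_t i, n = 0;

    /* PyUnicode_READ dispatches on the kind at runtime; that is
       the price of having only one copy of the loop. */
    for (i = 0; i < len; i++)
        if (PyUnicode_READ(kind, data, i) == target)
            n++;
    return n;
}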

Kind regards,
Martin
