PEP 393 close to pronouncement

Martin has asked me to pronounce on PEP 393, after he's updated it in response to various feedback (including mine :-). I'm currently looking very favorably on it, but I thought I'd give folks here one more chance to bring up showstoppers. So, if you have the time, please review PEP 393 and/or play with the code (the repo is linked from the PEP's References section now). Please limit your feedback to show-stopping issues; we're past the stage of bikeshedding here. It's Good Enough (TM) and we'll have the rest of the 3.3 release cycle to improve it incrementally. But we need to get to the point where the code can be committed to the 3.3 branch. In a few days I'll pronounce. -- --Guido van Rossum (python.org/~guido)

Hi, On Monday, 26 September 2011 at 23:00:06, Guido van Rossum wrote:
So, if you have the time, please review PEP 393 and/or play with the code (the repo is linked from the PEP's References section now).
I played with the code. The full test suite passes on Linux, FreeBSD and Windows. On Windows there is just one failure, in test_configparser; I haven't investigated it yet.

I like the new API: a classic loop over the string length, and a macro to read the nth character. Backward compatibility is fully transparent and is already well tested, because some modules still use the legacy API. It's quite easy to move from the legacy API to the new API. It's just boring, but it's almost done in the core (unicodeobject.c, and also some modules like _io). Since the introduction of PyASCIIObject, PEP 393 has a really good memory footprint, especially for ASCII-only strings; in Python you manipulate a lot of ASCII strings.

PEP
===

It's not clear what is deprecated. It would help to have a full list of the deprecated functions/macros.

Sometimes Martin writes PyUnicode_Ready, sometimes PyUnicode_READY. It's confusing.

Typo: PyUnicode_FAST_READY => PyUnicode_READY.

"PyUnicode_WRITE_CHAR" is not listed in the New API section.

Typo in "PyUnicode_CONVERT_BYTES(from_type, tp_type, begin, end, to)": tp_type => to_type.

"PyUnicode_Chr(ch)": why introduce a new function? Wasn't PyUnicode_FromOrdinal enough?

"GDB Debugging Hooks": it's not done yet.

"None of the functions in this PEP become part of the stable ABI (PEP 384)." Why? Some functions don't depend on the internal representation, like PyUnicode_Substring or PyUnicode_FindChar.

Typo: "In order to port modules to the new API, try to eliminate the use of these API elements: ... PyUnicode_GET_LENGTH ..." PyUnicode_GET_LENGTH is part of the new API; I suppose you mean PyUnicode_GET_SIZE.

Victor
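A minimal sketch of the loop pattern Victor describes, assuming the PEP 393 accessor macros (PyUnicode_READY, PyUnicode_GET_LENGTH, PyUnicode_KIND, PyUnicode_DATA, PyUnicode_READ); count_spaces() is an illustrative name, not actual CPython code:

#include <Python.h>

/* Count the spaces in a str object using the new, kind-independent API. */
static Py_ssize_t
count_spaces(PyObject *str)
{
    Py_ssize_t i, length, count = 0;
    int kind;
    void *data;

    if (PyUnicode_READY(str) == -1)      /* make sure the canonical form exists */
        return -1;
    length = PyUnicode_GET_LENGTH(str);  /* length in code points */
    kind = PyUnicode_KIND(str);          /* UCS1, UCS2 or UCS4 storage */
    data = PyUnicode_DATA(str);
    for (i = 0; i < length; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (ch == ' ')
            count++;
    }
    return count;
}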

Given the feedback so far, I am happy to pronounce PEP 393 as accepted. Martin, congratulations! Go ahead and mark it as Accepted. (But please do fix up the small nits that Victor reported in his earlier message.) -- --Guido van Rossum (python.org/~guido)

Guido van Rossum wrote:
I've been working on feedback for the last few days, but I guess it's too late. Here goes anyway... I've only read the PEP and not followed the discussion due to lack of time, so if any of this is no longer valid, that's probably because the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject does not support resizing, most decoders will have to use PyUnicodeObject and thus not benefit from the memory footprint advantages of e.g. PyASCIIObject.

Data structure
--------------

The data structure description in the PEP appears to be wrong: PyASCIIObject has a wchar_t *wstr pointer - I guess this should be a char *str pointer, otherwise, where's the memory footprint advantage (esp. on Linux where sizeof(wchar_t) == 4)?

I also don't see a reason to limit the UCS1 storage version to ASCII. Accordingly, the object should be called PyLatin1Object or PyUCS1Object. Here's the version from the PEP:

"""
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;
"""

Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing code will cause problems on some systems where wchar_t is a signed type. Python assumes that Py_UNICODE is unsigned and thus doesn't check for negative values or take these into account when doing range checks or code point arithmetic. On such platforms, where wchar_t is signed, it is safer to typedef Py_UNICODE to an unsigned wchar_t. Accordingly, and to prevent further breakage, Py_UNICODE should not be deprecated and should be used instead of wchar_t throughout the code.

Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) == length of the Py_UNICODE buffer. The PEP suggests that length should not take surrogates into account on UCS2 platforms such as Windows. This causes len(obj) to not match len(wstr). As a result, Py_UNICODE access to the Unicode objects breaks when surrogate code points are present in the Unicode object on UCS2 platforms.

The PEP also does not explain how lone surrogates will be handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over the data, checking for surrogate code points. A simple memcpy() is no longer enough.

I suggest dropping the idea of having len(obj) not count wstr surrogate code points, to maintain backwards compatibility and allow for working with lone surrogates.

Note that the whole surrogate debate does not have much to do with this PEP, since it's mainly about memory footprint savings. I'd also urge a reality check with respect to surrogates and non-BMP code points: in practice you only very rarely see any non-BMP code points in your data. Making all Python users pay for the needs of a tiny fraction is not really fair. Remember: practicality beats purity.

API
---

Victor already described the needed changes.

Performance
-----------

The PEP only lists a few low-level benchmarks as the basis for the performance decrease. I'm missing some more adequate real-life tests, e.g. using an application framework such as Django (to the extent this is possible with Python 3) or a server like the Radicale calendar server (which is available for Python 3).
I'd also like to see a performance comparison which specifically uses the existing Unicode APIs to create and work with Unicode objects. Most extensions will use this way of working with the Unicode API, either because they want to support Python 2 and 3, or because the effort it takes to port to the new APIs is too high. The PEP makes some statements that this is slower, but doesn't quantify those statements.

Memory savings
--------------

The table only lists string sizes up to 8 code points. The memory savings for these are really only significant for ASCII strings on 64-bit platforms, if you use the default UCS2 Python build as the basis. For larger strings, I expect the savings to be more significant. OTOH, a single non-BMP code point in such a string would cause the savings to drop significantly again.

Complexity
----------

In order to benefit from the new API, any code that has to deal with low-level Py_UNICODE access to the Unicode objects will have to be adapted. For best performance, each algorithm will have to be implemented for all three storage types. Not doing so will result in a slow-down, if I read the PEP correctly. It's difficult to say of what scale, since that information is not given in the PEP, but the added loop over the complete data array in order to determine the maximum code point value suggests that it is significant.

Summary
-------

I am not convinced that the memory savings are big enough to warrant the performance penalty and added complexity suggested by the PEP. In times where even smartphones come with multiple GB of RAM, performance is more important than memory savings.

In practice, using a UCS2 build of Python usually is a good compromise between memory savings, performance and standards compatibility. For the few cases where you have to deal with UCS4 code points, we have already made good progress in making these much easier to handle. IMHO, Python should be optimized for UCS2 usage, not the rare cases of UCS4 usage you find in practice. I do see the advantage for large strings, though.

My personal conclusion
----------------------

Given that I've been working on and maintaining the Python Unicode implementation actively, or by providing assistance, for almost 12 years now, I've also thought about whether it's still worth the effort.

My interests have shifted somewhat in other directions and I feel that helping Python reach world domination in other ways makes me happier than fighting over Unicode standards, implementations, special cases that aren't special enough, and all those other nitty-gritty details that cause long discussions :-)

So I feel that the PEP 393 change is a good time to draw a line and leave Unicode maintenance to Ezio, Victor, Martin, and all the others that have helped over the years. I know it's in good hands.

So here it is:

----------------------------------------------------------------

Hey, that was easy :-)

PS: I'll stick around a bit more for the platform module, pybench and whatever else comes along where you might be interested in my input.

Thanks and cheers, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 28 2011)

No, codecs have been rewritten to not use resizing.
That's the Py_UNICODE representation for backwards compatibility. It's normally NULL.
No, in the ASCII case, the UTF-8 length can be shared with the regular string length - not so for Latin-1 characters above 127.
No. Py_UNICODE values *must* be in the range 0..17*2**16. Values larger than 17*2**16 are just as bad as negative values, so having Py_UNICODE unsigned doesn't improve anything.
Correct.
Incorrect. What specifically do you think would break?
The PEP also does not explain how lone surrogates will be handled with respect to the length information.
Just as any other code point. Python does not special-case surrogate code points anymore.
No, it won't. The length of the Unicode object is stored in the length field.
Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH returns the true length of the Unicode object.
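To make the distinction concrete, a hedged sketch (not from the PEP itself) of how the two macros can differ on a build where Py_UNICODE/wchar_t is 16 bits wide, e.g. Windows; show_lengths() is an illustrative name:

#include <Python.h>
#include <stdio.h>

/* One astral (non-BMP) code point: the new length is 1, the legacy size may be 2. */
static void
show_lengths(void)
{
    PyObject *s;
    Py_ssize_t length, size;

    s = PyUnicode_FromOrdinal(0x1F600);
    if (s == NULL)
        return;
    length = PyUnicode_GET_LENGTH(s);  /* 1 code point on every build */
    size = PyUnicode_GET_SIZE(s);      /* Py_UNICODE units: 2 with a 16-bit wchar_t
                                          (a surrogate pair), 1 with a 32-bit wchar_t */
    printf("GET_LENGTH=%zd GET_SIZE=%zd\n", length, size);
    Py_DECREF(s);
}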
That's the whole point of the PEP. You only pay for what you actually need, and in most cases, it's ASCII.
For best performance, each algorithm will have to be implemented for all three storage types.
This will be a trade-off. I think most developers will be happy with a single version covering all three cases, especially as it's much more maintainable. Kind regards, Martin

Wrong. Even if you create a string using the legacy API (e.g. PyUnicode_FromUnicode), the string will be quickly compacted to use the most efficient memory storage (depending on the maximum character). "quickly": at the first call to PyUnicode_READY. Python tries to make all strings ready as early as possible.
For pure ASCII strings, you don't have to store a pointer to the UTF-8 string, nor the length of the UTF-8 string (in bytes), nor the length of the wchar_t string (in wide characters): the length is always the length of the "ASCII" string, and the UTF-8 string is shared with the ASCII string. The structure is much smaller thanks to these optimizations, and so Python 3.3 uses less memory than 2.7 for ASCII strings, even for short strings.
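A small sketch of the sharing Victor describes, assuming the compact-ASCII layout from the PEP where the character data doubles as the UTF-8 representation; ascii_utf8_sharing() is a made-up name:

#include <Python.h>
#include <assert.h>

/* For a pure-ASCII string, "encoding" to UTF-8 should hand back the
   string's own inline buffer instead of allocating a second one. */
static void
ascii_utf8_sharing(void)
{
    PyObject *s = PyUnicode_FromString("hello");
    const char *utf8;

    if (s == NULL)
        return;
    utf8 = PyUnicode_AsUTF8(s);
    assert(utf8 != NULL);
    assert(utf8 == (const char *)PyUnicode_DATA(s));  /* same buffer, no copy expected */
    Py_DECREF(s);
}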
Latin1 is less interesting: you cannot share the length/data fields with utf8 or wstr. We didn't add a special case for Latin1 strings (except using Py_UCS1* buffers to store their characters).
Wrong. len(obj) gives the "right" result (see the long discussion about what the length of a string is in a previous thread...) in O(1), since it's computed when the string is created.
The creation of the string is maybe a little bit slower (especially when you have to scan the string twice to first get the maximum character), but I think that this slow-down is smaller than the speedup allowed by the PEP.

Because ASCII strings are now char*, I think that processing ASCII strings is faster because the CPU can cache more data (close to the CPU). We can do better optimizations on ASCII and Latin1 strings (it's faster to manipulate char* than uint16_t* or uint32_t*). For example, str.center(), str.ljust(), str.rjust() and str.zfill() now use the very fast memset() function to pad Latin1 strings. Another example: duplicating a string (or creating a substring) should be faster just because you have less data to copy (e.g. 10 bytes for a string of 10 Latin1 characters vs 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With PEP 393, encoding to ASCII or UTF-8 is free: you don't have to encode anything, you directly have the encoded char* buffer (whereas you have to convert 16/32-bit wchar_t to char* in Python 3.2, even for pure ASCII). (It's also free to encode a "Latin1" Unicode string to Latin1.)

With PEP 393, we never have to decode UTF-16 anymore when iterating over code points to support non-BMP characters correctly (which was required before in narrow builds, e.g. on Windows). Iterating over code points is just a plain loop; there is no need to check whether each character is in the range U+D800-U+DFFF.

There are other funny tricks (optimizations). For example, text.replace(a, b) knows that there is nothing to do if maxchar(a) > maxchar(text), where maxchar(obj) just requires reading an attribute of the string. Think about ASCII and non-ASCII strings: pure_ascii.replace('\xe9', '') now just creates a new reference... I don't think that Martin wrote his PEP to be able to implement all these optimizations, but they are an interesting side effect of his PEP :-)
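A hedged sketch of the maxchar shortcut described above, using the PyUnicode_MAX_CHAR_VALUE macro; this is not the actual unicodeobject.c code, just an illustration of the early exit, with PyUnicode_Replace standing in for the general case:

#include <Python.h>

/* If the substring needs a wider character range than the text can hold,
   it cannot occur in the text, so replace() has nothing to do. */
static PyObject *
replace_with_shortcut(PyObject *text, PyObject *old, PyObject *repl)
{
    if (PyUnicode_READY(text) == -1 || PyUnicode_READY(old) == -1)
        return NULL;
    if (PyUnicode_MAX_CHAR_VALUE(old) > PyUnicode_MAX_CHAR_VALUE(text)) {
        Py_INCREF(text);        /* nothing to replace: return a new reference */
        return text;
    }
    return PyUnicode_Replace(text, old, repl, -1);  /* general case */
}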
In the 32 different cases, PEP 393 is better in 29 cases and "just" as good as Python 3.2 in 3 corner cases:
- 1 ASCII, 16-bit wchar, 32-bit
- 1 Latin1, 32-bit wchar, 32-bit
- 2 Latin1, 32-bit wchar, 32-bit
Do you really care about these corner cases? See the more realistic benchmark in Martin's previous email ("PEP 393 memory savings update"): PEP 393 not only uses 3x less memory than 3.2, it also uses *less* memory than Python 2.7, even though Python 3 uses Unicode for everything!
For larger strings, I expect the savings to be more significant.
Sure.
OTOH, a single non-BMP code point in such a string would cause the savings to drop significantly again.
In this case, it's just as good as Python 3.2 in wide mode, but worse than 3.2 in narrow mode. But is it a real use case? If you want really efficient storage for heterogeneous strings (mixing ASCII, Latin1, BMP and non-BMP), you can split the text into chunks. For example, I hope that a text processor like LibreOffice doesn't store all paragraphs in one string, but creates at least one string per paragraph. If you use short chunks, you will not notice the difference in memory footprint when you insert a non-BMP character. The trick doesn't work on Python < 3.3.
For best performance, each algorithm will have to be implemented for all three storage types. ...
Good performance can be achieved using the PyUnicode macros like PyUnicode_READ and PyUnicode_WRITE. But yes, if you want a super-fast Unicode processor, you can special-case some kinds (UCS1, UCS2, UCS4), like the examples I described before (use memset for Latin1).
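For example, a sketch of such kind-specialisation, assuming the PEP 393 kind constants and the PyUnicode_New/PyUnicode_WRITE APIs: building a padding string with a memset fast path for the UCS1 case. new_padding() is illustrative only, not CPython code:

#include <Python.h>
#include <string.h>

/* Build a string of `width` copies of `fill`, special-casing Latin-1. */
static PyObject *
new_padding(Py_ssize_t width, Py_UCS4 fill)
{
    PyObject *s = PyUnicode_New(width, fill);  /* storage sized for `fill` */
    int kind;
    void *data;
    Py_ssize_t i;

    if (s == NULL)
        return NULL;
    kind = PyUnicode_KIND(s);
    data = PyUnicode_DATA(s);
    if (kind == PyUnicode_1BYTE_KIND) {
        memset(data, (unsigned char)fill, (size_t)width);   /* fast Latin-1 path */
    }
    else {
        for (i = 0; i < width; i++)
            PyUnicode_WRITE(kind, data, i, fill);           /* generic path */
    }
    return s;
}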
... Not doing so will result in a slow-down, if I read the PEP correctly.
I don't think so. Browse the new unicodeobject.c: there are few switch/cases on the kind (if you ignore low-level functions like _PyUnicode_Ready). For example, unicode_isalpha() has only one implementation, using PyUnicode_READ. PyUnicode_READ doesn't use a switch but classic (fast) pointer arithmetic.
Feel free to run Antoine's benchmarks like stringbench and iobench yourself; they are micro-benchmarks. But you have to know that very few codecs use the new Unicode API (I think that only the UTF-8 encoder and decoder use the new API, maybe also the ASCII codec).
I didn't run any benchmarks, but I don't think that PEP 393 makes Python slower. I expect a minor speedup in some corner cases :-) I prefer to wait until all modules are converted to the new API before running benchmarks. TODO: unicodedata, _csv, all codecs (especially error handlers), ...
About "standards compatibility", the work to support non-BMP characters everywhere was not finished in Python 3.2, 11 years after the introduction of Unicode in Python (2.0). Using the new API, non-BMP characters will be supported for free, everywhere (especially in *Python*, "\U0010FFFF"[0] and len("\U0010FFFF") doesn't give surprising results anymore). With the addition of emoticon in a non-BMP range in Unicode 6, non-BMP characters will become more and more common. Who doesn't like emoticon? :-) o;-) >< (no, I will no add non-BMP characters in this email, I don't want to crash your SMTP server and mail client)
IMHO, Python should be optimized for UCS2 usage
With PEP 393, it's better: Python is optimized for any usage! (But I expect it to be faster in the Latin1 range, U+0000-U+00FF.)
I do see the advantage for large strings, though.
A friend reads Martin's last benchmark differently: Python 3.2 uses 3x more memory than Python 2! Can I say that PEP 393 fixed a huge regression in Python 3?
Thanks for your huge work on Unicode, Marc-Andre!
Someone said that we still need to define what a character is! By the way, what is a code point?
I don't understand why you would like to stop contributing to Unicode, but well, as you want. We will try to continue your work. Victor

Victor Stinner wrote:
Thanks. I enjoyed working on it, but priorities are different now, and new projects are waiting :-)
I'll leave that as an exercise for the interested reader to find out :-) (Hint: Google should find enough hits where I've explained those things on various mailing lists and in talks I gave.)
I only have limited time available for these things and am nowadays more interested in getting others to recognize just how great Python is, than actually sitting down and writing patches for it. Unicode was my baby for quite a few years, but I now have two kids which need more love and attention :-)
well, as you want. We will try to continue your work.
Thanks. Cheers, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 11 2011)

participants (6)
- "Martin v. Löwis"
- Benjamin Peterson
- David Malcolm
- Guido van Rossum
- M.-A. Lemburg
- Victor Stinner