I have been thinking about Unicode representation for some time now.
This was triggered, on the one hand, by discussions with Glyph Lefkowitz
(who complained that his server app consumes too much memory), and Carl
Friedrich Bolz (who profiled Python applications to determine that
Unicode strings are among the top consumers of memory in Python).
On the other hand, this was triggered by the discussion on supporting
surrogates in the library better.
I'd like to propose PEP 393, which takes a different approach,
addressing both problems simultaneously: by getting a flexible
representation (one that can be either 1, 2, or 4 bytes), we can
support the full range of Unicode on all systems, but still use
only one byte per character for strings that are pure ASCII (which
will be the majority of strings for the majority of users).
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
For convenience, I include it below.
Regards,
Martin
PEP: 393
Title: Flexible String Representation
Version: $Revision: 88168 $
Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
Author: Martin v. Löwis
On Mon, 24 Jan 2011 21:17:34 +0100
"Martin v. Löwis"
I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applications to determine that Unicode strings are among the top consumers of memory in Python). On the other hand, this was triggered by the discussion on supporting surrogates in the library better.
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
For this kind of experiment, I think a concrete attempt at implementing (together with performance/memory savings numbers) would be much more useful than an abstract proposal. It is hard to judge the concrete effects of the changes you are proposing, even though they might (or not) make sense in theory. For example, you are adding a lot of constant overhead to every unicode object, even very small ones, which might be detrimental. Also, accessing the unicode object's payload can become quite a bit more cumbersome. Only implementing can tell how much this is workable in practice. Regards Antoine.
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
For this kind of experiment, I think a concrete attempt at implementing (together with performance/memory savings numbers) would be much more useful than an abstract proposal.
I partially agree. An implementation is certainly needed, but there is nothing wrong (IMO) with designing the change before implementing it. Also, several people have offered to help with the implementation, so we need to agree on a specification first (which is actually cheaper than starting with the implementation only to find out that people misunderstood each other). Regards, Martin
Le mardi 25 janvier 2011 à 00:07 +0100, "Martin v. Löwis" a écrit :
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
For this kind of experiment, I think a concrete attempt at implementing (together with performance/memory savings numbers) would be much more useful than an abstract proposal.
I partially agree. An implementation is certainly needed, but there is nothing wrong (IMO) with designing the change before implementing it. Also, several people have offered to help with the implementation, so we need to agree on a specification first (which is actually cheaper than starting with the implementation only to find out that people misunderstood each other).
I'm not sure it's really cheaper. When implementing you will probably find out that it makes more sense to change the meaning of some fields, add or remove some, etc. You will also want to try various tweaks since the whole point is to lighten the footprint of unicode strings in common workloads. So, the only criticism I have, intuitively, is that the unicode structure seems to become a bit too large. For example, I'm not sure you need a generic (pointer, size) pair in addition to the representation-specific ones. Incidentally, to slightly reduce the overhead of unicode objects, there's this proposal: http://bugs.python.org/issue1943 Regards Antoine.
On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou
Le mardi 25 janvier 2011 à 00:07 +0100, "Martin v. Löwis" a écrit :
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
For this kind of experiment, I think a concrete attempt at implementing (together with performance/memory savings numbers) would be much more useful than an abstract proposal.
I partially agree. An implementation is certainly needed, but there is nothing wrong (IMO) with designing the change before implementing it. Also, several people have offered to help with the implementation, so we need to agree on a specification first (which is actually cheaper than starting with the implementation only to find out that people misunderstood each other).
I'm not sure it's really cheaper. When implementing you will probably find out that it makes more sense to change the meaning of some fields, add or remove some, etc. You will also want to try various tweaks since the whole point is to lighten the footprint of unicode strings in common workloads.
Yep. This is only a proposal, an implementation will allow all of that to be experimented with. I have frequently seen code today, even in python 2.x, that suffers greatly from unicode vs string use (due to APIs in some code that were returning unicode objects unnecessarily when the data was really all ascii text). python 3.x only increases this as the default for so many things passes through unicode even for programs that may not need it.
So, the only criticism I have, intuitively, is that the unicode structure seems to become a bit too large. For example, I'm not sure you need a generic (pointer, size) pair in addition to the representation-specific ones.
I believe the intent this PEP is aiming at is for the existing in-memory structure to be compatible with already compiled binary extension modules without having to recompile them or change the APIs they are using. Personally I don't care at all about preserving that level of binary compatibility, it has been convenient in the past but is rarely the right thing to do. Of course I'd personally like to see PyObject nuked and revisited, it is too large and is probably not cache line efficient.
Incidentally, to slightly reduce the overhead of unicode objects, there's this proposal: http://bugs.python.org/issue1943
Interesting. But that aims more at cpu performance than memory overhead. What I see is programs that predominantly process ascii data yet waste memory on a 2-4x data explosion of the internal representation. This PEP aims to address that larger target. -gps
Le mercredi 26 janvier 2011 à 21:50 -0800, Gregory P. Smith a écrit :
Incidentally, to slightly reduce the overhead of unicode objects, there's this proposal: http://bugs.python.org/issue1943
Interesting. But that aims more at cpu performance than memory overhead. What I see is programs that predominantly process ascii data yet waste memory on a 2-4x data explosion of the internal representation. This PEP aims to address that larger target.
Right, but we should keep in mind that many unicode strings will not be very large, and so the constant overhead of unicode objects is not necessarily negligible. Regards Antoine.
I believe the intent this PEP is aiming at is for the existing in-memory structure to be compatible with already compiled binary extension modules without having to recompile them or change the APIs they are using.
No, binary compatibility is not achieved. ABI-conforming modules will continue to work even under this change, but only because access to the unicode object internal representation is not available to the restricted ABI.
Personally I don't care at all about preserving that level of binary compatibility, it has been convenient in the past but is rarely the right thing to do. Of course I'd personally like to see PyObject nuked and revisited, it is too large and is probably not cache line efficient.
That's a different PEP :-) Regards, Martin
So, the only criticism I have, intuitively, is that the unicode structure seems to become a bit too large. For example, I'm not sure you need a generic (pointer, size) pair in addition to the representation-specific ones.
It's not really a generic pointer, but rather a variable-sized pointer. It may not fit into any of the other representations (e.g. if there is a four-byte wchar_t, then a two-byte representation would fit neither into the UTF-8 pointer nor into the wchar_t pointer).
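To make the sharing rules concrete, here is a minimal sketch (not PEP API; "kind" and "max_char" are invented names for the character width in bytes and the largest code point in the string):

    #include <stddef.h>
    #include <wchar.h>

    /* Sketch only: utf8 can share with str only for pure ASCII,
       and wstr can share only when the widths match exactly. */
    static int can_share_utf8(int kind, unsigned long max_char) {
        return kind == 1 && max_char < 128;
    }
    static int can_share_wstr(int kind) {
        return (size_t)kind == sizeof(wchar_t);  /* 2-on-2 or 4-on-4 */
    }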
Incidentally, to slightly reduce the overhead of unicode objects, there's this proposal: http://bugs.python.org/issue1943
I wonder what aspects of this patch and discussion should be integrated into the PEP. The notion of allocating the memory in the same block is already considered in the PEP; what else might be relevant? Input is welcome! Regards, Martin
Incidentally, to slightly reduce the overhead of unicode objects, there's this proposal: http://bugs.python.org/issue1943
I wonder what aspects of this patch and discussion should be integrated into the PEP. The notion of allocating the memory in the same block is already considered in the PEP; what else might be relevant?
Ok, I'm sorry for not reading the PEP carefully enough, then. The patch does a couple of other tweaks such as making "state" a char rather than an int, and changing the freelist algorithm. But the latter doesn't need to be spelled out in a PEP anyway. Regards Antoine.
On Mon, 2011-01-24 at 21:17 +0100, "Martin v. Löwis" wrote: ... snip ...
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
There was some discussion about this at PyCon 2010, where we referred to it casually as "Pay-as-you-go unicode" ... snip ...
- str: shortest-form representation of the unicode string; the lower two bits of the pointer indicate the specific form: 01 => 1 byte (Latin-1); 11 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
Repetition of "11"; I'm guessing that the 2byte/UCS-2 should read "10", so that they give the width of the char representation.
00 => null pointer
Naturally this assumes that all pointers are at least 4-byte aligned (so that they can be masked off). I assume that this is sane on every platform that Python supports, but should it be spelled out explicitly somewhere in the PEP?
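For illustration, the tagging scheme the draft describes amounts to something like this sketch (assuming, as the question above does, that allocations are at least 4-byte aligned; the helper names are invented, not PEP API):

    #include <stdint.h>

    #define KIND_MASK ((uintptr_t)0x3)

    /* strip the two type bits to recover the real pointer */
    static void *untag_ptr(void *str) {
        return (void *)((uintptr_t)str & ~KIND_MASK);
    }
    /* extract the two type bits: 00, 01, 10 or 11 */
    static int str_tag(void *str) {
        return (int)((uintptr_t)str & KIND_MASK);
    }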
The string is null-terminated (in its respective representation). - hash, state: same as in Python 3.2 - utf8_length, utf8: UTF-8 representation (null-terminated)
If this is to share its buffer with the "str" representation for the Latin-1 case, then I take it this ptr will typically be (str & ~4) ? i.e. only "str" has the low-order-bit type info.
- wstr_length, wstr: representation in platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs (in which case wstr_length differs from length).
All three representations are optional, although the str form is considered the canonical representation which can be absent only while the string is being created.
Spelling out the meaning of "optional": does this mean that the relevant ptr is NULL; if so, if utf8 is null, is utf8_length undefined, or is it some dummy value? (i.e. is the pointer the first thing to check before we know if utf8_length is meaningful?); similar consideration for the wstr representation.
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
The str and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient).
...though the ptrs are non-equal for this case, as noted above, as "str" has an 0x1 typecode.
The str and wstr pointers point to the same memory if the string happens to fit exactly to the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4).
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
Is the idea to do pointer arithmetic when deleting the PyUnicodeObject to determine if the ptr is in that location, and not delete it if it is, or is there some other way of determining whether the pointers need deallocating? If the former, is this embedding an assumption that the underlying allocator couldn't have allocated a buffer directly adjacent to the PyUnicodeObject? I know that GNU libc's malloc/free implementation has gaps of two machine words between each allocation; off the top of my head I'm not sure if the optimized Object/obmalloc.c allocator enforces such gaps.

... snip ...

Extra section:

GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython's data types, including PyUnicodeObject instances. It will need to be slightly updated to track the change. (I can do that change if need be; it shouldn't be too hard.)

Hope this is helpful
Dave
Repetition of "11"; I'm guessing that the 2byte/UCS-2 should read "10", so that they give the width of the char representation.
Thanks, fixed.
00 => null pointer
Naturally this assumes that all pointers are at least 4-byte aligned (so that they can be masked off). I assume that this is sane on every platform that Python supports, but should it be spelled out explicitly somewhere in the PEP?
I'll change the PEP to move the type indicator into the state field, so that issue becomes irrelevant.
The string is null-terminated (in its respective representation). - hash, state: same as in Python 3.2 - utf8_length, utf8: UTF-8 representation (null-terminated) If this is to share its buffer with the "str" representation for the Latin-1 case, then I take it this ptr will typically be (str & ~4) ? i.e. only "str" has the low-order-bit type info.
Yes, the other pointers are aligned. Notice that the case in which sharing occurs is only ASCII, though (for Latin-1, some characters require two bytes in UTF-8).
Spelling out the meaning of "optional": does this mean that the relevant ptr is NULL; if so, if utf8 is null, is utf8_length undefined, or is it some dummy value?
I've clarified this: I propose length is undefined (unless there is a good reason to clear it).
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
Is the idea to do pointer arithmetic when deleting the PyUnicodeObject to determine if the ptr is in that location, and not delete it if it is, or is there some other way of determining whether the pointers need deallocating?
Correct.
If the former, is this embedding an assumption that the underlying allocator couldn't have allocated a buffer directly adjacent to the PyUnicodeObject. I know that GNU libc's malloc/free implementation has gaps of two machine words between each allocation; off the top of my head I'm not sure if the optimized Object/obmalloc.c allocator enforces such gaps.
No, it doesn't... So I guess I reserve another bit in the state for that.
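A hypothetical sketch of what that state bit could look like at deallocation time (the flag name SSTATE_DATA_INLINE is invented here, not part of the PEP; the struct is the PEP's proposed PyUnicodeObject):

    /* Invented flag: marks character data allocated inline,
       right after the PyUnicodeObject struct. */
    #define SSTATE_DATA_INLINE 0x10

    static void free_str_data(PyUnicodeObject *u)
    {
        if (!(u->state & SSTATE_DATA_INLINE))
            PyObject_FREE(u->str);   /* separate block: release it */
        /* inline data is freed together with the object itself */
    }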
GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython's data types, including PyUnicodeObject instances. It will need to be slightly updated to track the change.
Thanks, added. Regards, Martin
On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis"
A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the utf8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new string object every time). API that implicitly converts a string to a char* (such as the ParseTuple functions) will use this function to compute a conversion.
I'm not entirely clear as to what "this function" is referring to here. I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready" might be a better option (PyType_Ready seems a better analogy for a "I've filled everything in, please calculate the derived fields now" than Py_Finalize).

More generally, let me see if I understand the proposed structure correctly:

str: Always set once PyUnicode_Ready() has been called. Always points to the canonical representation of the string (as indicated by PyUnicode_Kind)

length: Always set once PyUnicode_Ready() has been called. Specifies the number of code points in the string.

wstr: Set only if PyUnicode_AsUnicode has been called on the string. If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE) or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr = str, otherwise wstr points to dedicated memory

wstr_length: Valid only if wstr != NULL. If wstr_length != length, indicates presence of surrogate pairs in a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() == PyUnicode_4BYTE).

utf8: Set only if PyUnicode_AsUTF8 has been called on the string. If string contents are pure ASCII, utf8 = str, otherwise utf8 points to dedicated memory.

utf8_length: Valid only if utf8_ptr != NULL

One change I would propose is that rather than hiding flags in the low order bits of the str pointer, we expand the use of the existing "state" field to cover the representation information in addition to the interning information. I would also suggest explicitly flagging internally whether or not a 1 byte string is ASCII or Latin-1 along the lines of:

    /* Already existing string state constants */
    #define SSTATE_NOT_INTERNED 0x00
    #define SSTATE_INTERNED_MORTAL 0x01
    #define SSTATE_INTERNED_IMMORTAL 0x02
    /* New string state constants */
    #define SSTATE_INTERN_MASK 0x03
    #define SSTATE_KIND_ASCII 0x00
    #define SSTATE_KIND_LATIN1 0x04
    #define SSTATE_KIND_2BYTE 0x08
    #define SSTATE_KIND_4BYTE 0x0C
    #define SSTATE_KIND_MASK 0x0C

PyUnicode_Kind would then return PyUnicode_1BYTE for strings that were flagged internally as either ASCII or LATIN1.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
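Under that scheme, the kind lookup might look like this sketch (it uses the constants above together with the draft PEP's kind values; sketch_PyUnicode_Kind is an invented name, not a real API):

    /* Sketch: map the internal state flags to the PEP's kinds.
       Both ASCII and Latin-1 report one byte per character. */
    static int sketch_PyUnicode_Kind(int state) {
        switch (state & SSTATE_KIND_MASK) {
        case SSTATE_KIND_ASCII:
        case SSTATE_KIND_LATIN1:
            return PyUnicode_1BYTE;
        case SSTATE_KIND_2BYTE:
            return PyUnicode_2BYTE;
        default:
            return PyUnicode_4BYTE;
        }
    }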
On Tue, 25 Jan 2011 21:08:01 +1000
Nick Coghlan
One change I would propose is that rather than hiding flags in the low order bits of the str pointer, we expand the use of the existing "state" field to cover the representation information in addition to the interning information.
+1, by the way. The "state" field has many bits available (even if we decide to make it a char rather than an int). Regards Antoine.
Am 25.01.2011 12:08, schrieb Nick Coghlan:
On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis"
wrote: A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the utf8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new string object every time). API that implicitly converts a string to a char* (such as the ParseTuple functions) will use this function to compute a conversion.
I'm not entirely clear as to what "this function" is referring to here.
PyUnicode_AsUTF8 (i.e. the one where you don't need to release the memory). I made this explicit now.
I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready" might be a better option (PyType_Ready seems a better analogy for a "I've filled everything in, please calculate the derived fields now" than Py_Finalize).
Ok, changed (when I was pondering about this PEP, this once occurred to me also, but I forgot when I typed it in).
More generally, let me see if I understand the proposed structure correctly:
str: Always set once PyUnicode_Ready() has been called. Always points to the canonical representation of the string (as indicated by PyUnicode_Kind)

length: Always set once PyUnicode_Ready() has been called. Specifies the number of code points in the string.
Correct.
wstr: Set only if PyUnicode_AsUnicode has been called on the string.
Might also be set when the string was created through PyUnicode_FromUnicode, and PyUnicode_Ready hasn't been called.
If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE) or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr = str, otherwise wstr points to dedicated memory

wstr_length: Valid only if wstr != NULL. If wstr_length != length, indicates presence of surrogate pairs in a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() == PyUnicode_4BYTE).
Correct.
utf8: Set only if PyUnicode_AsUTF8 has been called on the string. If string contents are pure ASCII, utf8 = str, otherwise utf8 points to dedicated memory.

utf8_length: Valid only if utf8_ptr != NULL
Correct.
One change I would propose is that rather than hiding flags in the low order bits of the str pointer, we expand the use of the existing "state" field to cover the representation information in addition to the interning information.
Thanks for the idea; done.
I would also suggest explicitly flagging internally whether or not a 1 byte string is ASCII or Latin-1 along the lines of:
Not sure about that. It would complicate PyUnicode_Kind. Instead, I'd rather fill out utf8 right away if we can use sharing (e.g. when the string is created with a max value <128, or PyUnicode_Ready has determined that). So I keep it for the moment as reserved (but would use it when str is NULL, as I'd have to fill in some value, anyway). Regards, Martin
I'll comment more on this later this week...
From my first impression, I'm not too thrilled by the prospect of making the Unicode implementation more complicated by having three different representations on each object.
I also don't see how this could save a lot of memory. As an example take a French text with say 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used). That's a saving of -10MB compared to today's implementation :-) "Martin v. Löwis" wrote:
I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applications to determine that Unicode strings are among the top consumers of memory in Python). On the other hand, this was triggered by the discussion on supporting surrogates in the library better.
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
For convenience, I include it below.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 25 2011)
For the record:
I also don't see how this could save a lot of memory. As an example take a French text with say 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used).
Typical French text seems to have 5% non-ASCII characters. So the number of UTF-8 bytes needed to represent a French text would only be 5% higher than the number of code points. Anyway, it's quite obvious that Martin's goal is that only one representation gets created most of the time. To quote the draft: “All three representations are optional, although the str form is considered the canonical representation which can be absent only while the string is being created.” Regards Antoine.
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg
I also don't see how this could save a lot of memory. As an example take a French text with say 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used). That's a saving of -10MB compared to today's implementation :-)
If I am reading the PEP right, which I may not be as I am no expert on unicode, the new implementation would actually give a 10MB saving since the wchar field is optional, so only the str (Latin-1) and utf8 fields would need to be stored. How it decides not to store one field or another would need to be clarified in the PEP if I am right.
On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg
wrote: I also don't see how this could save a lot of memory. As an example take a French text with say 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used). That's a saving of -10MB compared to today's implementation :-)
If I am reading the PEP right, which I may not be as I am no expert on unicode, the new implementation would actually give a 10MB saving since the wchar field is optional, so only the str (Latin-1) and utf8 fields would need to be stored. How it decides not to store one field or another would need to be clarified in the PEP if I am right.
The PEP actually does define that already: PyUnicode_AsUTF8 populates the utf8 field of the existing string, while PyUnicode_AsUTF8String creates a *new* string with that field populated. PyUnicode_AsUnicode will populate the wstr field (but doing so generally shouldn't be necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a 100 code point string as follows:

Current size: 400 bytes (regardless of max code point)
New initial size (max code point < 256): 100 bytes (75% saving)
New initial size (max code point < 65536): 200 bytes (50% saving)
New initial size (max code point >= 65536): 400 bytes (no saving)

For each of the "new" strings, they may consume additional storage if the utf8 or wstr fields get populated. The maximum possible size would be a UCS4 string (max code point >= 65536) on a sizeof(wchar_t) == 2 system with the utf8 string populated. In such cases, you would consume at least 700 bytes, plus whatever additional memory is needed to encode the non-BMP characters into UTF-8 and UTF-16.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
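The arithmetic behind those numbers can be sketched as follows (character buffer only; struct overhead and the null terminator are ignored, as in the estimate above; the helper name is invented):

    #include <stddef.h>

    /* Sketch: bytes needed for the canonical character buffer */
    static size_t payload_bytes(size_t length, unsigned long maxchar) {
        int width = maxchar < 256 ? 1 : maxchar < 65536 ? 2 : 4;
        return length * width;   /* 100 chars, maxchar < 256 -> 100 bytes */
    }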
On 26 January 2011 12:30, Nick Coghlan
The PEP actually does define that already:
PyUnicode_AsUTF8 populates the utf8 field of the existing string, while PyUnicode_AsUTF8String creates a *new* string with that field populated.
PyUnicode_AsUnicode will populate the wstr field (but doing so generally shouldn't be necessary).
AIUI, another point is that the PEP deprecates the use of the calls that populate the utf8 and wstr fields, in favour of the calls that expect the caller to manage the extra memory (PyUnicode_AsUTF8String rather than PyUnicode_AsUTF8, ??? rather than PyUnicode_AsUnicode). So in the long term, the extra fields should never be populated - although this could take some time as extensions have to be recoded. Ultimately, the extra fields and older APIs could even be removed. So any space cost (which I concede could be non-trivial in some cases) is expected to be short-term. Paul.
From my first impression, I'm not too thrilled by the prospect of making the Unicode implementation more complicated by having three different representations on each object.
Thanks, added as a concern.
I also don't see how this could save a lot of memory. As an example take a French text with say 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used). That's a saving of -10MB compared to today's implementation :-)
As others have pointed out: that's not how it works. It actually *will* save memory, since the alternative representations are optional. Regards, Martin
BTW, has anyone looked at what other languages with a native unicode type do for their implementations, and whether any of them attempt to conserve RAM?
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code and Cython relies on it for fast text processing. This proposal will therefore likely have a pretty negative performance impact on extensions written in Cython as the compiler could no longer expect this representation to be available instantaneously. Stefan
On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code and Cython relies on it for fast text processing. This proposal will therefore likely have a pretty negative performance impact on extensions written in Cython as the compiler could no longer expect this representation to be available instantaneously.
But the whole point of the exercise is so that it doesn't have to store a 4byte-per-char representation when a 1byte-per-char rep would do. If cython wants to work most efficiently with this proposal, it should learn to deal with the three possible raw representations. James
On 1/27/2011 12:26 PM, James Y Knight wrote:
On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code and Cython relies on it for fast text processing. This proposal will therefore likely have a pretty negative performance impact on extensions written in Cython as the compiler could no longer expect this representation to be available instantaneously. But the whole point of the exercise is so that it doesn't have to store a 4byte-per-char representation when a 1byte-per-char rep would do. If cython wants to work most efficiently with this proposal, it should learn to deal with the three possible raw representations.
C was doing fast text processing on char long before Py_UNICODE existed, or wchar_t.
James Y Knight, 27.01.2011 21:26:
On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code and Cython relies on it for fast text processing. This proposal will therefore likely have a pretty negative performance impact on extensions written in Cython as the compiler could no longer expect this representation to be available instantaneously.
But the whole point of the exercise is so that it doesn't have to store a 4byte-per-char representation when a 1byte-per-char rep would do.
I am well aware of that. But I'm arguing that the current simpler internal representation has had its advantages for CPython as a platform.
If cython wants to work most efficiently with this proposal, it should learn to deal with the three possible raw representations.
I agree. After all, CPython is lucky to have it available. It wouldn't be the first time that we duplicate looping code based on the input type. However, like the looping code, it will also complicate all indexing code at runtime as it always needs to test which of the representations is current before it can read a character. Currently, all of this is a compile time decision. This will necessarily have a performance impact. Stefan
I agree. After all, CPython is lucky to have it available. It wouldn't be the first time that we duplicate looping code based on the input type. However, like the looping code, it will also complicate all indexing code at runtime as it always needs to test which of the representations is current before it can read a character. Currently, all of this is a compile time decision. This will necessarily have a performance impact.
That's most certainly the case. That's one of the reasons to discuss this through a PEP, rather than just coming up with a patch: if people object to it too much because of the impact on execution speed, it may get rejected. Of course, that would make those unhappy who complain about the memory consumption. This is a classical time-space tradeoff, favoring space reduction over time reduction. I fully understand that the actual impact can only be observed when an implementation is available, and applications have made a reasonable effort to work with the implementation efficiently (or perhaps not, which would show the impact on unmodified implementations). This is something that works much better in PyPy: the actual string operations are written in RPython, and the tracing JIT would generate all versions of the code that are relevant for the different representations (IIUC, this approach is still only planned for PyPy). I hope that C macros can help reduce the maintenance burden. Regards, Martin
Am 27.01.2011 20:06, schrieb Stefan Behnel:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code and Cython relies on it for fast text processing.
That's not true. The str representation can also be used efficiently from C.
This proposal will therefore likely have a pretty negative performance impact on extensions written in Cython as the compiler could no longer expect this representation to be available instantaneously.
In any case, I've added this concern. Regards, Martin
* Stefan Behnel:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code
Is this really true? I don't think I've seen any C API which actually
uses wchar_t, beyond that what is provided by libc. UTF-8 and even
UTF-16 are much, much more common.
--
Florian Weimer
Florian Weimer, 28.01.2011 10:35:
* Stefan Behnel:
"Martin v. Löwis", 24.01.2011 21:17:
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
It's too bad this isn't initialised by default, though. Py_UNICODE is the only representation that can be used efficiently from C code
Is this really true? I don't think I've seen any C API which actually uses wchar_t, beyond that what is provided by libc. UTF-8 and even UTF-16 are much, much more common.
They are also much harder to use, unless you are really only interested in 7-bit ASCII data - which is the case for most C libraries, so I believe that's what you meant here. However, this is the CPython runtime with built-in Unicode support, not the C runtime where it comes as an add-on at best, and where Unicode processing without being Unicode aware is common.

The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this:

    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False

The loop runs in plain C, using the somewhat obvious implementation with a loop over Py_UNICODE characters and a switch statement for the comparison. This would look a *lot* more ugly with UTF-8 encoded byte strings.

Regarding Cython specifically, the above will still be *possible* under the proposal, given that the memory layout of the strings will still represent the Unicode code points. It will just be trickier to implement in Cython's type system as there is no longer a (user visible) C type representation for those code units. It can be any of uchar, ushort16 or uint32, neither of which is necessarily a 'native' representation of a Unicode character in CPython.

While I'm somewhat confident that I'll find a way to fix this in Cython, my point is just that this adds a certain level of complexity to C code using the new memory layout that simply wasn't there before.

Stefan
* Stefan Behnel:
The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this:
    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False
The loop runs in plain C, using the somewhat obvious implementation with a loop over Py_UNICODE characters and a switch statement for the comparison. This would look a *lot* more ugly with UTF-8 encoded byte strings.
Not really, because UTF-8 is quite search-friendly. (The if would
have to invoke a memmem()-like primitive.) Random subscripts are
problematic.
However, why would one want to write loops like the above? Don't you
have to take combining characters (comprising multiple codepoints)
into account most of the time when you look at individual characters?
Then UTF-32 does not offer much of a simplification.
--
Florian Weimer
Florian Weimer, 28.01.2011 15:27:
* Stefan Behnel:
The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this:
    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False
The loop runs in plain C, using the somewhat obvious implementation with a loop over Py_UNICODE characters and a switch statement for the comparison. This would look a *lot* more ugly with UTF-8 encoded byte strings.
Not really, because UTF-8 is quite search-friendly. (The if would have to invoke a memmem()-like primitive.) Random subscripts are problematic.
However, why would one want to write loops like the above? Don't you have to take combining characters (comprising multiple codepoints) into account most of the time when you look at individual characters? Then UTF-32 does not offer much of a simplification.
Hmm, I think this discussion is pointless. Regardless of the memory layout, you can always go down to the byte level and use an efficient (multi-)substring search algorithm. (which is obviously helped if you know the layout at compile time *wink*) Bad example, I guess. Stefan
The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this:
    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False
The loop runs in plain C, using the somewhat obvious implementation with a loop over Py_UNICODE characters and a switch statement for the comparison. This would look a *lot* more ugly with UTF-8 encoded byte strings.
And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8 representation for such a loop. Instead, it should access the str representation, and might compile this to code like

    #define Cython_CharAt(data, kind, pos) \
        (kind == LATIN1 ? ((unsigned char*)data)[pos] : \
         kind == UCS2 ? ((unsigned short*)data)[pos] : \
                        ((Py_UCS4*)data)[pos])

    void *data = PyUnicode_Data(s);
    int kind = PyUnicode_Kind(s);
    for (int pos = 0; pos < PyUnicode_Size(s); pos++) {
        Py_UCS4 c = Cython_CharAt(data, kind, pos);
        Py_UCS4 tmp[] = {0x356, 0x1012, 0x3359, 0x4567};
        for (int k = 0; k < 4; k++)
            if (c == tmp[k])
                return 1;
    }
    return 0;
Regarding Cython specifically, the above will still be *possible* under the proposal, given that the memory layout of the strings will still represent the Unicode code points. It will just be trickier to implement in Cython's type system as there is no longer a (user visible) C type representation for those code units.
There is: Py_UCS4 remains available.
It can be any of uchar, ushort16 or uint32, neither of which is necessarily a 'native' representation of a Unicode character in CPython.
There won't be a "native" representation anymore - that's the whole point of the PEP.
While I'm somewhat confident that I'll find a way to fix this in Cython, my point is just that this adds a certain level of complexity to C code using the new memory layout that simply wasn't there before.
Understood. However, I think it is easier than you think it is. Regards, Martin
"Martin v. Löwis", 28.01.2011 22:49:
And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8 representation for such a loop. Instead, it should access the str representation
Sure.
Regarding Cython specifically, the above will still be *possible* under the proposal, given that the memory layout of the strings will still represent the Unicode code points. It will just be trickier to implement in Cython's type system as there is no longer a (user visible) C type representation for those code units.
There is: Py_UCS4 remains available.
Thanks for that pointer. I had always thought that all "*UCS4*" names were platform specific and had completely missed that type. It's a lot nicer than Py_UNICODE because it allows users to fold surrogate pairs back into the character value. It's completely missing from the docs, BTW. Google doesn't give me a single mention for all of docs.python.org, even though it existed at least since (and likely long before) Cython's oldest supported runtime Python 2.3. If I had known about that type earlier, I could have ended up making that the native Unicode character type in Cython instead of bothering with Py_UNICODE. But this can still be changed I think. Since type inference was available before native Py_UNICODE support, it's unlikely that users will have Py_UNICODE written in their code explicitly. So I can make the switch under the hood. Just to explain, a native CPython C type is much better than an arbitrary integer type, because it allows Cython to apply specific coercion rules from and to Python object types. As currently Py_UNICODE, Py_UCS4 would obviously coerce from and to a 1 character Unicode string, but it could additionally handle surrogate pair splitting and combining automatically on current 16-bit Unicode builds so that you'd get a Unicode string with two code points on coercion to Python.
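For reference, the "combining" mentioned here is the standard UTF-16 surrogate fold, e.g. (a minimal sketch; the function name is invented):

    /* Fold a UTF-16 surrogate pair into one code point; assumes
       0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF. */
    static Py_UCS4 fold_surrogates(unsigned hi, unsigned lo) {
        return 0x10000 + (((Py_UCS4)(hi - 0xD800) << 10) | (lo - 0xDC00));
    }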
While I'm somewhat confident that I'll find a way to fix this in Cython, my point is just that this adds a certain level of complexity to C code using the new memory layout that simply wasn't there before.
Understood. However, I think it is easier than you think it is.
Let's see about the implications once there is an implementation. Stefan
"Martin v. Löwis", 24.01.2011 21:17:
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
Does this mean it's supposed to become a PyVarObject? Antoine proposed that, too. Apart from breaking (more or less) all existing C subtyping code, this will also make it harder to subtype it in new code. I don't like that idea at all. Stefan
Am 27.01.2011 23:53, schrieb Stefan Behnel:
"Martin v. Löwis", 24.01.2011 21:17:
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
Does this mean it's supposed to become a PyVarObject?
What do you mean by "become"? Will it be declared as such? No.
Antoine proposed that, too. Apart from breaking (more or less) all existing C subtyping code, this will also make it harder to subtype it in new code. I don't like that idea at all.
Why will it break all existing subtyping code? See the PEP: Only objects created through PyUnicode_New will be affected - I don't think this can include objects of a subtype. Regards, Martin
"Martin v. Löwis", 28.01.2011 01:02:
Am 27.01.2011 23:53, schrieb Stefan Behnel:
"Martin v. Löwis", 24.01.2011 21:17:
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
Does this mean it's supposed to become a PyVarObject?
What do you mean by "become"? Will it be declared as such? No.
Antoine proposed that, too. Apart from breaking (more or less) all existing C subtyping code, this will also make it harder to subtype it in new code. I don't like that idea at all.
Why will it break all existing subtyping code? See the PEP: Only objects created through PyUnicode_New will be affected - I don't think this can include objects of a subtype.
Ok, that's fine then. Stefan
Pardon me for this drive-by posting, but this thread smells a lot like this
old thread (don't be afraid to read it all, there are some good points in
there; not directed at you Martin, but at all readers/posters in this
thread)...
http://mail.python.org/pipermail/python-3000/2006-September/003795.html
I'm
not averse to faster and/or more memory efficient unicode representations (I
would be quite happy with them, actually). I do see the usefulness of having
non-utf-8 representations, and caching them is a good idea, though I wonder
if that is a "good for Python itself to cache", or "good for the application
to cache".
The evil side of me says that we should just provide an API available in
Python/C for "give me the representation of unicode string X using the
2byte/4byte code points", and have it just return the appropriate
array.array() value (useful for passing to other APIs, or for those who need
to do manual manipulation of code-points), or whatever structure is deemed
to be appropriate.
The less evil side of me says that going with what the PEP offers isn't a
bad idea, and might just be a good idea.
I'll defer my vote to Martin.
Regards,
- Josiah
On Mon, Jan 24, 2011 at 12:17 PM, "Martin v. Löwis"
I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applications to determine that Unicode strings are among the top consumers of memory in Python). On the other hand, this was triggered by the discussion on supporting surrogates in the library better.
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
For convenience, I include it below.
Regards, Martin
PEP: 393
Title: Flexible String Representation
Version: $Revision: 88168 $
Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
Author: Martin v. Löwis
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========
The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out.
Rationale
=========
There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).
One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code.
For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced.
Specification
=============
The Unicode object structure is changed to this definition::
    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        void *str;
        Py_hash_t hash;
        int state;
        Py_ssize_t utf8_length;
        void *utf8;
        Py_ssize_t wstr_length;
        void *wstr;
    } PyUnicodeObject;
These fields have the following interpretations:
- length: number of code points in the string (result of sq_length)

- str: shortest-form representation of the unicode string; the lower two bits of the pointer indicate the specific form: 01 => 1 byte (Latin-1); 11 => 2 byte (UCS-2); 11 => 4 byte (UCS-4); 00 => null pointer
The string is null-terminated (in its respective representation).

- hash, state: same as in Python 3.2

- utf8_length, utf8: UTF-8 representation (null-terminated)

- wstr_length, wstr: representation in platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs (in which case wstr_length differs from length).
All three representations are optional, although the str form is considered the canonical representation which can be absent only while the string is being created.
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
The str and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). The str and wstr pointers point to the same memory if the string happens to fit exactly to the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4).
If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct.
String Creation
---------------
The recommended way to create a Unicode object is to use the function PyUnicode_New::
PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);
Both parameters must denote the eventual size/range of the strings. In particular, codecs using this API must compute both the number of characters and the maximum character in advance. A string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized.
PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported for processing UTF-8 input; the input is decoded, and the UTF-8 representation is not yet set for the string.
PyUnicode_FromUnicode remains supported but is deprecated. If the Py_UNICODE pointer is non-null, the str representation is set. If the pointer is NULL, a properly-sized wstr representation is allocated, which can be modified until PyUnicode_Finalize() is called (explicitly or implicitly). Resizing a Unicode string remains possible until it is finalized.
PyUnicode_Finalize() converts a string containing only a wstr representation into the canonical representation. Unless wstr and str can share the memory, the wstr representation is discarded after the conversion.
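As an illustration of the two-pass pattern this implies for codecs, here is a sketch of a Latin-1 decoder under the proposed API (sketch only, not an actual CPython function; it uses PyUnicode_New from above and PyUnicode_Data from the next section):

    /* Sketch: scan once for length/range, then allocate and fill.
       Latin-1 input guarantees maxchar < 256, so PyUnicode_New will
       pick the 1-byte form and plain byte writes are correct. */
    static PyObject *
    sketch_decode_latin1(const unsigned char *s, Py_ssize_t size)
    {
        Py_ssize_t i;
        Py_UCS4 maxchar = 0;
        for (i = 0; i < size; i++)          /* pass 1: find the range */
            if (s[i] > maxchar)
                maxchar = s[i];
        PyObject *u = PyUnicode_New(size, maxchar);
        if (u == NULL)
            return NULL;
        unsigned char *data = (unsigned char *)PyUnicode_Data(u);
        for (i = 0; i < size; i++)          /* pass 2: fill the buffer */
            data[i] = s[i];
        return u;
    }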
String Access
-------------
The canonical representation can be accessed using two macros PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE (3). PyUnicode_Data gives the void pointer to the data, masking out the pointer kind. All these functions call PyUnicode_Finalize in case the canonical representation hasn't been computed yet.
A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the utf8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new string object every time). API that implicitly converts a string to a char* (such as the ParseTuple functions) will use this function to compute a conversion.
PyUnicode_AsUnicode is deprecated; it computes the wstr representation on first use.
String Operations
-----------------
Various convenience functions will be provided to deal with the canonical representation, in particular with respect to concatenation and slicing.
Stable ABI
----------
None of the functions in this PEP become part of the stable ABI.
Copyright
=========
This document has been placed in the public domain.
"Martin v. Löwis", 24.01.2011 21:17:
[...]
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
[...]
Stable ABI
----------
None of the functions in this PEP become part of the stable ABI.
I think that's only part of the truth. This PEP can potentially have an impact on the stable ABI in the sense that the build-time size of Py_UNICODE may no longer be important for extensions that work on unicode buffers in the future as long as they only use the 'str' pointer and not 'wstr'. Stefan
None of the functions in this PEP become part of the stable ABI.
I think that's only part of the truth. This PEP can potentially have an impact on the stable ABI in the sense that the build-time size of Py_UNICODE may no longer be important for extensions that work on unicode buffers in the future as long as they only use the 'str' pointer and not 'wstr'.
Py_UNICODE isn't part of the stable ABI, so it wasn't important for extensions using the stable ABI before - so really no change here. Regards, Martin
"Martin v. Löwis", 29.01.2011 10:05:
None of the functions in this PEP become part of the stable ABI.
I think that's only part of the truth. This PEP can potentially have an impact on the stable ABI in the sense that the build-time size of Py_UNICODE may no longer be important for extensions that work on unicode buffers in the future as long as they only use the 'str' pointer and not 'wstr'.
Py_UNICODE isn't part of the stable ABI, so it wasn't important for extensions using the stable ABI before - so really no change here.
I know, that's not what I meant. But this PEP would enable a C API that provides direct access to the underlying buffer. Just as is currently provided for the Py_UNICODE array, but with a stable ABI because the buffer type won't change based on build time options. OTOH, one could argue that this is already partly provided by the generic buffer API. Stefan
On Sat, Jan 29, 2011 at 8:00 PM, Stefan Behnel
OTOH, one could argue that this is already partly provided by the generic buffer API.
Which won't be part of the stable ABI until 3.3 - there are some discrepancies between PEP 3118 and the actual implementation that we need to sort out first. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, 29 Jan 2011 11:00:48 +0100
Stefan Behnel
I know, that's not what I meant. But this PEP would enable a C API that provides direct access to the underlying buffer. Just as is currently provided for the Py_UNICODE array, but with a stable ABI because the buffer type won't change based on build time options.
OTOH, one could argue that this is already partly provided by the generic buffer API.
Unicode objects don't provide the buffer API (and chances are they never will). Regards Antoine.
"Martin v. Löwis", 24.01.2011 21:17:
[...]
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
After much discussion, I'm +1 for this PEP. Implementation and benchmarks are pending, but there are strong indicators that it will bring relief for the memory overhead of most applications without a major performance degradation. Not for Python code anyway, and I'll try to make sure Cython extensions won't notice much when switching to CPython 3.3. Martin, this is a smart way of doing it. Stefan
"Martin v. Löwis", 24.01.2011 21:17:
[...]
You'll find the PEP at
http://www.python.org/dev/peps/pep-0393/
[...]
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
What about the character property functions? http://docs.python.org/py3k/c-api/unicode.html#unicode-character-properties Will they be adapted to accept Py_UCS4 instead of Py_UNICODE? Stefan
On Sat, Jan 29, 2011 at 12:03 PM, Stefan Behnel
What about the character property functions?
http://docs.python.org/py3k/c-api/unicode.html#unicode-character-properties
Will they be adapted to accept Py_UCS4 instead of Py_UNICODE?
They have been already. See revision 84177. Docs should be fixed.
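For reference, a small illustrative example (not from the thread) of what the Py_UCS4-based signatures allow::

    /* Py_UNICODE_ISALPHA is an existing property macro; with Py_UCS4
       arguments, non-BMP characters can be classified regardless of the
       build-time size of Py_UNICODE. */
    Py_UCS4 ch = 0x10400;   /* DESERET CAPITAL LETTER LONG I */
    if (Py_UNICODE_ISALPHA(ch)) {
        /* ... */
    }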
participants (14)
- "Martin v. Löwis"
- Alexander Belopolsky
- Antoine Pitrou
- David Malcolm
- Dj Gilcrease
- Florian Weimer
- Glenn Linderman
- Gregory P. Smith
- James Y Knight
- Josiah Carlson
- M.-A. Lemburg
- Nick Coghlan
- Paul Moore
- Stefan Behnel