Guido has agreed to eventually pronounce on PEP 393. Before that can happen, I'd like to collect feedback on it. There have been a number of voices supporting the PEP in principle, so I'm now interested in comments in the following areas:
- objections in principle. I'll list them in the PEP.
- issues to be considered (unclarities, bugs, limitations, ...)
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.
Regards, Martin
On Wed, 24 Aug 2011 20:15:24 +0200
"Martin v. Löwis"
- issues to be considered (unclarities, bugs, limitations, ...)
With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
Are there any plans to make instantiation of small strings fast enough? Or is it already as fast as it should be?
When interfacing with the Win32 "wide" APIs, what is the recommended way to get the required LPCWSTR?
Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?
Do you think the wstr representation could be removed in some future version of Python?
Is PyUnicode_Ready() necessary for all unicode objects, or only those allocated through the legacy API?
“The Py_Unicode representation is not instantaneously available”: you mean the Py_UNICODE representation?
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.
That it doesn't significantly slow down benchmarks such as stringbench and iobench. Regards Antoine.
With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    /* no more utf8_length, utf8, str */
    /* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)

=> "a" is 58 bytes (with utf8 for free, without wchar_t)

For objects allocated with the new API, we can use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    /* no more str pointer */
    /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=> "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_UNICODE *str;
    Py_hash_t hash;
    int state;
    PyObject *defenc;
} PyUnicodeObject;

=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.
Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
Do you think the wstr representation could be removed in some future version of Python?
Conversion to wchar_t* is common, especially on Windows. But I don't know if we *have to* cache the result. Is it cached by the way? Or is wstr only used when a string is created from Py_UNICODE? Victor
Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    /* no more utf8_length, utf8, str */
    /* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)
=> "a" is 58 bytes (with utf8 for free, without wchar_t)
For objects allocated with the new API, we can use a shorter struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    /* no more str pointer */
    /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)
=> "é" is 74 bytes (without utf8 / wchar_t)
For the legacy API:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)
=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
The current struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_UNICODE *str;
    Py_hash_t hash;
    int state;
    PyObject *defenc;
} PyUnicodeObject;
=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is wchar_t)
... but the code (maybe only the macros?) and debugging will be more complex.
That's an interesting idea. However, it's not required to do this as part of the PEP 393 implementation. This can be added later on if the need evidently arises in general practice. Also, there is always the possibility of simply interning very short strings in order to avoid their multiplication in memory. Long strings don't suffer from this as the data size quickly dominates. User code that works with a lot of short strings would likely do the same. BTW, I would expect that many short strings either go away as quickly as they appeared (e.g. in a parser) or were brought in as literals and are therefore interned anyway. That's just one reason why I suggest waiting for proof of inefficiency in the real world (and, obviously, testing your own code with this as quickly as possible).
Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
Well, it will be quite inefficient in future CPython versions, so I think if it's not officially deprecated at some point, it will deprecate itself for efficiency reasons. Better make it clear that it's worth investing in better performance here.
Do you think the wstr representation could be removed in some future version of Python?
Conversion to wchar_t* is common, especially on Windows.
That's an issue. However, I cannot say how common this really is in practice. Surely depends on the specific code, right? How common is it in core CPython?
But I don't know if we *have to* cache the result. Is it cached by the way? Or is wstr only used when a string is created from Py_UNICODE?
If it's so common on Windows, maybe it should only be cached there? Stefan
On 25/08/2011 06:46, Stefan Behnel wrote:
Conversion to wchar_t* is common, especially on Windows.
That's an issue. However, I cannot say how common this really is in practice. Surely depends on the specific code, right? How common is it in core CPython?
Nearly all functions taking text as an argument on Windows expect wchar_t* strings (UTF-16). In Python, we pass a "Py_UNICODE*" (PyUnicode_AS_UNICODE or PyUnicode_AsUnicode) because Py_UNICODE is wchar_t on Windows. Victor
With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
If I count correctly, it's only three *additional* words (compared to 3.2): four new ones, minus one that is removed. In addition, it drops a memory block. Assuming a malloc overhead of two pointers per malloc block, we get one additional pointer.

On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length 1 (+NUL) will take the same memory either way: 8 bytes for the characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings of 2 or more characters will take more space in 3.2.

On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up to 3 characters take the same space either way; space savings start at four characters.

On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum block size of 16 bytes, pure-ASCII strings of up to 7 characters take the same space. For 8 characters, 3.2 will need 32 bytes for the characters, whereas 3.3 will only take 16 bytes (due to padding).

So: no, I can't see any adverse effects. Details depend on the malloc implementation, though. A slight memory increase compared to a narrow build may occur for strings that use non-Latin-1 characters, and a large increase for strings that use non-BMP characters.

The real issue of memory consumption is the alternative representations, if created. That applies to the default encoding in 3.2 as well as the wchar_t and UTF-8 representations in 3.3.
Are there any plans to make instantiation of small strings fast enough? Or is it already as fast as it should be?
I don't have any plans, and I don't see potential. Compared to 3.2, it saves a malloc call, which may be quite an improvement. OTOH, it needs to iterate over the characters twice, to find the largest character. If you are referring to the reuse of Unicode objects: that's currently not done, and is difficult to do in the 3.2 way due to the various sizes of characters. One idea might be to only reuse UCS1 strings, and then keep a freelist for these based on the string length.
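(To make the cost concrete, here is a minimal sketch of such a "find the widest character" scan - purely illustrative, assuming UCS-4 input; it is not the code actually used in the pep-393 branch:)

#include "Python.h"

/* Sketch: determine the highest code point in a buffer, which a PEP 393
   constructor has to do before it can pick the 1/2/4-byte representation. */
static Py_UCS4
find_maxchar(const Py_UCS4 *buf, Py_ssize_t len)
{
    Py_UCS4 maxchar = 0;
    Py_ssize_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] > maxchar) {
            maxchar = buf[i];
            if (maxchar > 0xFFFF)
                break;   /* already needs the widest (UCS-4) kind */
        }
    }
    return maxchar;
}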
When interfacing with the Win32 "wide" APIs, what is the recommended way to get the required LPCWSTR?
As before: PyUnicode_AsUnicode.
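(For illustration, a hedged sketch of what that looks like on the calling side; SetWindowTextW just stands in for any wide API, and the error handling is minimal:)

#include "Python.h"
#ifdef MS_WINDOWS
#include <windows.h>

/* Sketch: pass a Python str to a Win32 "wide" API.  PyUnicode_AsUnicode
   creates (and caches) the wstr representation on demand, so it can fail. */
static int
set_window_title(HWND hwnd, PyObject *title)
{
    wchar_t *w = PyUnicode_AsUnicode(title);  /* Py_UNICODE is wchar_t here */
    if (w == NULL)
        return -1;            /* e.g. MemoryError; exception already set */
    return SetWindowTextW(hwnd, w) ? 0 : -1;
}
#endif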
Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?
Not for 3.3, no.
Do you think the wstr representation could be removed in some future version of Python?
Yes. This probably has to wait for Python 4, though.
Is PyUnicode_Ready() necessary for all unicode objects, or only those allocated through the legacy API?
Only for the latter (although it doesn't hurt to apply it to all of them).
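(A small sketch of the distinction - hedged: the names follow the PEP draft, and the exact spelling of the ready macro/function may differ in the branch:)

#include "Python.h"

static int
ready_example(void)
{
    PyObject *a = PyUnicode_New(3, 127);           /* new API: created "ready" */
    PyObject *b = PyUnicode_FromUnicode(NULL, 3);  /* legacy API: only wstr */
    if (a == NULL || b == NULL) {
        Py_XDECREF(a);
        Py_XDECREF(b);
        return -1;
    }
    /* Fill b's Py_UNICODE buffer the legacy way ... */
    PyUnicode_AS_UNICODE(b)[0] = 'f';
    PyUnicode_AS_UNICODE(b)[1] = 'o';
    PyUnicode_AS_UNICODE(b)[2] = 'o';
    /* ... then ready it before using the new kind/data macros.  For a,
       readying would be a cheap no-op. */
    if (PyUnicode_READY(b) < 0) {
        Py_DECREF(a);
        Py_DECREF(b);
        return -1;             /* conversion failed, e.g. MemoryError */
    }
    Py_DECREF(a);
    Py_DECREF(b);
    return 0;
}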
“The Py_Unicode representation is not instantaneously available”: you mean the Py_UNICODE representation?
Thanks, fixed.
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.
That it doesn't significantly slow down benchmarks such as stringbench and iobench.
Can you please quantify "significantly"? Also, having a complete list of benchmarks to perform prior to acceptance would be helpful. Thanks, Martin
Hello,
On Thu, 25 Aug 2011 10:24:39 +0200
"Martin v. Löwis"
On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length 1 (+NUL) will take the same memory either way: 8 bytes for the characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings of 2 or more characters will take more space in 3.2.
On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up to 3 characters take the same space either way; space savings start at four characters.
On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum block size of 16 bytes, pure-ASCII strings of up to 7 characters take the same space. For 8 characters, 3.2 will need 32 bytes for the characters, whereas 3.3 will only take 16 bytes (due to padding).
That's very good. For future reference, could you add this information to the PEP?
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.
That it doesn't significantly slow down benchmarks such as stringbench and iobench.
Can you please quantify "significantly"? Also, having a complete list of benchmarks to perform prior to acceptance would be helpful.
I would say no more than a 15% slowdown on each of the following benchmarks:
- stringbench.py -u (http://svn.python.org/view/sandbox/trunk/stringbench/)
- iobench.py -t (in Tools/iobench/)
- the json_dump, json_load and regex_v8 tests from http://hg.python.org/benchmarks/
I believe these are representative of string-heavy operations. Additionally, it would be nice if you could run at least some of the test_bigmem tests, according to your system's available RAM. Regards Antoine.
I would say no more than a 15% slowdown on each of the following benchmarks:
- stringbench.py -u (http://svn.python.org/view/sandbox/trunk/stringbench/)
- iobench.py -t (in Tools/iobench/)
- the json_dump, json_load and regex_v8 tests from http://hg.python.org/benchmarks/
I now have benchmark results for these; numbers are for revision c10bcab2aac7, comparing to 1ea72da11724 (wide unicode), on 64-bit Linux with gcc 4.6.1 running on a Core i7 2.8GHz.
- stringbench gives a 10% slowdown on total time; individual tests take between 78% and 220% of their original time. The cost is typically not in performing the string operations themselves, but in the creation of the result strings. In PEP 393, a buffer must be scanned for the highest code point, which means that each byte must be inspected twice (a second time when the copying occurs).
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
- the json and regex_v8 tests see a slowdown of below 1%.
The slowdown is larger when compared with a narrow Unicode build.
Additionally, it would be nice if you could run at least some of the test_bigmem tests, according to your system's available RAM.
Running only StrTest with 4.5G allows me to run 2 tests (test_encode_raw_unicode_escape and test_encode_utf7); this sees a slowdown of 37% in Linux user time. Regards, Martin
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). Thanks Antoine.
On 28.08.2011 22:01, Antoine Pitrou wrote:
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le").
But those aren't used in iobench, are they? Regards, Martin
On Sunday 28 August 2011 at 22:23 +0200, "Martin v. Löwis" wrote:
On 28.08.2011 22:01, Antoine Pitrou wrote:
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le").
But those aren't used in iobench, are they?
I was not very clear, but you can change the encoding used in iobench by using the "-E" command-line option (while UTF-8 is the default if you don't specify anything). For example:

$ ./python Tools/iobench/iobench.py -t -E latin1
Preparing files...
Text unit = one character (latin1-decoded)

** Text input **

[ 400KB ] read one unit at a time...          5.17 MB/s
[ 400KB ] read 20 units at a time...          77.6 MB/s
[ 400KB ] read one line at a time...           209 MB/s
[ 400KB ] read 4096 units at a time...         509 MB/s
[  20KB ] read whole contents at once...       885 MB/s
[ 400KB ] read whole contents at once...       730 MB/s
[  10MB ] read whole contents at once...       726 MB/s

(etc.)

Regards Antoine.
On 28.08.2011 22:01, Antoine Pitrou wrote:
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le").
Those haven't been ported to the new API, yet. Consider, for example, d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this is a 25% speedup for PEP 393. Regards, Martin
On 28/08/2011 23:06, "Martin v. Löwis" wrote:
On 28.08.2011 22:01, Antoine Pitrou wrote:
- the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown.
I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le").
Those haven't been ported to the new API, yet. Consider, for example, d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this is a 25% speedup for PEP 393.
If I understand correctly, the performance now depends highly on the characters used? A pure ASCII string is faster than a string with characters in the ISO-8859-1 charset? Is it also true for BMP characters vs non-BMP characters? Do these benchmark tools use only ASCII characters, or also some ISO-8859-1 characters? Or, better, different Unicode ranges in different tests? Victor
Those haven't been ported to the new API, yet. Consider, for example, d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this is a 25% speedup for PEP 393.
If I understand correctly, the performance now depends highly on the characters used? A pure ASCII string is faster than a string with characters in the ISO-8859-1 charset?
How did you infer that from above paragraph??? ASCII and Latin-1 are mostly identical in terms of performance - the ASCII decoder should be slightly slower than the Latin-1 decoder, since the ASCII decoder needs to check for errors, whereas the Latin-1 decoder will never be confronted with errors. What matters is
a) is the codec already rewritten to use the new representation, or must it go through Py_UNICODE[] first, then requiring a second copy to the canonical form?
b) what is the cost of finding out the highest character - regardless of what the highest character turns out to be?
Is it also true for BMP characters vs non-BMP characters?
Well... If you are talking about the ASCII and Latin-1 codecs - neither of these supports most BMP characters, let alone non-BMP characters. In general, non-BMP characters are more expensive to process since they take more space.
Do these benchmark tools use only ASCII characters, or also some ISO-8859-1 characters?
See for yourself. iobench uses Latin-1, including non-ASCII, but not non-Latin-1.
Or, better, different Unicode ranges in different tests?
That's why I asked for a list of benchmarks to perform. I cannot run an infinite number of benchmarks prior to adoption of the PEP. Regards, Martin
On Monday 29 August 2011 21:34:48, you wrote:
Those haven't been ported to the new API, yet. Consider, for example, d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this is a 25% speedup for PEP 393.
If I understand correctly, the performance now depends highly on the characters used? A pure ASCII string is faster than a string with characters in the ISO-8859-1 charset?
How did you infer that from above paragraph??? ASCII and Latin-1 are mostly identical in terms of performance - the ASCII decoder should be slightly slower than the Latin-1 decoder, since the ASCII decoder needs to check for errors, whereas the Latin-1 decoder will never be confronted with errors.
I don't compare ASCII and ISO-8859-1 decoders. I was asking if decoding b'abc' from ISO-8859-1 is faster than decoding b'ab\xff' from ISO-8859-1, and if yes: why?

Your patch replaces PyUnicode_New(size, 255) ... memcpy(), by PyUnicode_FromUCS1(). I don't understand how it makes Python faster: PyUnicode_FromUCS1() first scans the input string for the maximum code point.

I suppose that the main difference is that the ISO-8859-1 encoded string is stored as the UTF-8 encoded string (shared pointer) if all characters of the string are ASCII characters. In this case, encoding the string to UTF-8 doesn't cost anything, we already have the result. Am I correct? Victor
I don't compare ASCII and ISO-8859-1 decoders. I was asking if decoding b'abc' from ISO-8859-1 is faster than decoding b'ab\xff' from ISO-8859-1, and if yes: why?
No, that makes no difference.
Your patch replaces PyUnicode_New(size, 255) ... memcpy(), by PyUnicode_FromUCS1().
You compared to the wrong revision. PyUnicode_New is already a PEP 393 function, and this version you have been comparing to is indeed faster than the current version. However, it is also incorrect, as it fails to compute the maxchar, and hence fails to detect pure-ASCII strings. See below for the actual diff. It should be obvious why the 393 version is faster: 3.3 currently needs to widen each char (to 16 or 32 bits). Regards, Martin

@@ -5569,41 +5569,8 @@
                               Py_ssize_t size,
                               const char *errors)
 {
-    PyUnicodeObject *v;
-    Py_UNICODE *p;
-    const char *e, *unrolled_end;
-
     /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
-    if (size == 1) {
-        Py_UNICODE r = *(unsigned char*)s;
-        return PyUnicode_FromUnicode(&r, 1);
-    }
-
-    v = _PyUnicode_New(size);
-    if (v == NULL)
-        goto onError;
-    if (size == 0)
-        return (PyObject *)v;
-    p = PyUnicode_AS_UNICODE(v);
-    e = s + size;
-    /* Unrolling the copy makes it much faster by reducing the looping
-       overhead. This is similar to what many memcpy() implementations do. */
-    unrolled_end = e - 4;
-    while (s < unrolled_end) {
-        p[0] = (unsigned char) s[0];
-        p[1] = (unsigned char) s[1];
-        p[2] = (unsigned char) s[2];
-        p[3] = (unsigned char) s[3];
-        s += 4;
-        p += 4;
-    }
-    while (s < e)
-        *p++ = (unsigned char) *s++;
-    return (PyObject *)v;
-
-  onError:
-    Py_XDECREF(v);
-    return NULL;
+    return PyUnicode_FromUCS1((unsigned char*)s, size);
 }

 /* create or adjust a UnicodeEncodeError */
On Sun, Aug 28, 2011 at 21:47, "Martin v. Löwis"
result strings. In PEP 393, a buffer must be scanned for the highest code point, which means that each byte must be inspected twice (a second time when the copying occurs).
This may be a silly question: are there things in place to optimize this for the case where two strings are combined? E.g. highest character in combined string is max(highest character in either of the strings). Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow. Cheers, Dirkjan
On 29/08/2011 11:03, Dirkjan Ochtman wrote:
On Sun, Aug 28, 2011 at 21:47, "Martin v. Löwis"
wrote: result strings. In PEP 393, a buffer must be scanned for the highest code point, which means that each byte must be inspected twice (a second time when the copying occurs).
This may be a silly question: are there things in place to optimize this for the case where two strings are combined? E.g. highest character in combined string is max(highest character in either of the strings).
The "double-scan" issue is only for codec decoders. If you combine two Unicode objects (a+b), you already know the highest code point and the kind of each string. Victor
On Aug 29, 2011, at 11:03 AM, Dirkjan Ochtman wrote:
Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow.
I've thought about this, and about a similar split between language changes and stdlib changes (i.e. new modules such as regex). Probably the best thing to do would be to allocate some 1000's to the different categories, like we did for the 3xxx Python 3k PEPS (now largely moot though). -Barry
On Mon, Aug 29, 2011 at 18:24, Barry Warsaw
Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow.
I've thought about this, and about a similar split between language changes and stdlib changes (i.e. new modules such as regex). Probably the best thing to do would be to allocate some 1000's to the different categories, like we did for the 3xxx Python 3k PEPS (now largely moot though).
Allocating 1000's seems sensible enough to me. And yes, the division between recents 3x and non-3x PEPs seems quite arbitrary. Cheers, Dirkjan P.S. Perhaps the index could list accepted and open PEPs before meta and informational? And maybe reverse the order under some headings, for example in the finished category...
On Mon, 29 Aug 2011 18:38:23 +0200
Dirkjan Ochtman
On Mon, Aug 29, 2011 at 18:24, Barry Warsaw
wrote: Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow.
I've thought about this, and about a similar split between language changes and stdlib changes (i.e. new modules such as regex). Probably the best thing to do would be to allocate some 1000's to the different categories, like we did for the 3xxx Python 3k PEPS (now largely moot though).
Allocating 1000's seems sensible enough to me.
And yes, the division between recents 3x and non-3x PEPs seems quite arbitrary.
I like the 3k numbers myself :))
Barry Warsaw, 29.08.2011 18:24:
On Aug 29, 2011, at 11:03 AM, Dirkjan Ochtman wrote:
Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow.
I've thought about this, and about a similar split between language changes and stdlib changes (i.e. new modules such as regex). Probably the best thing to do would be to allocate some 1000's to the different categories, like we did for the 3xxx Python 3k PEPS (now largely moot though).
These things tend to get somewhat clumsy over time, though. What about a stdlib change that only applies to CPython for some reason, e.g. because no other implementation currently has that module? I think it's ok to make a coarse-grained distinction by numbers, but there should also be a way to tag PEPs textually. Stefan
On Aug 29, 2011, at 06:55 PM, Stefan Behnel wrote:
These things tend to get somewhat clumsy over time, though. What about a stdlib change that only applies to CPython for some reason, e.g. because no other implementation currently has that module? I think it's ok to make a coarse-grained distinction by numbers, but there should also be a way to tag PEPs textually.
Yeah, the categories would be pretty coarse grained, and their orthogonality would cause classification problems. I suppose we could use some kind of hashtag approach. OTOH, I'm not entirely sure it's worth it either. ;) I think we'd need a concrete proposal and someone willing to hack the PEP0 autogen tools. -Barry
On 29.08.2011 11:03, Dirkjan Ochtman wrote:
On Sun, Aug 28, 2011 at 21:47, "Martin v. Löwis"
wrote: result strings. In PEP 393, a buffer must be scanned for the highest code point, which means that each byte must be inspected twice (a second time when the copying occurs).
This may be a silly question: are there things in place to optimize this for the case where two strings are combined? E.g. highest character in combined string is max(highest character in either of the strings).
Unicode_Concat goes like this:

    maxchar = PyUnicode_MAX_CHAR_VALUE(u);
    if (PyUnicode_MAX_CHAR_VALUE(v) > maxchar)
        maxchar = PyUnicode_MAX_CHAR_VALUE(v);

    /* Concat the two Unicode strings */
    w = (PyUnicodeObject *) PyUnicode_New(
        PyUnicode_GET_LENGTH(u) + PyUnicode_GET_LENGTH(v),
        maxchar);
    if (w == NULL)
        goto onError;
    PyUnicode_CopyCharacters(w, 0, u, 0, PyUnicode_GET_LENGTH(u));
    PyUnicode_CopyCharacters(w, PyUnicode_GET_LENGTH(u),
                             v, 0, PyUnicode_GET_LENGTH(v));
Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow.
Well, no. This would equally apply to every single patch, and is just not feasible. Instead, alternative implementations typically target a CPython version, and then find out what features they need to implement to claim conformance. Regards, Martin
On Thu, Aug 25, 2011 at 1:24 AM, "Martin v. Löwis"
With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
If I count correctly, it's only three *additional* words (compared to 3.2): four new ones, minus one that is removed. In addition, it drops a memory block. Assuming a malloc overhead of two pointers per malloc block, we get one additional pointer. [...]
But strings are allocated via PyObject_Malloc(), i.e. the custom arena-based allocator -- isn't its overhead (for small objects) less than 2 pointers per block? -- --Guido van Rossum (python.org/~guido)
But strings are allocated via PyObject_Malloc(), i.e. the custom arena-based allocator -- isn't its overhead (for small objects) less than 2 pointers per block?
Ah, right, I missed that. Indeed, those have no header, and the only overhead is the padding to a multiple of 8. That shifts the picture; I hope the table below is correct, assuming ASCII strings.

3.2: 7 pointers (adds 4 bytes padding on 32-bit systems)
393: 10 pointers

string | 32-bit pointer | 32-bit pointer | 64-bit pointer
size   | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t
       |  3.2   |  393  |  3.2   |  393  |  3.2   |  393  |
-----------------------------------------------------------
   1   |   40   |   48  |   40   |   48  |   64   |   88  |
   2   |   40   |   48  |   48   |   48  |   72   |   88  |
   3   |   40   |   48  |   48   |   48  |   72   |   88  |
   4   |   48   |   48  |   56   |   48  |   80   |   88  |
   5   |   48   |   48  |   56   |   48  |   80   |   88  |
   6   |   48   |   48  |   64   |   48  |   88   |   88  |
   7   |   48   |   48  |   64   |   48  |   88   |   88  |
   8   |   56   |   56  |   72   |   56  |   96   |   86  |

So 1-byte strings increase in size; very short strings increase on 16-bit-wchar_t systems and 64-bit systems. Short strings keep their size, and long strings save.

Regards, Martin
It would be nice if someone wrote a test to roughly verify these
numbers, e.g. by allocating lots of strings of a certain size and
measuring the process size before and after (being careful to adjust
for the list or other data structure required to keep those objects
alive).
--Guido
Also, please add the table (and the reasoning that led to it) to the PEP.
-- --Guido van Rossum (python.org/~guido)
tl;dr: PEP-393 reduces the memory usage for strings of a very small Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB.

On 26.08.2011 16:55, Guido van Rossum wrote:
It would be nice if someone wrote a test to roughly verify these numbers, e.v. by allocating lots of strings of a certain size and measuring the process size before and after (being careful to adjust for the list or other data structure required to keep those objects alive).
I have now written a Django application to measure the effect of PEP 393, using the debug mode (to find all strings), and sys.getsizeof: https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof...

The results for 3.3 and pep-393 are attached. The Django app is small in every respect: trivial ORM, very few objects (just for the sake of exercising the ORM at all), no templating, short strings. The memory snapshot is taken in the middle of a request. The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE.

The tally of strings by length confirms that both tests have indeed comparable sets of objects (not surprising since it is identical Django source code and the identical application). Most strings in this benchmark are shorter than 16 characters, and a few have several thousand characters. The tally of byte lengths shows that it's the really long memory blocks that are gone with the PEP.

Digging into the internal representation, it's possible to estimate "unaccounted" bytes. For PEP 393:

bytes - 80*strings - (chars+strings) = 190053

This is the total of the wchar_t and UTF-8 representations for objects that have them, plus any 2-byte and four-byte strings accounted incorrectly in the above formula. Unfortunately, for "default":

bytes - 56*strings - 4*(chars+strings) = 0

as unicode__sizeof__ doesn't account for the (separate) PyBytes object that may carry the default encoding. So in practice, the 3.3 number should be somewhat larger.

In both cases, the app didn't account for internal fragmentation; this would be possible by rounding up each string size to the next multiple of 8 (given that it's all allocated through the object allocator). It should be possible to squeeze a little bit out of the 190kB, by finding objects for which the wchar_t or UTF-8 representations are created unnecessarily. Regards, Martin
On Mon, 29 Aug 2011 22:32:01 +0200
"Martin v. Löwis"
I have now written a Django application to measure the effect of PEP 393, using the debug mode (to find all strings), and sys.getsizeof:
https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof...
The results for 3.3 and pep-393 are attached.
This looks very nice. Is 3.3 a wide build? (how about a narrow build?) (is it with your own port of Django to py3k, or is there an official branch for it?) Regards Antoine.
This looks very nice. Is 3.3 a wide build? (how about a narrow build?)
It's a wide build. For reference, I also attach 64-bit narrow build results, and 32-bit results (wide, narrow, and PEP 393). Savings are much smaller in narrow builds (larger on 32-bit systems than on 64-bit systems).
(is it with your own port of Django to py3k, or is there an official branch for it?)
It's https://bitbucket.org/loewis/django-3k Regards, Martin
By the way, I don't know if you're working on it, but StringIO seems a
bit broken right now. test_memoryio crashes here:
test_newline_cr (test.test_memoryio.CStringIOTest) ... Fatal Python error: Segmentation fault
Current thread 0x00007f3f6353b700:
File "/home/antoine/cpython/pep-393/Lib/test/test_memoryio.py", line 583 in test_newline_cr
File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 386 in _executeTestPart
File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 441 in run
File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 493 in __call__
File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 105 in run
File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 67 in __call__
File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 105 in run
File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 67 in __call__
File "/home/antoine/cpython/pep-393/Lib/unittest/runner.py", line 168 in run
File "/home/antoine/cpython/pep-393/Lib/test/support.py", line 1293 in _run_suite
File "/home/antoine/cpython/pep-393/Lib/test/support.py", line 1327 in run_unittest
File "/home/antoine/cpython/pep-393/Lib/test/test_memoryio.py", line 718 in test_main
File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 1139 in runtest_inner
File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 915 in runtest
File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 707 in main
File "/home/antoine/cpython/pep-393/Lib/test/__main__.py", line 13 in <module>
File "/home/antoine/cpython/pep-393/Lib/runpy.py", line 73 in _run_code
File "/home/antoine/cpython/pep-393/Lib/runpy.py", line 160 in _run_module_as_main
Erreur de segmentation (core dumped)
And here's an excerpt of the C stack:
#0 find_control_char (translated=0, universal=0, readnl=<value optimized out>, kind=4, start=0xa75cf4 "c", end=
0xa75d00 "", consumed=0x7fffffffab38) at ./Modules/_io/textio.c:1617
#1 _PyIO_find_line_ending (translated=0, universal=0, readnl=<value optimized out>, kind=4, start=0xa75cf4 "c", end=
0xa75d00 "", consumed=0x7fffffffab38) at ./Modules/_io/textio.c:1678
#2 0x00000000004ed3be in _stringio_readline (self=0x7ffff291a250) at ./Modules/_io/stringio.c:271
#3 stringio_iternext (self=0x7ffff291a250) at ./Modules/_io/stringio.c:322
#4 0x000000000052aa19 in listextend (self=0x7ffff2900ab8, b=<value optimized out>) at Objects/listobject.c:844
#5 0x000000000052afe8 in list_init (self=0x7ffff2900ab8, args=<value optimized out>, kw=<value optimized out>)
at Objects/listobject.c:2312
#6 0x00000000004283c7 in type_call (type=<value optimized out>, args=(<_io.StringIO at remote 0x7ffff291a250>,),
kwds=0x0) at Objects/typeobject.c:692
#7 0x00000000004fdf17 in PyObject_Call (func=
"Martin v. Löwis" wrote:
tl;dr: PEP-393 reduces the memory usage for strings of a very small Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB.
On 26.08.2011 16:55, Guido van Rossum wrote:
It would be nice if someone wrote a test to roughly verify these numbers, e.v. by allocating lots of strings of a certain size and measuring the process size before and after (being careful to adjust for the list or other data structure required to keep those objects alive).
I have now written a Django application to measure the effect of PEP 393, using the debug mode (to find all strings), and sys.getsizeof:
https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof...
The results for 3.3 and pep-393 are attached.
The Django app is small in every respect: trivial ORM, very few objects (just for the sake of exercising the ORM at all), no templating, short strings. The memory snapshot is taken in the middle of a request.
The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE.
For comparison, could you run the test of the unmodified Python 3.3 on a 16-bit Py_UNICODE version as well? Thanks, -- Marc-Andre Lemburg
"Martin v. Löwis", 24.08.2011 20:15:
Guido has agreed to eventually pronounce on PEP 393. Before that can happen, I'd like to collect feedback on it. There have been a number of voice supporting the PEP in principle
Absolutely.
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.
Just repeating here that I'd like to see the buffer void* changed into a union of pointers that state the exact layout type. IMHO, that would clarify the implementation and make it clearer that it's correct to access the data buffer as a flat array. (Obviously, code that does that is subject to future changes, that's why there are macros.) Stefan
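(Something along these lines, purely illustrative - neither the name nor the exact members are part of the PEP:)

#include "Python.h"

/* Sketch: a typed data pointer that could replace the plain void* buffer. */
typedef union {
    void    *any;
    Py_UCS1 *latin1;   /* PyUnicode_1BYTE_KIND */
    Py_UCS2 *ucs2;     /* PyUnicode_2BYTE_KIND */
    Py_UCS4 *ucs4;     /* PyUnicode_4BYTE_KIND */
} PyUnicode_DataPointer;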
"Martin v. Löwis", 24.08.2011 20:15:
- issues to be considered (unclarities, bugs, limitations, ...)
A problem of the current implementation is the need for calling PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to insufficient memory). Basically, this means that even something as trivial as trying to get the length of a Unicode string can now result in an error. I just noticed this when rewriting Cython's helper function that searches a unicode string for a (Py_UCS4) character. Previously, the entire function was safe, could never produce an error and therefore always returned a boolean result. In the new world, the caller of this function must check and propagate errors. This may not be a major issue in most cases, but it can have a non-trivial impact on user code, depending on how deep in a call chain this happens and on how much control the user has over the call chain (think of a C callback, for example). Also, even in the case that there is no error, the potential need to build up the string on request means that the run time and memory requirements of an algorithm are less predictable now as they depend on the origin of the input and not just its Python level string content. I would be happier with an implementation that avoided this by always instantiating the data buffer right from the start, instead of carrying only a Py_UNICODE buffer for old-style instances. Stefan
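(To illustrate the change in the helper's contract, a hedged sketch - not Cython's actual code, and assuming the READY/READ macros as described in the PEP draft:)

#include "Python.h"

/* Returns 1 if ch occurs in ustr, 0 if not, -1 on error (exception set).
   Before PEP 393 the same helper could simply return 0 or 1. */
static int
unicode_contains_ucs4(PyObject *ustr, Py_UCS4 ch)
{
    Py_ssize_t i, n;
    if (PyUnicode_READY(ustr) < 0)
        return -1;              /* e.g. MemoryError while readying the string */
    n = PyUnicode_GET_LENGTH(ustr);
    for (i = 0; i < n; i++) {
        if (PyUnicode_READ_CHAR(ustr, i) == ch)
            return 1;
    }
    return 0;
}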
Stefan Behnel, 25.08.2011 20:47:
"Martin v. Löwis", 24.08.2011 20:15:
- issues to be considered (unclarities, bugs, limitations, ...)
A problem of the current implementation is the need for calling PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to insufficient memory). Basically, this means that even something as trivial as trying to get the length of a Unicode string can now result in an error.
Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there is *any* code out there that expects this macro to ever return NULL. This means that the current implementation has actually broken the old API. Just allocate an "80% of your memory" long string using the new API and then call PyUnicode_AS_UNICODE() on it to see what I mean.

Sadly, a quick look at a couple of recent commits in the pep-393 branch suggested that it is not even always obvious to you as the authors which macros can be called safely and which cannot. I immediately spotted a bug in one of the updated core functions (unicode_repr, IIRC) where PyUnicode_GET_LENGTH() is called without a previous call to PyUnicode_FAST_READY().

I find it anything but obvious that calling PyUnicode_DATA() and PyUnicode_KIND() is safe as long as the return value is being checked for errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a previous call to PyUnicode_Ready().
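(A hedged sketch of the kind of pre-existing extension code this breaks - illustrative only:)

#include "Python.h"

/* Typical pre-3.3 idiom: the result of PyUnicode_AS_UNICODE() is used
   without a NULL check, because the macro could not fail before. */
static Py_UNICODE
first_char(PyObject *u)
{
    Py_UNICODE *p = PyUnicode_AS_UNICODE(u);  /* may now return NULL ... */
    return p[0];                              /* ... turning this into a crash */
}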
I just noticed this when rewriting Cython's helper function that searches a unicode string for a (Py_UCS4) character. Previously, the entire function was safe, could never produce an error and therefore always returned a boolean result. In the new world, the caller of this function must check and propagate errors. This may not be a major issue in most cases, but it can have a non-trivial impact on user code, depending on how deep in a call chain this happens and on how much control the user has over the call chain (think of a C callback, for example).
Also, even in the case that there is no error, the potential need to build up the string on request means that the run time and memory requirements of an algorithm are less predictable now as they depend on the origin of the input and not just its Python level string content.
I would be happier with an implementation that avoided this by always instantiating the data buffer right from the start, instead of carrying only a Py_UNICODE buffer for old-style instances.
Stefan
Stefan Behnel, 25.08.2011 23:30:
Stefan Behnel, 25.08.2011 20:47:
"Martin v. Löwis", 24.08.2011 20:15:
- issues to be considered (unclarities, bugs, limitations, ...)
A problem of the current implementation is the need for calling PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to insufficient memory). Basically, this means that even something as trivial as trying to get the length of a Unicode string can now result in an error.
Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there is *any* code out there that expects this macro to ever return NULL. This means that the current implementation has actually broken the old API. Just allocate an "80% of your memory" long string using the new API and then call PyUnicode_AS_UNICODE() on it to see what I mean.
Sadly, a quick look at a couple of recent commits in the pep-393 branch suggested that it is not even always obvious to you as the authors which macros can be called safely and which cannot. I immediately spotted a bug in one of the updated core functions (unicode_repr, IIRC) where PyUnicode_GET_LENGTH() is called without a previous call to PyUnicode_FAST_READY().
I find it anything but obvious that calling PyUnicode_DATA() and PyUnicode_KIND() is safe as long as the return value is being checked for errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a previous call to PyUnicode_Ready().
And, adding to my own mail yet another time, the current header file states this:

"""
/* String contains only wstr byte characters.  This is only possible when
   the string was created with a legacy API and PyUnicode_Ready() has not
   been called yet.  Note that PyUnicode_KIND() calls PyUnicode_FAST_READY()
   so PyUnicode_WCHAR_KIND is only possible as a intialized value not as a
   result of PyUnicode_KIND(). */
#define PyUnicode_WCHAR_KIND 0
"""

From my understanding, this is incorrect. When I call PyUnicode_KIND() on an old style object and it fails to allocate the string buffer, I would expect that I actually get PyUnicode_WCHAR_KIND back as a result, as the SSTATE_KIND_* value in the "state" field has not been initialised yet at that point. Stefan
Stefan Behnel, 25.08.2011 23:30:
Sadly, a quick look at a couple of recent commits in the pep-393 branch suggested that it is not even always obvious to you as the authors which macros can be called safely and which cannot. I immediately spotted a bug in one of the updated core functions (unicode_repr, IIRC) where PyUnicode_GET_LENGTH() is called without a previous call to PyUnicode_FAST_READY().
Here is another example from unicodeobject.c, commit 56aaa17fc05e:

+    switch(PyUnicode_KIND(string)) {
+    case PyUnicode_1BYTE_KIND:
+        list = ucs1lib_splitlines(
+            (PyObject*) string, PyUnicode_1BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    case PyUnicode_2BYTE_KIND:
+        list = ucs2lib_splitlines(
+            (PyObject*) string, PyUnicode_2BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    case PyUnicode_4BYTE_KIND:
+        list = ucs4lib_splitlines(
+            (PyObject*) string, PyUnicode_4BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    default:
+        assert(0);
+        list = 0;
+    }

The assert(0) at the end will hit when the system is running out of memory while working on a wchar string. Stefan
On 26.08.2011 17:55, Stefan Behnel wrote:
Stefan Behnel, 25.08.2011 23:30:
Sadly, a quick look at a couple of recent commits in the pep-393 branch suggested that it is not even always obvious to you as the authors which macros can be called safely and which cannot. I immediately spotted a bug in one of the updated core functions (unicode_repr, IIRC) where PyUnicode_GET_LENGTH() is called without a previous call to PyUnicode_FAST_READY().
Here is another example from unicodeobject.c, commit 56aaa17fc05e:
+    switch(PyUnicode_KIND(string)) {
+    case PyUnicode_1BYTE_KIND:
+        list = ucs1lib_splitlines(
+            (PyObject*) string, PyUnicode_1BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    case PyUnicode_2BYTE_KIND:
+        list = ucs2lib_splitlines(
+            (PyObject*) string, PyUnicode_2BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    case PyUnicode_4BYTE_KIND:
+        list = ucs4lib_splitlines(
+            (PyObject*) string, PyUnicode_4BYTE_DATA(string),
+            PyUnicode_GET_LENGTH(string), keepends);
+        break;
+    default:
+        assert(0);
+        list = 0;
+    }
The assert(0) at the end will hit when the system is running out of memory while working on a wchar string.
No, that should not happen: it should never get to this point. I agree with your observation that something should be done about error handling, and will update the PEP shortly. I propose that PyUnicode_Ready should be explicitly called on input where raising an exception is feasible. In contexts where it is not feasible (such as reading a character, or reading the length or the kind), failing to ready the string should cause a fatal error. What do you think? Regards, Martin
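(For concreteness, a sketch of the two proposed policies - illustrative only, the helper names are made up:)

#include "Python.h"

/* Policy 1: at API boundaries, ready the string and raise on failure. */
static int
boundary_ready(PyObject *u)
{
    if (PyUnicode_READY(u) < 0)
        return -1;              /* MemoryError already set for the caller */
    return 0;
}

/* Policy 2: in contexts that cannot report an error, treat failure as fatal. */
static Py_ssize_t
length_cannot_raise(PyObject *u)
{
    if (PyUnicode_READY(u) < 0)
        Py_FatalError("PyUnicode_Ready() failed in a context that cannot raise");
    return PyUnicode_GET_LENGTH(u);
}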
"Martin v. Löwis", 26.08.2011 18:56:
I agree with your observation that something should be done about error handling, and will update the PEP shortly. I propose that PyUnicode_Ready should be explicitly called on input where raising an exception is feasible. In contexts where it is not feasible (such as reading a character, or reading the length or the kind), failing to ready the string should cause a fatal error.
I consider this an increase in complexity. It will then no longer be enough to access the data: the user will first have to figure out a suitable place in the code to make sure it's actually there, potentially forgetting about it because it works in all test cases, or potentially triggering a huge amount of overhead that copies and 'recodes' the string data by executing one of the macros that does it automatically.

For the specific case of Cython, I would guess that I could just add another special case that reads the data from the Py_UNICODE buffer and combines surrogates as needed, but that will only work in some cases (specifically not for indexing). And outside of Cython, most normal user code won't do that.

My gut feeling leans towards a KISS approach. If you go the route to require an explicit point for triggering PyUnicode_Ready() calls, why not just go all the way and make it completely explicit in *all* cases? I.e. remove all implicit calls from the macros and make it part of the new API semantics that users *must* call PyUnicode_FAST_READY() before doing anything with a new string data layout. Much fewer surprises.

Note that there isn't currently an official macro way to figure out that the flexible string layout has not been initialised yet, i.e. that wstr is set but str is not. If the implicit PyUnicode_Ready() calls get removed, PyUnicode_KIND() could take that place by simply returning WSTR_KIND.

That being said, the main problem I currently see is that basically all existing code needs to be updated in order to handle these errors. Otherwise, it would be possible to trigger crashes by properly forging a string and passing it into an unprepared C library to let it run into a NULL pointer return value of PyUnicode_AS_UNICODE(). Stefan
Stefan Behnel, 26.08.2011 20:28:
"Martin v. Löwis", 26.08.2011 18:56:
I agree with your observation that something should be done about error handling, and will update the PEP shortly. I propose that PyUnicode_Ready should be explicitly called on input where raising an exception is feasible. In contexts where it is not feasible (such as reading a character, or reading the length or the kind), failing to ready the string should cause a fatal error. [...] My gut feeling leans towards a KISS approach. If you go the route to require an explicit point for triggering PyUnicode_Ready() calls, why not just go all the way and make it completely explicit in *all* cases? I.e. remove all implicit calls from the macros and make it part of the new API semantics that users *must* call PyUnicode_FAST_READY() before doing anything with a new string data layout. Much fewer surprises.
Note that there isn't currently an official macro way to figure out that the flexible string layout has not been initialised yet, i.e. that wstr is set but str is not. If the implicit PyUnicode_Ready() calls get removed, PyUnicode_KIND() could take that place by simply returning WSTR_KIND.
Here's a patch that updates only the header file, to make it clear what I mean. Stefan
participants (8)
- "Martin v. Löwis"
- Antoine Pitrou
- Barry Warsaw
- Dirkjan Ochtman
- Guido van Rossum
- M.-A. Lemburg
- Stefan Behnel
- Victor Stinner