[Python-Dev] [Python-checkins] cpython: Optimize string slicing to use the new API
Victor Stinner
victor.stinner at haypocalc.com
Wed Oct 5 01:59:35 CEST 2011
On 04/10/2011 20:09, "Martin v. Löwis" wrote:
> On 04.10.11 19:50, Antoine Pitrou wrote:
>> On Tue, 04 Oct 2011 19:49:09 +0200
>> "Martin v. Löwis"<martin at v.loewis.de> wrote:
>>
>>>> + result = PyUnicode_New(slicelength, PyUnicode_MAX_CHAR_VALUE(self));
>>>
>>> This is incorrect: the maxchar of the slice might be smaller than the
>>> maxchar of the input string.
>>
>> I thought that heuristic would be good enough. I'll try to fix it.
>
> No - strings must always be in the canonical form.
I added a check in _PyUnicode_CheckConsistency() (debug mode) to ensure
that newly created strings always use the most efficient storage.
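To make the slice problem concrete, here is a minimal Python sketch of
the idea. It is a model, not the C code in unicodeobject.c: the helper
names max_char_bucket and is_canonical are hypothetical, and the only
thing taken from PEP 393 is the set of maxchar thresholds (0x7F, 0xFF,
0xFFFF, 0x10FFFF).

```python
def max_char_bucket(text):
    """Return the smallest PEP 393 maxchar bucket that can hold text."""
    # O(n) scan over every code point of the string.
    maxchar = max(map(ord, text), default=0)
    for bucket in (0x7F, 0xFF, 0xFFFF, 0x10FFFF):
        if maxchar <= bucket:
            return bucket


def is_canonical(text, declared_bucket):
    """True if declared_bucket is the tightest bucket for text."""
    return declared_bucket == max_char_bucket(text)


source = "abc\u20ac"   # contains U+20AC, so it needs the 0xFFFF bucket
piece = source[:3]     # "abc" is pure ASCII

source_bucket = max_char_bucket(source)
# Reusing the source's bucket for the slice over-allocates, which is
# what passing PyUnicode_MAX_CHAR_VALUE(self) for the slice amounts to.
slice_is_canonical = is_canonical(piece, source_bucket)
```

In this model the slice only becomes canonical if its own maximum
character is recomputed, which costs one pass over the slice.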
> For example, PyUnicode_RichCompare considers strings unequal if they
> have different kinds. As a consequence, your slice
> result may not compare equal to a canonical variant of itself.
I see this as a micro-optimization. IMO we should *not* rely on this
assumption, because we cannot expect all developers of third-party
modules to write perfect code, and some (lazy developers!) may prefer
to use a fixed maximum character (e.g. 0xFFFF).
To be able to rely on such an assumption, we have to make sure that
strings are in canonical form (always check before using a string?).
But that would slow down Python, because you have to scan the whole
string to get the maximum character (see my change in
_PyUnicode_CheckConsistency).
I would prefer to drop such micro-optimization and tolerate
non-canonical strings (strings not using the most efficient storage).
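A rough Python model of the two comparison strategies, as a sketch
only. ModelStr, make, equal_assuming_canonical and equal_tolerant are
hypothetical names for illustration; none of them is the real
PyUnicode_RichCompare code. "kind" is the number of bytes per
character (1, 2 or 4).

```python
from collections import namedtuple

# Hypothetical stand-in for a PEP 393 string: a storage kind plus the
# code points it holds.
ModelStr = namedtuple("ModelStr", ["kind", "points"])


def make(text, kind):
    """Build a model string with an explicitly chosen storage kind."""
    limit = 1 << (8 * kind)
    if any(ord(ch) >= limit for ch in text):
        raise ValueError("text does not fit in the requested kind")
    return ModelStr(kind, tuple(map(ord, text)))


def equal_assuming_canonical(a, b):
    """Shortcut that is only valid if every string is canonical."""
    if a.kind != b.kind:
        return False
    return a.points == b.points


def equal_tolerant(a, b):
    """Compare code points only, ignoring how they are stored."""
    return a.points == b.points


canonical = make("abc", kind=1)
wide = make("abc", kind=2)   # same text, non-canonical storage
```

With these definitions the shortcut reports the two "abc" strings as
different, while the tolerant comparison reports them as equal at the
price of always comparing the contents.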
Even if PEP 393 is fully backward compatible (except that
PyUnicode_AS_UNICODE and PyUnicode_AsUnicode may now return NULL), it's
already a big change (developers may want to move to the new API to
benefit from the advantages of PEP 393), and very few developers
understand Unicode correctly.
It's safer to see PEP 393 as a best-effort method. Hopefully, most
(or all?) strings created by Python itself are in canonical form.
Victor