[Python-Dev] The future of the wchar_t cache

Mon Oct 22 10:47:15 EDT 2018

On 22Oct2018 1007, Serhiy Storchaka wrote:
> 22.10.18 16:24, Steve Dower wrote:
>> Yes, that's true. But "should reduce ... footprint" is also an 
>> optimisation that deserves a benchmark by that standard. Also, I'm 
>> proposing keeping the 'kind' as UCS-2 when the string is created from 
>> UCS-2 data that is likely to be used as UCS-2. We would not create the 
>> UCS-1 version in this case, so it's not the same as prefilling the 
>> cache, but it would cost a bit of memory in exchange for CPU. If 
>> slicing and concatenation between matching kinds also preserved the 
>> kind, a lot of path handling code could avoid back-and-forth conversions.
> 
> Oh, I'm afraid this will complicate the whole code of unicodeobject.c (and 
> several other files) a lot and can introduce many subtle bugs.
> 
> For example, when you search a UCS2 string in a UCS1 string, the current 
> code returns the result fast, because a UCS1 string can't contain 
> codes > 0xff, and a UCS2 string should contain codes > 0xff. And there 
> are many such assumptions.

That doesn't change though, as we're only ever expanding the range. So 
searching a UCS2 string in a UCS2 string that doesn't contain any actual 
UCS2 characters is the only case that would be affected, and whether 
that case occurs more than the UCS2->UCS1->UCS2 conversion case is 
something we can measure (but I'd be surprised if substring searches 
occur more frequently than OS conversions).
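
For anyone following along, the search shortcut quoted above can be modelled 
in a few lines of Python. This is only a sketch of the idea, not the code in 
unicodeobject.c: minimal_kind and find_with_kind_shortcut are names made up 
for illustration, and the model assumes every string is stored in its 
narrowest possible kind, as it is today.

```python
def minimal_kind(text):
    """Narrowest storage width (bytes per code point) that can hold text.

    Hypothetical helper modelling the rule that a string's kind follows
    its largest code point.
    """
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return 1      # UCS1 (Latin-1 range)
    if top <= 0xFFFF:
        return 2      # UCS2
    return 4          # UCS4


def find_with_kind_shortcut(haystack, needle):
    """Model of a search that bails out early on a kind mismatch.

    If the needle needs a wider kind than the haystack, it must contain
    a code point the haystack cannot hold, so no scan is needed.
    """
    if minimal_kind(needle) > minimal_kind(haystack):
        return -1
    return haystack.find(needle)
```

The shortcut is only sound while "kind is wider" implies "contains a wider 
character", which is the assumption under discussion.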

Currently, unicode_compare_eq exits early when the kinds do not match, 
and that would be a problem (but is also easily fixable). But other 
string operations already handle mismatched kinds.
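
A rough Python model of the comparison problem, under the assumption that a 
string could carry a kind wider than its contents require. ModelStr and both 
compare functions are hypothetical stand-ins, not the actual C code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStr:
    """Hypothetical stand-in for a str object with an explicit kind."""
    kind: int   # storage width in bytes per code point: 1, 2 or 4
    text: str   # the code points themselves


def compare_eq_early_exit(a, b):
    """Model of an equality check that treats differing kinds as unequal."""
    if len(a.text) != len(b.text):
        return False
    if a.kind != b.kind:
        return False          # valid only if kinds are always minimal
    return a.text == b.text


def compare_eq_kind_agnostic(a, b):
    """Model of a check that compares code points regardless of kind."""
    return len(a.text) == len(b.text) and a.text == b.text


# The same path, once stored narrow and once kept as UCS2 (assumed case).
narrow = ModelStr(kind=1, text="C:\\temp")
wide = ModelStr(kind=2, text="C:\\temp")
```

With these two values, compare_eq_early_exit(narrow, wide) reports them as 
unequal while compare_eq_kind_agnostic(narrow, wide) reports them as equal, 
which is the mismatch that would need fixing.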

Cheers,
Steve