The future of the wchar_t cache
Currently the PyUnicode object contains two caches: one for the UTF-8 representation and one for the wchar_t representation. They are needed not for optimization but for supporting C API functions which return borrowed references to such representations.

The UTF-8 cache has always been in unicode objects (though in Python 2 it was an 8-bit representation cache, not a UTF-8 cache). Initially it was needed for compatibility with the 8-bit str type, for implementing the "s" and "z" format units in PyArg_Parse(). Now it is also used for PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of wchar_t, and the C API which returned a pointer to Py_UNICODE content now returns a pointer to the cached wchar_t representation. This covers the "u" and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE() and PyUnicode_AS_DATA().

All this increases the size of the unicode object. It includes the constant overhead of additional pointer and size fields, and the overhead of the cached representation, proportional to the string length. The following table shows the number of bytes per character for different kinds, with and without filling the specified caches.

       raw  +utf8  +wchar_t        +utf8+wchar_t
                   Windows  Linux  Windows  Linux
ASCII  1    1      3        5      3        5
UCS1   1    2-3    3        5      4-5      6-7
UCS2   2    3-5    2        6      3-5      7-9
UCS4   4    5-8    6-8      4      7-12     5-8

There is also a newer C API, added in 3.3, for getting the wchar_t representation without using the cache: PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Currently these do use the cache, which has both benefits and disadvantages.

The old Py_UNICODE based API is deprecated and will be removed eventually. I want to ask about the future of the wchar_t cache. Is the benefit of caching the wchar_t representation larger than the disadvantage of spending more memory?
The wchar_t representation is as natural for the Windows API as the UTF-8 representation is for the POSIX API. But in all other cases it is just a waste of memory. Are there reasons for keeping the wchar_t cache after removing the deprecated API?

I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). They were implemented via the old Py_UNICODE based API, and now they don't use deprecated functions. They still use the wchar_t cache if it was created by a previous use of the deprecated API, but they don't create it themselves. Is this the correct decision?

https://bugs.python.org/issue30863
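The per-character costs in the table above can be checked from Python with sys.getsizeof(); a minimal sketch (absolute sizes vary across CPython versions and platforms, so it looks only at the marginal cost of one extra character):

```python
import sys

# Marginal size of one extra character exposes the PEP 393 "kind":
# 1 byte per char for ASCII/UCS1, 2 for UCS2, 4 for UCS4.
def bytes_per_char(ch):
    return sys.getsizeof(ch * 1000) - sys.getsizeof(ch * 999)

print(bytes_per_char("a"))           # ASCII -> 1
print(bytes_per_char("\u00e9"))      # UCS1  -> 1
print(bytes_per_char("\u0394"))      # UCS2  -> 2
print(bytes_per_char("\U0001f600"))  # UCS4  -> 4
```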
+1 to remove the wchar_t cache. I hope we can remove it in Python 3.9 or 3.10.
On 20.10.2018 13:06, Serhiy Storchaka wrote:
Currently the PyUnicode object contains two caches: one for the UTF-8 representation and one for the wchar_t representation. They are needed not for optimization but for supporting C API functions which return borrowed references to such representations.

The UTF-8 cache has always been in unicode objects (though in Python 2 it was an 8-bit representation cache, not a UTF-8 cache). Initially it was needed for compatibility with the 8-bit str type, for implementing the "s" and "z" format units in PyArg_Parse(). Now it is also used for PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of wchar_t, and the C API which returned a pointer to Py_UNICODE content now returns a pointer to the cached wchar_t representation. This covers the "u" and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE() and PyUnicode_AS_DATA().

All this increases the size of the unicode object. It includes the constant overhead of additional pointer and size fields, and the overhead of the cached representation, proportional to the string length. The following table shows the number of bytes per character for different kinds, with and without filling the specified caches.

       raw  +utf8  +wchar_t        +utf8+wchar_t
                   Windows  Linux  Windows  Linux
ASCII  1    1      3        5      3        5
UCS1   1    2-3    3        5      4-5      6-7
UCS2   2    3-5    2        6      3-5      7-9
UCS4   4    5-8    6-8      4      7-12     5-8

There is also a newer C API, added in 3.3, for getting the wchar_t representation without using the cache: PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Currently these do use the cache, which has both benefits and disadvantages.

The old Py_UNICODE based API is deprecated and will be removed eventually. I want to ask about the future of the wchar_t cache. Is the benefit of caching the wchar_t representation larger than the disadvantage of spending more memory? The wchar_t representation is as natural for the Windows API as the UTF-8 representation is for the POSIX API. But in all other cases it is just a waste of memory. Are there reasons for keeping the wchar_t cache after removing the deprecated API?
I'd be happy to get rid of it. But regarding the use under Windows, I wonder if there's interest in keeping it as a special Windows-only feature, e.g. to speed up the data exchange with the Win32 APIs. I guess it would have to provide a visible (performance?) advantage to justify such special casing over the code removal.

Stefan
On 20Oct2018 0901, Stefan Behnel wrote:
I'd be happy to get rid of it. But regarding the use under Windows, I wonder if there's interest in keeping it as a special Windows-only feature, e.g. to speed up the data exchange with the Win32 APIs. I guess it would have to provide a visible (performance?) advantage to justify such special casing over the code removal.
I think these cases would be just as well served by maintaining the original UCS-2 representation even if the maximum character fits into UCS-1, and only doing the conversion when Python copies the string into a new location.

I don't have numbers, but my instinct says the most impacted operations would be retrieving collections of strings from the OS (avoiding a scan/conversion for each one), comparisons against these collections (in-memory handling for hash/comparison of mismatched KIND), and passing some of these strings back to the OS (conversion back into UCS-2). This is basically a glob/fnmatch/stat sequence, and is the main real scenario I can think of where Python's overhead might become noticeable.

Another option that might be useful is some way to allow the UCS-1/4 <-> UCS-2 conversion to occur outside the GIL. Most of the time when we need to convert we're about to release the GIL (or have just reacquired it), so even without the cache we could probably recover some of the performance impact in parallelism. (That said, these are often tied up in conditions and generated code, so it may not be as easy to do this as retaining the original format.)

Some sort of tracing to see how often the cache is reused after being generated would be interesting, as well as how often the cache is being generated for a string that was originally in UCS-2 but was changed to UCS-1.

Cheers,
Steve
Le sam. 20 oct. 2018 à 18:02, Steve Dower
I don't have numbers, but my instinct says the most impacted operations would be retrieving collections of strings from the OS (avoiding a scan/conversion for each one), comparisons against these collections (in-memory handling for hash/comparison of mismatched KIND), and passing some of these strings back to the OS (conversion back into UCS-2). This is basically a glob/fnmatch/stat sequence, and is the main real scenario I can think of where Python's overhead might become noticeable.
Use os.scandir() to avoid the stat() calls :-)

For code like "for name in os.listdir(): open(name); ..." (replace listdir with scandir if you want to get file metadata), the cache is useless: the fresh string has to be converted to wchar_t* anyway, and the cache is destroyed at the end of the loop iteration without ever having been used.

I'm not saying that the cache is always useless. I just doubt that its use is so common that it really provides any performance benefit.

Victor
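A runnable sketch of this single-use pattern (the temporary directory and file names are invented purely for the illustration): each fresh str from os.listdir() would be converted for the following OS call exactly once, so a per-object wchar_t cache could never be hit a second time.

```python
import os
import tempfile

# Hypothetical files, created only for the illustration.
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        with open(os.path.join(d, "f%d.txt" % i), "w") as f:
            f.write("x")
    # Each `name` is a fresh str object: any wchar_t representation
    # created for the stat() call dies with the object at the end of
    # the loop iteration, so a cache can never be hit twice.
    sizes = sorted(os.stat(os.path.join(d, name)).st_size
                   for name in os.listdir(d))

print(sizes)  # [1, 1, 1]
```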
On 22Oct2018 0413, Victor Stinner wrote:
For code like "for name in os.listdir(): open(name); ..." (replace listdir with scandir if you want to get file metadata), the cache is useless: the fresh string has to be converted to wchar_t* anyway, and the cache is destroyed at the end of the loop iteration without ever having been used.
Agreed the cache is useless here, but since the listdir() result came in as wchar_t we could keep it that way (assuming we'd only be changing it to char), and then there wouldn't have to be a conversion when we immediately pass it back to open().

That said, I spent some time yesterday converting the importlib cache to use scandir and separate caches for dir/file (to avoid the stat calls) and it made very little overall difference. I have to assume the string manipulation dominates. (Making DirEntry lazily calculate its .path had a bigger impact. Also, I didn't try to make Windows flush its own stat cache, and accessing warm files is much faster than cold ones.)
I'm not saying that the cache is always useless. I just doubt that its use is so common that it really provides any performance benefit.
I think that it is mostly useless, but if we can transparently keep many strings at their "native" size, that will handle many of the useful cases, such as the single-use pass-through scenario above.

Cheers,
Steve
On Mon, Oct 22, 2018 at 15:08, Steve Dower wrote:
Agreed the cache is useless here, but since the listdir() result came in as wchar_t we could keep it that way (assuming we'd only be changing it to char), and then there wouldn't have to be a conversion when we immediately pass it back to open().
Serhiy wants to remove the cache, which should *reduce* the Python memory footprint on Windows. You are proposing to fill the cache eagerly, which would increase the Python memory footprint :-/ Your proposed change is an optimisation, so a benchmark is needed to show the benefit. I expect no significant difference on the benchmarks of https://pyperformance.readthedocs.io/ ...
That said, I spent some time yesterday converting the importlib cache to use scandir and separate caches for dir/file (to avoid the stat calls) and it made very little overall difference. I have to assume the string manipulation dominates. (Making DirEntry lazily calculate its .path had a bigger impact. Also, I didn't try to make Windows flush its own stat cache, and accessing warm files is much faster than cold ones.)
I helped Ben Hoyt to design and implement his PEP 471 (os.scandir). When the kernel filesystem cache is filled, the speedup of os.scandir() is hard to notice. But when you work on a network filesystem like NFS, it is something like 5x faster. NFS doesn't cache stat() by default.

Victor
On 22Oct2018 0913, Victor Stinner wrote:
On Mon, Oct 22, 2018 at 15:08, Steve Dower wrote:
Agreed the cache is useless here, but since the listdir() result came in as wchar_t we could keep it that way (assuming we'd only be changing it to char), and then there wouldn't have to be a conversion when we immediately pass it back to open().
Serhiy wants to remove the cache, which should *reduce* the Python memory footprint on Windows.
You are proposing to fill the cache eagerly, which would increase the Python memory footprint :-/ Your proposed change is an optimisation, so a benchmark is needed to show the benefit. I expect no significant difference on the benchmarks of https://pyperformance.readthedocs.io/ ...
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard.

Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2. We would not create the UCS-1 version in this case, so it's not the same as prefilling the cache, but it would cost a bit of memory in exchange for CPU. If slicing and concatenation between matching kinds also preserved the kind, a lot of path handling code could avoid back-and-forth conversions.

The import benchmarks ought to be improved on Windows by this new optimisation, as this is a prime case where we regularly convert strings from what the OS gave us into UCS-1 and back into what the OS expects. But if you don't run the benchmarks on all OSes, then sure, you won't see any difference :)

Cheers,
Steve
On Mon, Oct 22, 2018 at 15:24, Steve Dower wrote:
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard.
pyperformance has a mode to measure memory usage (mostly the memory peak), if someone wants to have a look.
Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2.
Oh. That's a major change in the PEP 393 design. You would have to modify many functions in CPython. Currently, PEP 393 requires that a string always use the most efficient storage, and many optimizations and code paths rely on that assumption. I'm against this change.

Moreover, it's hard to guess how a string will be used later...

Victor
On 22Oct2018 0928, Victor Stinner wrote:
Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2.
Oh. That's a major change in the PEP 393 design. You would have to modify many functions in CPython. Currently, PEP 393 requires that a string always use the most efficient storage, and many optimizations and code paths rely on that assumption.
I don't know that it requires that many modifications - those functions already have to handle UCS-2 content anyway (e.g. if I get a path from scandir() that includes a non-ASCII character), and they're only using the assumption of most efficient storage to determine the resulting storage size of a string operation (which I'm proposing should also be UCS-2 when the source strings are UCS-2, since that's the best indicator we have that it'll be used as UCS-2 later, as well as being the current implementation :) ).
I'm against this change.
Moreover, it's hard to guess how a string will be used later...
Agreed. There are some heuristics we can use, but it's definitely only a guess. That's the nature of this problem - guessing that it *won't* be used as UCS-2 later on is also a guess. Cheers, Steve
On 22.10.18 16:24, Steve Dower wrote:
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard. Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2. We would not create the UCS-1 version in this case, so it's not the same as prefilling the cache, but it would cost a bit of memory in exchange for CPU. If slicing and concatenation between matching kinds also preserved the kind, a lot of path handling code could avoid back-and-forth conversions.
Oh, I'm afraid this will complicate the whole code of unicodeobject.c (and several other files) very much and can introduce a lot of subtle bugs. For example, when you search for a UCS2 string in a UCS1 string, the current code returns the result fast, because a UCS1 string can't contain codes > 0xff, and a UCS2 string should contain codes > 0xff. And there are many such assumptions.
On 22Oct2018 1007, Serhiy Storchaka wrote:
On 22.10.18 16:24, Steve Dower wrote:
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard. Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2. We would not create the UCS-1 version in this case, so it's not the same as prefilling the cache, but it would cost a bit of memory in exchange for CPU. If slicing and concatenation between matching kinds also preserved the kind, a lot of path handling code could avoid back-and-forth conversions.
Oh, I'm afraid this will complicate the whole code of unicodeobject.c (and several other files) very much and can introduce a lot of subtle bugs. For example, when you search for a UCS2 string in a UCS1 string, the current code returns the result fast, because a UCS1 string can't contain codes > 0xff, and a UCS2 string should contain codes > 0xff. And there are many such assumptions.
That doesn't change though, as we're only ever expanding the range. So searching a UCS2 string in a UCS2 string that doesn't contain any actual UCS2 characters is the only case that would be affected, and whether that case occurs more than the UCS2->UCS1->UCS2 conversion case is something we can measure (but I'd be surprised if substring searches occur more frequently than OS conversions).

Currently, unicode_compare_eq exits early when the kinds do not match, and that would be a problem (but is also easily fixable). But other string operations already handle mismatched kinds.

Cheers,
Steve
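The kind-based short-circuit under discussion is observable from Python: a string stored as UCS2 must contain a code point above 0xff, so it can never equal, or occur inside, a UCS1 string. A small sketch:

```python
a = "abc"       # stored as UCS1 (ASCII, in fact)
b = "\u0394bc"  # contains U+0394, so stored as UCS2

# A UCS1 string cannot contain any code point > 0xff, so CPython can
# answer these by looking at the kinds, without scanning characters:
print(a == b)            # False
print("\u0394" in a)     # False
print(b.find("\u0394"))  # 0
```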
On 22Oct2018 1047, Steve Dower wrote:
On 22Oct2018 1007, Serhiy Storchaka wrote:
On 22.10.18 16:24, Steve Dower wrote:
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard. Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2. We would not create the UCS-1 version in this case, so it's not the same as prefilling the cache, but it would cost a bit of memory in exchange for CPU. If slicing and concatenation between matching kinds also preserved the kind, a lot of path handling code could avoid back-and-forth conversions.
Oh, I'm afraid this will complicate the whole code of unicodeobject.c (and several other files) very much and can introduce a lot of subtle bugs.
For example, when you search a UCS2 string in a UCS1 string, the current code returns the result fast, because a UCS1 string can't contain codes > 0xff, and a UCS2 string should contain codes > 0xff. And there are many such assumptions.
That doesn't change though, as we're only ever expanding the range. So searching a UCS2 string in a UCS2 string that doesn't contain any actual UCS2 characters is the only case that would be affected, and whether that case occurs more than the UCS2->UCS1->UCS2 conversion case is something we can measure (but I'd be surprised if substring searches occur more frequently than OS conversions).
Currently, unicode_compare_eq exits early when the kinds do not match, and that would be a problem (but is also easily fixable). But other string operations already handle mismatched kinds.
I made the changes (along with a somewhat expensive update to make __hash__ produce the same value for UCS1 and UCS2 strings) and it works just fine, but the speed difference seems to be fairly trivial. Equality time in particular is slower (a highly optimized memcmp vs. a plain old for loop).

That said, I didn't remove the wchar_t cache (though I tried some tricks to avoid it), so it's possible that once that's gone we'll see an avoidable regression here, but on its own this doesn't contribute much.

Cheers,
Steve
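The invariant behind that __hash__ update can be checked from Python: equal strings must hash equal regardless of the internal kind of each object. In today's CPython the sliced string below is re-narrowed to UCS1 per PEP 393; under the experiment it would have stayed UCS2 and the assertions would still have to hold:

```python
a = "abc"                   # built directly: UCS1
b = ("\u0394" + "abc")[1:]  # built from a UCS2 source, then sliced

# PEP 393 re-narrows the slice to UCS1 today; an implementation that
# kept b as UCS2 would still have to satisfy both assertions.
assert a == b
assert hash(a) == hash(b)
print("hash/equality invariant holds")
```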
On 22.10.18 23:41, Steve Dower wrote:
That said, I didn't remove the wchar_t cache (though I tried some tricks to avoid it), so it's possible that once that's gone we'll see an avoidable regression here, but on its own this doesn't contribute much.
Could you please test PR 2599 on Windows? It makes PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() not fill the wchar_t cache. If there is a significant regression in any benchmark, or small regressions in several benchmarks, we can consider continuing to fill the cache in these functions and preserving the cache in the future. Otherwise I'll merge the PR as is.

https://github.com/python/cpython/pull/2599
On Tue, 23 Oct 2018 at 00:50, Steve Dower wrote:
On 22Oct2018 1007, Serhiy Storchaka wrote:
On 22.10.18 16:24, Steve Dower wrote:
Yes, that's true. But "should reduce ... footprint" is also an optimisation that deserves a benchmark by that standard. Also, I'm proposing keeping the 'kind' as UCS-2 when the string is created from UCS-2 data that is likely to be used as UCS-2. We would not create the UCS-1 version in this case, so it's not the same as prefilling the cache, but it would cost a bit of memory in exchange for CPU. If slicing and concatenation between matching kinds also preserved the kind, a lot of path handling code could avoid back-and-forth conversions.
Oh, I'm afraid this will complicate the whole code of unicodeobject.c (and several other files) very much and can introduce a lot of subtle bugs. For example, when you search for a UCS2 string in a UCS1 string, the current code returns the result fast, because a UCS1 string can't contain codes > 0xff, and a UCS2 string should contain codes > 0xff. And there are many such assumptions.
That doesn't change though, as we're only ever expanding the range. So searching a UCS2 string in a UCS2 string that doesn't contain any actual UCS2 characters is the only case that would be affected, and whether that case occurs more than the UCS2->UCS1->UCS2 conversion case is something we can measure (but I'd be surprised if substring searches occur more frequently than OS conversions).
Currently, unicode_compare_eq exits early when the kinds do not match, and that would be a problem (but is also easily fixable). But other string operations already handle mismatched kinds.
If you did allow for denormalised UCS-2 strings, you'd probably want some kind of flag on the instance to indicate that the real kind was 8-bit.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hi Serhiy,
+1 to remove wchar_t cache. IMHO it wastes memory for no real performance gain.
By the way, can we start to schedule the *removal* of the Py_UNICODE API? For example, decide when Py_DEPRECATED is used in the C API? Should we start to deprecate when Python 2 reaches its end of life? Or can we *remove* this API as soon as Python 2 is dead? (Please, don't use "Python 4" as a milestone to introduce such a backward incompatible change.)
On Sat, Oct 20, 2018 at 15:05, Stefan Behnel wrote:
I'd be happy to get rid of it. But regarding the use under Windows, I wonder if there's interest in keeping it as a special Windows-only feature, e.g. to speed up the data exchange with the Win32 APIs. I guess it would have to provide a visible (performance?) advantage to justify such special casing over the code removal.
If someone wants to *keep* the cache, I would like to see:

* statistics on the cache usage: how many strings are converted to wchar_t*? What is the percentage of all strings?
* what is the cache hit rate?
* what is the overhead of removing the cache (on lengths 10, 100, 1000, 1 MB, 10 MB)?

When you look at the performance of the str type, it's common that timings are smaller than 1 us. Each str function has been highly optimized. IMHO the conversion of a string to wchar_t* is cheap, especially because I expect that all strings used with the Windows API are shorter than 1,000 characters.

Victor
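Gathering such statistics would need hit/miss counters around the cache in C; here is a hypothetical Python-level model of that accounting (the class name and the utf-16-le stand-in for the wchar_t conversion are invented for illustration):

```python
class CountingWideCharCache:
    """Hypothetical model of a conversion cache with hit/miss stats."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self._cache = {}

    def as_wide_char(self, s):
        if s in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            # utf-16-le is only a stand-in for the real wchar_t encoding.
            self._cache[s] = s.encode("utf-16-le")
        return self._cache[s]

cache = CountingWideCharCache()
cache.as_wide_char("spam")
cache.as_wide_char("spam")
cache.as_wide_char("eggs")
print(cache.hits, cache.misses)  # 1 2
```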
On 22.10.18 11:09, Victor Stinner wrote:
+1 to remove wchar_t cache. IMHO it wastes memory for no real performance gain.
By the way, can we start to schedule the *removal* of the Py_UNICODE API? For example, decide when Py_DEPRECATED is used in the C API? Should we start to deprecate when Python 2 reaches its end of life? Or can we *remove* this API as soon as Python 2 is dead? (Please, don't use "Python 4" as a milestone to introduce such a backward incompatible change.)
Such removal is scheduled for 4.0. But since currently there are no plans for Python 4, I think we should schedule some concrete date. Removing the C API is a major breakage, and we should prepare it carefully.

1. Make Py_DEPRECATED work on Windows. [1] Unfortunately this requires breaking its interface: it needs to be inserted before the function declaration, not after it.

[1] https://bugs.python.org/issue33407

2. Add Py_DEPRECATED to the rest of the Py_UNICODE based C API (PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), etc).

3. Convert some macros like PyUnicode_AS_UNICODE() to functions, so that compile-time warnings can be used.

4. It would be nice to make the "u" format unit in PyArg_ParseTuple() and similar functions produce a compile-time warning. I don't know if this is possible, but compilers are able to check format strings for printf-like functions.

5. Add a configuration option for removing the cache at compile time. This will help us find all C API that depends on it.

6. Add tests for all C API (I'm working on this).

... two or more releases later ...

7. Make all deprecated C API functions emit a DeprecationWarning at runtime.

... two or more releases later ...

8. Make all deprecated C API functions raise an error and remove their declarations from the headers. Remove the wchar_t cache and legacy representations. Make PyUnicode_Ready() a no-op.

So if we implement items 1-6 in 3.8, we could get rid of this legacy in 3.12.
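The proposed runtime DeprecationWarning stage would surface to callers like the usual warnings machinery; a sketch with a hypothetical Python-level stand-in for a deprecated C entry point:

```python
import warnings

def legacy_as_unicode(s):
    # Hypothetical stand-in for a deprecated C API entry point that
    # warns at runtime before doing the old conversion.
    warnings.warn("Py_UNICODE based API is deprecated",
                  DeprecationWarning, stacklevel=2)
    return s

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_as_unicode("spam")

print(len(caught), caught[0].category.__name__)  # 1 DeprecationWarning
```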
On 20.10.18 16:01, Stefan Behnel wrote:
But regarding the use under Windows, I wonder if there's interest in keeping it as a special Windows-only feature, e.g. to speed up the data exchange with the Win32 APIs. I guess it would have to provide a visible (performance?) advantage to justify such special casing over the code removal.
This is an interesting question, and we should find the answer right now. Should PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() continue to attach the wchar_t representation to the unicode object on Windows? The cost is higher memory consumption and a slower first call. The benefit is faster subsequent calls for the same object. Although they still need to copy the content, so the time is still O(N), just with a smaller constant multiplier.
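The O(N) copy can be seen from Python via ctypes, which performs the same str-to-wchar_t conversion without depending on any per-object cache (the buffer's size reflects the platform's wchar_t width, 2 bytes on Windows and typically 4 elsewhere):

```python
import ctypes

s = "hello"
# create_unicode_buffer() allocates a wchar_t array and copies the
# string into it: an O(len(s)) conversion with no per-object cache.
buf = ctypes.create_unicode_buffer(s)

print(buf.value == s)  # True
# len(s) characters plus a NUL terminator, each sizeof(wchar_t) bytes.
print(ctypes.sizeof(buf) == (len(s) + 1) * ctypes.sizeof(ctypes.c_wchar))  # True
```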
participants (6):
- INADA Naoki
- Nick Coghlan
- Serhiy Storchaka
- Stefan Behnel
- Steve Dower
- Victor Stinner