Enhancement request for PyUnicode proxies
I was directed to post this request to the general Python development community, so hopefully this is on topic.

One of the weaknesses of the PyUnicode implementation is that the type is concrete and there is no option for an abstract proxy string to a foreign source. This is an issue for an API like JPype, in which java.lang.Strings are passed back from Java. Ideally these would be a type derived from the Unicode type str, but that requires transferring the memory immediately from Java to Python even when the handle is large and will never be accessed from within Python. For certain operations like XML parsing this can be prohibitive, so instead of returning a str we return a JString. (There is a separate issue that Java method names and Python method names conflict, so direct inheritance creates some problems.)

The JString type can of course be transferred to Python space at any time, as both Python Unicode and Java String objects are immutable. However, the CPython APIs which take strings only accept Unicode type objects, which have a concrete implementation. It is possible to extend strings, but those extensions do not allow for proxying as far as I can tell. Thus there is currently no option to proxy to a string representation in another language. The duck-typed ``__str__`` method is insufficient, as it indicates that an object can become a string, rather than that "this object is effectively a string" for the purposes of the CPython API.

One way to address this would be to use the currently outdated concept of READY to extend Unicode objects to other languages. A class like JString would be an unready Unicode object; when READY is called it transfers the memory from Java, sets up the flags, and sets up a pointer to the code point representation. Unfortunately the READY concept is scheduled for removal, and thus the chance to address the need for proxying a Unicode to another language's representation may be limited.

There may be other methods to accomplish this without the concept of READY. So long as access to the code points goes through the Unicode API, and the Unicode object can be extended such that the actual code points may be located outside of the Unicode object, then a proxy can still be achieved if there are hooks to decide when a transfer should be performed. Generally the transfer only needs to happen once, but the key issue is that neither the number of code points nor the kind of code points will be known until the memory is transferred.

Java has much the same problem. Although it defines an interface (java.lang.CharSequence), the concrete java.lang.String class is what almost all API methods take, even when the base interface would have been adequate. Thus, just as Python has difficulty treating a foreign string class as it would a native one, Java cannot treat a Python string as a native one either. So Python strings get represented as a CharSequence type, which greatly limits their use.

Summary:

* A string proxy would need the address of the memory in the "wstr" slot, though the code points may be char[], wchar[] or int[] depending on the representation in the proxy.
* API calls that interpret the data would need to check whether the data has been transferred first; if not, they would call the proxy-dependent transfer method, which is responsible for creating a block of code points and setting up the flags (kind, ascii, ready, and compact).
* The memory block allocated would need to call the proxy-dependent destructor to clean up when the string is done.
* It is not clear what impact this would have on performance. Python already has the concept of a string which needs actions before it can be accessed, but this is scheduled for removal.

Are there currently any plans to address the concept of a proxy string in the PyUnicode API?
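[Editor's illustration] The gap described above can be sketched at the Python level. Everything here is hypothetical: JStringProxy and the fetch callback stand in for the real JNI machinery, but the failure mode at the end is real Python behaviour.

```python
# Hypothetical sketch of the lazy proxy pattern described above.
# The callable stands in for the JNI transfer from Java; a plain
# Python lambda keeps the example self-contained.
class JStringProxy:
    """Holds a foreign handle; materializes a str only on demand."""

    def __init__(self, fetch_from_java):
        self._fetch = fetch_from_java
        self._value = None          # code points not yet transferred

    def __str__(self):
        # The transfer happens at most once; both sides are immutable.
        if self._value is None:
            self._value = self._fetch()
        return self._value


proxy = JStringProxy(lambda: "payload from Java")
assert proxy._value is None                     # no transfer yet
assert str(proxy) == "payload from Java"        # explicit conversion works
# ...but APIs that require a real str still reject the proxy,
# which is exactly the "can become a string" vs. "is a string" gap:
try:
    "prefix".startswith(proxy)
except TypeError:
    print("startswith() requires a real str")
```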
On 25 Dec 2020, at 23:03, Nelson, Karl E. via Python-Dev wrote:
I have a similar problem in PyObjC, which proxies Objective-C classes to Python (and the other way around). For interop with Python code I proxy Objective-C strings using a subclass of str() that is eagerly populated, even if, as you mention as well, a lot of these proxy objects are never used in a context where the str() representation is important. A complicating factor for me is that Objective-C strings are, in general, mutable, which can lead to interesting behaviour.

Another disadvantage of subclassing str() for foreign string types is that this removes the proxy class from its logical location in the class hierarchy (in my case the proxy type is not a subclass of the proxy type for NSObject, even though all Objective-C classes inherit from NSObject).

I primarily chose to subclass the str type because that enables using the NSString proxy type with C functions/methods that expect a string argument. That might be something that can be achieved using a new protocol, similar to operator.index or os.fspath. A complicating factor here is that there's a significant amount of Python code that explicitly tests for the str type, to exclude strings from code paths that iterate over containers.

Ronald
—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
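[Editor's illustration] The eager population is forced by str's construction semantics: a str subclass must supply its full contents at __new__ time, so the foreign copy cannot be deferred. A minimal sketch (EagerNSString and the fetch callback are invented names standing in for the PyObjC machinery):

```python
class EagerNSString(str):
    """Sketch of a str subclass wrapping a foreign string: the value
    must exist before the object does, so the copy from the foreign
    runtime always happens up front, used or not."""

    def __new__(cls, fetch):
        # str is immutable; its payload is fixed here, eagerly.
        value = fetch()
        self = super().__new__(cls, value)
        self.transferred = True     # there was never a lazy window
        return self


calls = []
s = EagerNSString(lambda: calls.append("copied") or "Hello")
assert calls == ["copied"]   # transfer happened at construction time
assert isinstance(s, str)    # the payoff: C APIs accept it everywhere
```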
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BDJAQDPQ...
Code of Conduct: http://python.org/psf/codeofconduct/
On 26/12/2020 10:52, Ronald Oussoren via Python-Dev wrote:
Just to add another use case...

PyQt (the Python bindings for Qt) has a similar issue. Qt implements unicode strings as a QString class which uses UTF-16 as the "native" representation. Currently PyQt converts between Python unicode objects and QString instances as and when required. While this might sound inefficient, I've never had a report saying that this was actually a problem in a particular situation - but it would be nice to avoid it if possible.

It's worth comparing the situation with byte arrays. There is no problem of translating different representations of an element, but there is still the issue of who owns the memory. The Python buffer protocol usually solves this problem, so something similar for unicode "arrays" might suffice.

Phil
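[Editor's illustration] The byte-array analogy can be seen with the existing buffer protocol: a memoryview borrows a producer's memory without copying, and ownership is tracked so the producer cannot pull the memory out from under the view.

```python
# The buffer protocol lets a consumer borrow memory it does not own.
data = bytearray(b"shared payload")

view = memoryview(data)      # zero-copy: no bytes are duplicated
assert view[0:6].tobytes() == b"shared"

# The owner's memory really is shared, not copied:
data[0:6] = b"SHARED"
assert view[0:6].tobytes() == b"SHARED"

# Ownership is tracked: a bytearray refuses to resize while a view
# is outstanding, which is the "who owns the memory" question above.
try:
    data.append(33)
except BufferError:
    print("resize blocked while a buffer is exported")
view.release()
data.append(33)              # fine once the view is released
```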
On Sat, Dec 26, 2020 at 3:54 AM Phil Thompson via Python-Dev <python-dev@python.org> wrote:
Exactly my thought on the matter. I have no doubt that between all of us we could design a decent protocol.

The practical problem would be to convince enough people that this is worth doing to actually get the code changed (str being one of the most popular data types traveling across C API boundaries), in the CPython core (which surely has a lot of places to modify) as well as in the vast collection of affected 3rd party modules. Like many migrations it's an endless slog for the developers involved, and in open source it's hard to assign resources for such a project.

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?) http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...
On 26 Dec 2020, at 18:43, Guido van Rossum wrote:
That's a problem indeed. An 80% solution could be reached by teaching PyArg_Parse* about the new protocol; it already uses the buffer protocol for bytes-like objects and could be taught about a variant of the protocol for strings. That would require that the implementation of that new variant returns a pointer in the Py_buffer that can be used after the view is released, but that's already a restriction for the use of new-style buffers in the PyArg_Parse* APIs.

That wouldn't be a solution for code using the PyUnicode_* APIs of course, nor Python code explicitly checking for the str type.

In the end a new string "kind" (next to the 1, 2 and 4 byte variants) where callbacks are used to provide data might be the most pragmatic. That will still break code peeking directly into the PyUnicodeObject struct, but anyone doing that should know that that is not a stable API.

Ronald
—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
On Sun, Dec 27, 2020 at 3:19 AM Ronald Oussoren wrote:
That's an attractive idea. I've personally never had to peek inside the implementation, and I suspect there's not that much code that does so (even in the CPython code base itself, outside the PyUnicode implementation of course).

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?) http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...
On 2020-12-27 19:15, Guido van Rossum wrote:
The re module does it extensively for speed reasons.
On Sun, Dec 27, 2020 at 8:20 PM Ronald Oussoren via Python-Dev wrote:
I had a similar idea for lazy loading or lazy decoding of Unicode objects. But I rejected the idea and proposed to deprecate PyUnicode_READY() because of the balance between merits and complexity:

* Simplifying the Unicode object may introduce more room for optimization, because Unicode is the essential type for Python. Since Python is a dynamic language, a huge amount of str comparison happens at runtime compared with static languages like Java and Rust.
* Third parties may forget to check PyErr_Occurred() after APIs like PyUnicode_Contains or PyUnicode_Compare when the author knows all operands are exact Unicode type.

Additionally, if we introduce a customizable lazy str object, it's very easy to release the GIL during basic Unicode operations. Many third parties may assume PyUnicode_Compare doesn't release the GIL if both operands are Unicode objects. It will produce bugs that are hard to find and reproduce.

So I'm +1 to make Unicode simple by removing PyUnicode_READY(), and -1 to make Unicode complicated by adding a customizable callback for lazy population.

Anyway, I am OK to un-deprecate PyUnicode_READY() and make it a no-op macro since Python 3.12. But I don't know how many third parties use it properly, because legacy Unicode objects are very rare already.

Regards,
--
Inada Naoki
Rather than a full-blown buffer-protocol-like thing, could we get by with something simpler? How about just having a flag in the unicode object indicating that it doesn't own the memory that it points to?

-- Greg
On 28 Dec 2020, at 03:58, Greg Ewing wrote:
I don't know about the OP, but for me that wouldn't be good enough, as I'd still have to copy the string value because of the semantics of ObjC strings.

Ronald
—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
On 28/12/2020 02:07, Inada Naoki wrote:
> Additionally, if we introduce the customizable lazy str object, it's very easy to release GIL during basic Unicode operations. Many third parties may assume PyUnicode_Compare doesn't release GIL if both operands are Unicode objects. It will produce bugs hard to find and reproduce.
I would have no problem with the protocol stating that the GIL must not be released by "foreign" unicode implementations.
> So I'm +1 to make Unicode simple by removing PyUnicode_READY(), and -1 to make Unicode complicated by adding customizable callback for lazy population.
For me lazy population might not be enough (as I'm not sure precisely what you mean by it). I would like to be able to use my foreign unicode thing as the storage.

For example (where text() returns a unicode object with a foreign kind)...

some_text = an_editor.text()
more_text = another_editor.text()

if some_text == more_text:
    print("The text is the same")

...would not involve any conversions at all. The following would require a conversion...

if some_text == "literal text":

Phil
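[Editor's illustration] The intent can be mocked up at the Python level. Everything here is hypothetical: foreign_equals stands in for a comparison done entirely inside the foreign library, and a dict plays the role of the foreign (Qt/Java) heap.

```python
# Hypothetical sketch: comparing two foreign strings without
# materializing either one as a Python str.
FOREIGN_HEAP = {1: "The text", 2: "The text"}   # stands in for foreign memory

def foreign_equals(h1, h2):
    # In a real binding this would run inside the foreign library.
    return FOREIGN_HEAP[h1] == FOREIGN_HEAP[h2]

class ForeignText:
    def __init__(self, handle):
        self._handle = handle
        self.converted = False      # tracks whether a copy was ever made

    def __eq__(self, other):
        if isinstance(other, ForeignText):
            # Foreign-to-foreign: no conversion on either side.
            return foreign_equals(self._handle, other._handle)
        # Comparison against a Python str forces the conversion.
        self.converted = True
        return FOREIGN_HEAP[self._handle] == other

some_text = ForeignText(1)
more_text = ForeignText(2)
assert some_text == more_text        # compared without any conversion
assert not some_text.converted
assert some_text == "The text"       # the literal comparison converts
assert some_text.converted
```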
On Mon, Dec 28, 2020 at 7:22 PM Phil Thompson wrote:
So you mean a custom internal representation of an exact Unicode object? Then I am a stronger -1, sorry. I cannot believe its merits are bigger than the costs of its complexity. If a third party wants to use a completely different internal representation, it must not be a unicode object at all.

Regards,
--
Inada Naoki
On 28/12/2020 11:27, Inada Naoki wrote:
I would have thought that an object was defined by its behaviour rather than by any particular implementation detail. However, I completely understand the desire to avoid additional complexity in the implementation.

Phil
On Mon, Dec 28, 2020 at 8:52 PM Phil Thompson wrote:
As I understand it, the policy "an object is defined by its
behavior..." doesn't mean "put an unlimited amount of implementation
behind one concrete type."
The policy means APIs shouldn't limit input to one concrete type
without a reason. In other words, duck typing and structural subtyping
are good.
For example, we can try making io.TextIOWrapper accept not only
Unicode objects (including subclasses) but any object implementing some
protocol.
We already have __index__ for integers and the buffer protocol for
bytes-like objects. Those are examples of the policy.
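A small sketch of the duck-typing pattern described here, using the existing __index__ protocol (the NativeHandle class is invented for illustration):

```python
class NativeHandle:
    # Hypothetical wrapper around an integer handle from a foreign runtime.
    def __init__(self, value):
        self._value = value

    def __index__(self):
        # Any object with __index__ is accepted wherever CPython needs an
        # integer index, without subclassing int.
        return self._value

data = ["a", "b", "c", "d"]
h = NativeHandle(2)
print(data[h])    # list indexing accepts the duck-typed integer
print(hex(h))     # so do hex(), range(), slicing, ...
```

A string-like protocol would extend the same idea to text: the API accepts anything providing the protocol, not only the concrete type.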
Regards,
--
Inada Naoki
On 28 Dec 2020, at 14:00, Inada Naoki
wrote: On Mon, Dec 28, 2020 at 8:52 PM Phil Thompson
wrote: I would have thought that an object was defined by its behaviour rather than by any particular implementation detail.
As I understand it, the policy "an object is defined by its behavior..." doesn't mean "put an unlimited amount of implementation behind one concrete type." The policy means APIs shouldn't limit input to one concrete type without a reason. In other words, duck typing and structural subtyping are good.
For example, we can try making io.TextIOWrapper accept not only Unicode objects (including subclasses) but any object implementing some protocol. We already have __index__ for integers and the buffer protocol for bytes-like objects. Those are examples of the policy.
I agree that that would be the cleanest approach, although I worry about how long it will take until third-party code is converted to the new protocol. That's why I wrote earlier that adding this feature to PyUnicode_Type is the most pragmatic solution ;-)

There are two clear options for a new protocol:

1. Add something similar to __index__ or __fspath__, but for "string-like" objects
2. Add an extension to the buffer protocol

In either case an ABC for string-like objects would also be nice, to be able to opt in to the fairly common pattern of excluding strings from types that can be iterated over, that is:

    if isinstance(value, collections.abc.Iterable) and not isinstance(value, str):
        for item in value:
            process_item(item)
    else:
        process_item(value)

Ronald

—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
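The ABC-based opt-in mentioned above could look roughly like this; StringLike, JStringStub, and process_all are invented names for illustration, not a proposed API:

```python
from abc import ABC

class StringLike(ABC):
    """Hypothetical ABC: registering a type here declares 'this object is
    effectively a string' without requiring a str subclass."""

StringLike.register(str)

class JStringStub:
    # Stand-in for a foreign string proxy (e.g. a JPype JString).
    def __str__(self):
        return "hello"

StringLike.register(JStringStub)

def process_all(value):
    # The common 'iterable, but not a string' pattern from above.
    if isinstance(value, StringLike):
        return [str(value)]
    return list(value)

print(process_all("abc"))          # treated as one item, not iterated
print(process_all(JStringStub()))  # a foreign proxy is treated the same way
print(process_all([1, 2, 3]))      # real iterables are expanded
```

Virtual subclassing via ABC.register lets foreign proxy types opt in without touching their inheritance hierarchy, which sidesteps the method-name clashes that direct inheritance from str can cause.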
On Mon, 28 Dec 2020 14:27:00 +0100
Ronald Oussoren via Python-Dev
On 28 Dec 2020, at 14:00, Inada Naoki
wrote: On Mon, Dec 28, 2020 at 8:52 PM Phil Thompson
wrote: I would have thought that an object was defined by its behaviour rather than by any particular implementation detail.
As I understand it, the policy "an object is defined by its behavior..." doesn't mean "put an unlimited amount of implementation behind one concrete type." The policy means APIs shouldn't limit input to one concrete type without a reason. In other words, duck typing and structural subtyping are good.
For example, we can try making io.TextIOWrapper accept not only Unicode objects (including subclasses) but any object implementing some protocol. We already have __index__ for integers and the buffer protocol for bytes-like objects. Those are examples of the policy.
I agree that that would be the cleanest approach, although I worry about how long it will take until third-party code is converted to the new protocol. That's why I wrote earlier that adding this feature to PyUnicode_Type is the most pragmatic solution ;-)
But the "pragmatic" solution will make a performance-critical type (PyUnicode) more complicated and therefore potentially larger/slower. I think Inada's concerns are valid here.
There are two clear options for a new protocol:
1. Add something similar to __index__ or __fspath__, but for "string-like" objects
2. Add an extension to the buffer protocol
The third option is to add a distinct "string view" protocol. There are peculiarities (such as the fact that different objects may have different internal representations - some utf8, some utf16...) that make the buffer protocol suboptimal for this.

Also, we probably don't want unicode-like objects to start being usable in contexts where a buffer-like object is required (such as writing to a binary file, or zlib-compressing a bunch of bytes).

Regards,
Antoine
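A toy Python model of what such a "string view" protocol could carry; the __string_view__ name, the StringView tuple, and the Utf16Backed class are all invented for illustration:

```python
from typing import NamedTuple

class StringView(NamedTuple):
    encoding: str   # which code units the object stores, e.g. 'utf-16-le'
    data: bytes     # the raw code units in that encoding

class Utf16Backed:
    # Stand-in for a foreign string stored as UTF-16
    # (HSTRING, java.lang.String, ...).
    def __init__(self, text):
        self._raw = text.encode("utf-16-le")

    def __string_view__(self):
        return StringView("utf-16-le", self._raw)

def to_str(obj):
    # A consumer decodes whatever representation the provider exports,
    # instead of forcing one canonical layout as the buffer protocol would.
    if isinstance(obj, str):
        return obj
    view = obj.__string_view__()
    return view.data.decode(view.encoding)

print(to_str(Utf16Backed("héllo")))
```

Because the protocol is distinct from the buffer protocol, a Utf16Backed object would not accidentally be accepted where bytes-like data is required.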
On 12/29/2020 5:23 PM, Antoine Pitrou wrote:
The third option is to add a distinct "string view" protocol. There are peculiarities (such as the fact that different objects may have different internal representations - some utf8, some utf16...) that make the buffer protocol suboptimal for this.
Also, we probably don't want unicode-like objects to start being usable in contexts where a buffer-like object is required (such as writing to a binary file, or zlib-compressing a bunch of bytes).
I've had to deal with this problem in the past as well (WinRT HSTRINGs), and this is the approach that would seem to make the most sense to me.

Basically, reintroduce PyString_* APIs as an _abstract_ interface to str-like objects. So the first line of every single one can be PyUnicode_Check() followed by calling the _concrete_ PyUnicode_* implementation. And then we develop additional type slots or whatever is necessary for someone to build an equivalent native object.

Most "is this a str" checks can become PyString_Check, provided all the APIs used against the object are abstract (PyObject_* or PyString_*). Those that are going to mess with internals will have to get special treatment.

I don't want to make it all sound too easy, because it probably won't be. But it should be possible to add a viable proxy layer as a set of abstract C APIs to use instead of the concrete ones.

Cheers,
Steve
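In Python terms, this abstract-vs-concrete split might be modelled as below; the function names mirror the hypothetical PyString_* layer, and the __as_str__ slot is invented, not a real CPython API:

```python
def pystring_check(obj):
    # Abstract check: a real str, or any type providing the hypothetical slot.
    return isinstance(obj, str) or hasattr(type(obj), "__as_str__")

def pystring_as_str(obj):
    if isinstance(obj, str):
        return obj                    # concrete fast path, no conversion
    return type(obj).__as_str__(obj)  # slot supplied by the proxy type

class JStringProxy:
    # Stand-in for a proxy holding a handle into a foreign runtime.
    def __init__(self, handle):
        self._handle = handle

    def __as_str__(self):
        # A real implementation would copy characters out of the JVM here.
        return f"java:{self._handle}"

print(pystring_check("abc"), pystring_check(JStringProxy(1)), pystring_check(42))
print(pystring_as_str(JStringProxy(7)))
```

The fast path keeps exact str callers as cheap as today; only proxy objects pay for the slot dispatch and the eventual conversion.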
Do you want to be a champion for this development? Or does anyone else want
to volunteer?
On Mon, Jan 4, 2021 at 8:54 AM Steve Dower
On 12/29/2020 5:23 PM, Antoine Pitrou wrote:
The third option is to add a distinct "string view" protocol. There are peculiarities (such as the fact that different objects may have different internal representations - some utf8, some utf16...) that make the buffer protocol suboptimal for this.
Also, we probably don't want unicode-like objects to start being usable in contexts where a buffer-like object is required (such as writing to a binary file, or zlib-compressing a bunch of bytes).
I've had to deal with this problem in the past as well (WinRT HSTRINGs), and this is the approach that would seem to make the most sense to me.
Basically, reintroduce PyString_* APIs as an _abstract_ interface to str-like objects.
So the first line of every single one can be PyUnicode_Check() followed by calling the _concrete_ PyUnicode_* implementation. And then we develop additional type slots or whatever is necessary for someone to build an equivalent native object.
Most "is this a str" checks can become PyString_Check, provided all the APIs used against the object are abstract (PyObject_* or PyString_*). Those that are going to mess with internals will have to get special treatment.
I don't want to make it all sound too easy, because it probably won't be. But it should be possible to add a viable proxy layer as a set of abstract C APIs to use instead of the concrete ones.
Cheers,
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TC3BZJX4...
Code of Conduct: http://python.org/psf/codeofconduct/
--
--Guido van Rossum (python.org/~guido)
I would like to second Steve's suggestion.
The requirements for JPype for this to work are pretty minimal: a bit flag for "string-like" that is checked by PyString_Check, followed by a call to PyObject_Str() which would be guaranteed to return a concrete Unicode object that is then used throughout the function call. This would not require any additional slots. Unfortunately, this doesn't match the other patterns in Python: if an object passes PyString_Check, why would one need to call a conversion function to get the actual string? It could be as simple as a macro PyString_ToUnicode() that calls PyUnicode_Check and, if it passes, creates a new reference, and otherwise returns PyObject_Str(). It is then just a small matter for JString, ObjCStr, WinHTString, etc. to set this bit flag when the type is created. The downside is that we end up with an extra reference/dereference in string-using functions, but given the ownership concerns of a buffer-like protocol this is really the minimum required.
This does not deal with the ObjC requirement though, as unlike Java, ObjC has mutable strings. There are a number of parts of the Python API where the string is consumed immediately, where immutable and mutable strings do not matter. But others, like hashing or dictionary keys, require immutability. So perhaps there also needs to be a PyString_IsImmutable() so that we can prevent accidental usage of a mutable string.
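The flag-plus-conversion idea from the last two paragraphs, modelled in Python; the flag names and the as_concrete_str helper are hypothetical stand-ins for the proposed bit flag and PyString_ToUnicode():

```python
def as_concrete_str(obj):
    # Model of the proposed PyString_ToUnicode(): pass real str through,
    # convert flagged immutable string-like types exactly once,
    # and reject everything else.
    if isinstance(obj, str):
        return obj
    tp = type(obj)
    if getattr(tp, "_is_string_like", False):
        if not getattr(tp, "_is_immutable", False):
            raise TypeError("mutable string-like objects cannot be used here")
        return str(obj)   # the one transfer from the foreign runtime
    raise TypeError(f"expected a string-like object, got {tp.__name__}")

class JString:
    _is_string_like = True   # the 'bit flag' set when the type is created
    _is_immutable = True     # Java strings are immutable

    def __init__(self, handle):
        self._handle = handle

    def __str__(self):
        # A real JString would copy the characters out of the JVM here.
        return f"contents-of-{self._handle}"

print(as_concrete_str("abc"))
print(as_concrete_str(JString("h1")))
```

The immutability flag captures the ObjC concern: a mutable foreign string is rejected before it can be used as, say, a dictionary key.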
I would be happy to help with this effort, but I am in the unfortunate position that the legal department at my employer (DOE/LLNL) has objected to some clause in the PSF Contributor Agreement, thus prohibiting me from signing it. We also have a policy that prohibits open source contributions to projects that require signing agreements without laboratory legal sign-off, so I am in a bind until I can deal with their concerns.
--Karl
-----Original Message-----
From: Steve Dower
The third option is to add a distinct "string view" protocol. There are peculiarities (such as the fact that different objects may have different internal representations - some utf8, some utf16...) that make the buffer protocol suboptimal for this.
Also, we probably don't want unicode-like objects to start being usable in contexts where a buffer-like object is required (such as writing to a binary file, or zlib-compressing a bunch of bytes).
I've had to deal with this problem in the past as well (WinRT HSTRINGs), and this is the approach that would seem to make the most sense to me.

Basically, reintroduce PyString_* APIs as an _abstract_ interface to str-like objects. So the first line of every single one can be PyUnicode_Check() followed by calling the _concrete_ PyUnicode_* implementation. And then we develop additional type slots or whatever is necessary for someone to build an equivalent native object.

Most "is this a str" checks can become PyString_Check, provided all the APIs used against the object are abstract (PyObject_* or PyString_*). Those that are going to mess with internals will have to get special treatment.

I don't want to make it all sound too easy, because it probably won't be. But it should be possible to add a viable proxy layer as a set of abstract C APIs to use instead of the concrete ones.

Cheers,
Steve
On Mon, 28 Dec 2020 11:07:46 +0900
Inada Naoki
Additionally, if we introduce the customizable lazy str object, it's very easy to release the GIL during basic Unicode operations. Many third parties may assume PyUnicode_Compare doesn't release the GIL if both operands are Unicode objects.
1) You have to prove such "many third parties" exist. I've written my share of C extension code and I don't remember assuming that PyUnicode_Compare doesn't release the GIL.

2) Even if there is such third party code, it is clearly making assumptions about undocumented implementation details. It is therefore ok to break it in new versions of CPython.

However, I agree that having to call PyUnicode_READY() before calling C unicode APIs is probably an obscure detail that few people remember about.

Regards,
Antoine
On Mon, Dec 28, 2020 at 10:53 PM Antoine Pitrou
On Mon, 28 Dec 2020 11:07:46 +0900 Inada Naoki
wrote: Additionally, if we introduce the customizable lazy str object, it's very easy to release the GIL during basic Unicode operations. Many third parties may assume PyUnicode_Compare doesn't release the GIL if both operands are Unicode objects.
1) You have to prove such "many third parties" exist. I've written my share of C extension code and I don't remember assuming that PyUnicode_Compare doesn't release the GIL.
It is my fault that I said "many", but I was just pointing out a possible backward incompatibility. Why do I have to prove it?
2) Even if there is such third party code, it is clearly making assumptions about undocumented implementation details. It is therefore ok to break it in new versions of CPython.
But it should be considered carefully, because these APIs have not released the GIL for a long time. And this type of change does not cause just a simple crash, but very rare undefined behaviors in multithreaded complex applications. For example, borrowed references in the caller can be changed to other objects of the same size because memory blocks are reused. This is very difficult to notice and reproduce.
However, I agree that having to call PyUnicode_READY() before calling C unicode APIs is probably an obscure detail that few people remember about.
If we provide a custom callback and call it in PyUnicode_READY(), many
Unicode APIs using PyUnicode_READY() will change from APIs with
predictable behavior to APIs that "may run arbitrary code". That is an
obscure detail too.
Regards,
--
Inada Naoki
On Tue, 29 Dec 2020 02:20:45 +0900
Inada Naoki
On Mon, Dec 28, 2020 at 10:53 PM Antoine Pitrou
wrote: On Mon, 28 Dec 2020 11:07:46 +0900 Inada Naoki
wrote: Additionally, if we introduce the customizable lazy str object, it's very easy to release the GIL during basic Unicode operations. Many third parties may assume PyUnicode_Compare doesn't release the GIL if both operands are Unicode objects.
1) You have to prove such "many third parties" exist. I've written my share of C extension code and I don't remember assuming that PyUnicode_Compare doesn't release the GIL.
It is my fault that I said "many", but I was just pointing out a possible backward incompatibility. Why do I have to prove it?
Because most C extension code is far from that level of micro-optimization, I doubt you'll find much code that deliberately relies on such an obscure implementation detail.
2) Even if there is such third party code, it is clearly making assumptions about undocumented implementation details. It is therefore ok to break it in new versions of CPython.
But it should be considered carefully, because these APIs have not released the GIL for a long time. And this type of change does not cause just a simple crash, but very rare undefined behaviors in multithreaded complex applications. For example, borrowed references in the caller can be changed to other objects of the same size because memory blocks are reused. This is very difficult to notice and reproduce.
Agreed, but that's a general problem with the C API (the existence of borrowed references and the fact that most C API calls can silently release the GIL, even as a side effect of object (de)allocation). It's also why it's better for most use cases to use something like Cython.

Regards,
Antoine
participants (9)
- Antoine Pitrou
- Greg Ewing
- Guido van Rossum
- Inada Naoki
- MRAB
- Nelson, Karl E.
- Phil Thompson
- Ronald Oussoren
- Steve Dower