On 25 Dec 2020, at 23:03, Nelson, Karl E. via Python-Dev <python-dev@python.org> wrote:

I was directed to post this request to the general Python development community so hopefully this is on topic.
 
One of the weaknesses of the PyUnicode implementation is that the type is concrete and there is no option for an abstract proxy string to a foreign source.  This is an issue for an API like JPype, in which java.lang.String objects are passed back from Java.  Ideally these would be a type derived from the Unicode type str, but that requires transferring the memory immediately from Java to Python even when the string behind that handle is large and will never be accessed from within Python.  For certain operations like XML parsing this can be prohibitive, so instead of returning a str we return a JString.  (There is a separate issue that Java method names and Python method names conflict, so direct inheritance creates some problems.)
 
The JString type can of course be transferred to Python space at any time, as both Python Unicode and Java string objects are immutable.  However, the CPython API functions that take strings only accept Unicode type objects, which have a concrete implementation.  It is possible to extend strings, but those extensions do not allow for proxying as far as I can tell.  Thus there is currently no option to proxy to a string representation in another language.  The concept of using the duck-type ``__str__`` method is insufficient, as this indicates that an object can become a string, rather than “this object is effectively a string” for the purposes of the CPython API.
 
One way to address this is to use the currently outdated concept of READY to extend Unicode objects to other languages.  A class like JString would be an unready Unicode object which, when READY is called, transfers the memory from Java, sets up the flags and sets up a pointer to the code point representation.  Unfortunately the READY concept is scheduled for removal, and thus the chance to address the need for proxying a Unicode object to another language's representation may be limited.  There may be other methods to accomplish this without using the concept of READY.  So long as access to the code points goes through the Unicode API, and the Unicode object can be extended such that the actual code points may be located outside of the Unicode object, a proxy can still be achieved if there are hooks in it to decide when a transfer should be performed.  Generally the transfer request only needs to happen once, but the key issue is that neither the number of code points nor the kind of code points will be known until the memory is transferred.
 
Java has much the same problem.  Although it defines an interface “java.lang.CharSequence”, the actual “java.lang.String” class is concrete and almost all API methods take a String rather than the base interface, even when the base interface would have been adequate.  Thus, just as Python has difficulty treating a foreign string class as it would a native one, Java cannot treat a Python string as a native one either.  So Python strings get represented as the CharSequence type, which greatly limits their use.
 
Summary:
 
  • A String proxy would need the address of the memory in the “wstr” slot, though the code points may be char[], wchar[] or int[] depending on the representation in the proxy.
  • API calls to interpret the data would need to check first whether the data has been transferred; if not, they would call the proxy-dependent transfer method, which is responsible for creating a block of code points and setting up the flags (kind, ascii, ready, and compact).
  • The memory block allocated would need to call the proxy-dependent destructor to clean up when the string is done.
  • It is not clear whether this would have an impact on performance.  Python already has the concept of a string which needs actions before it can be accessed, but this is scheduled for removal.
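
To make the summary above concrete, here is a rough pure-Python sketch of the transfer-on-first-access idea.  It is illustration only and not a proposal for the actual C-level layout: the class name LazyStringProxy and the fetch/release callables are hypothetical stand-ins for the proxy-dependent transfer method and destructor, and nothing here touches the real PyUnicode structures.

```python
from typing import Callable, Optional


class LazyStringProxy:
    """Hypothetical proxy: holds a handle to a foreign string, copies it once."""

    def __init__(self, fetch: Callable[[], str], release: Callable[[], None]):
        self._fetch = fetch      # assumed proxy-dependent transfer method
        self._release = release  # assumed proxy-dependent destructor
        self._data: Optional[str] = None
        self.transfers = 0       # counts how often the foreign side was read

    def _ready(self) -> str:
        # Transfer on first access only; later calls reuse the cached copy.
        if self._data is None:
            self._data = self._fetch()
            self.transfers += 1
        return self._data

    def __str__(self) -> str:
        return self._ready()

    def __len__(self) -> int:
        # The number of code points is unknown until the transfer has happened.
        return len(self._ready())

    def close(self) -> None:
        # Let the foreign side free its resources once the string is done.
        self._release()


# Usage: nothing is copied until the contents are actually needed.
released = []
proxy = LazyStringProxy(lambda: "hello", lambda: released.append(True))
untouched = proxy.transfers
length = len(proxy)
text = str(proxy)
proxy.close()
```

The point of the sketch is only that the copy and the bookkeeping (length, kind) can be deferred to a single hook; whether the C API can afford that check on every access is the open performance question.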
 
Are there any plans currently to address the concept of a proxy string in PyUnicode API?  

I have a similar problem in PyObjC, which proxies Objective-C classes to Python (and the other way around).  For interop with Python code I proxy Objective-C strings using a subclass of str() that is eagerly populated even if, as you mention as well, a lot of these proxy objects are never used in a context where the str() representation is important.  A complicating factor for me is that Objective-C strings are, in general, mutable, which can lead to interesting behaviour.  Another disadvantage of subclassing str() for foreign string types is that this removes the proxy class from its logical location in the class hierarchy (in my case the proxy type is not a subclass of the proxy type for NSObject, even though all Objective-C classes inherit from NSObject).

I primarily chose to subclass the str type because that enables using the NSString proxy type with C functions/methods that expect a string argument.  That might be something that can be achieved using a new protocol, similar to operator.index or os.fspath.  A complicating factor here is that there’s also a significant amount of Python code that explicitly tests for the str type to exclude strings from code paths that iterate over containers.
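
As a minimal sketch of what such a protocol could look like, modelled on how os.fspath() treats ``__fspath__``: the dunder name ``__string__``, the helper as_string() and the ForeignString class are all hypothetical names made up for this example, not existing or proposed APIs.

```python
def as_string(obj):
    """Hypothetical analogue of os.fspath() for string-like objects."""
    if isinstance(obj, str):
        return obj
    # Like os.fspath(), look the method up on the type, not the instance.
    method = getattr(type(obj), "__string__", None)  # hypothetical dunder
    if method is None:
        raise TypeError(
            f"expected str or string-like object, not {type(obj).__name__}"
        )
    result = method(obj)
    if not isinstance(result, str):
        raise TypeError(
            f"__string__ returned non-str (type {type(result).__name__})"
        )
    return result


class ForeignString:
    """Stand-in for a proxy type such as an NSString or JString wrapper."""

    def __init__(self, value):
        self._value = value

    def __string__(self):
        # A real proxy would perform the foreign-to-Python copy here.
        return self._value


native = as_string("abc")
proxied = as_string(ForeignString("abc"))
```

A C function expecting a string argument would call the equivalent of as_string() on entry, so the proxy type would not need to inherit from str.  This does nothing for the isinstance(x, str) checks mentioned above; those would still treat the proxy as a non-string.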

Ronald


Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/

 
 
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BDJAQDPQMVCLCSB3CEM34VPAY666D3M3/
Code of Conduct: http://python.org/psf/codeofconduct/