Alternative Unicode implementations (NSString/NSMutableString)

Ronald Oussoren came up with a concrete use case for wanting the interpreter to consider something a string, even if it isn't implemented with the default data structure.
In https://mail.python.org/pipermail/python-ideas/2017-July/046407.html he writes:
The reason I need to subclass str: in PyObjC I use a subclass of str to represent Objective-C strings (NSString/NSMutableString), and I need to keep track of the original value; mostly because there are some Objective-C APIs that use object identity. The worst part is that fully initialising the PyUnicodeObject fields often isn’t necessary as a lot of Objective-C strings aren’t used as strings in Python code.
The PyUnicodeObject (via its leading PyASCIIObject member) currently uses 7 flag bits including 2 for kind. Would it be worth adding an 8th bit to indicate that the string is a virtual subclass, and that the internals should not be touched directly? (This would require changing some of the macros; at the time of PEP 393 Martin ruled it YAGNI ... but is this something that might reasonably be reconsidered, if someone did the work? Which I am considering, but not committing to.)
-jJ
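
The flag-bit idea above can be modeled in a few lines. This is a minimal sketch, assuming the field widths listed in the text of PEP 393 (the real CPython header may differ); the `external` bit and the `may_touch_internals` helper are hypothetical names for the proposed 8th bit and do not exist in CPython.

```python
import ctypes


class State(ctypes.Structure):
    """Model of the PyASCIIObject state bits, widths as listed in PEP 393."""

    _fields_ = [
        ("interned", ctypes.c_uint, 2),
        ("kind", ctypes.c_uint, 2),
        ("compact", ctypes.c_uint, 1),
        ("ascii", ctypes.c_uint, 1),
        ("ready", ctypes.c_uint, 1),
        # Hypothetical 8th bit from the proposal; not part of CPython.
        ("external", ctypes.c_uint, 1),
    ]


def may_touch_internals(state):
    """Return True if code may read the character buffer directly."""
    return bool(state.ready) and not state.external


# An ordinary ready string versus a string whose storage lives elsewhere.
ordinary = State(interned=0, kind=1, compact=1, ascii=1, ready=1, external=0)
foreign = State(interned=0, kind=0, compact=0, ascii=0, ready=0, external=1)

print(may_touch_internals(ordinary), may_touch_internals(foreign))
```

Every macro that currently reads the buffer directly would need a check of this sort, which is the macro change mentioned in the message.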

Supporting a new kind of string storage would require a lot of effort. There is a lot of C code specialized for each Unicode kind.
Victor
On 19 Jul 2017 at 12:43 AM, "Jim J. Jewett" <jimjjewett@gmail.com> wrote:
Ronald Oussoren came up with a concrete use case for wanting the interpreter to consider something a string, even if it isn't implemented with the default data structure.
In https://mail.python.org/pipermail/python-ideas/2017-July/046407.html he writes:
The reason I need to subclass str: in PyObjC I use a subclass of str to represent Objective-C strings (NSString/NSMutableString), and I need to keep track of the original value; mostly because there are some Objective-C APIs that use object identity. The worst part is that fully initialising the PyUnicodeObject fields often isn’t necessary as a lot of Objective-C strings aren’t used as strings in Python code.
The PyUnicodeObject (via its leading PyASCIIObject member) currently uses 7 flag bits including 2 for kind. Would it be worth adding an 8th bit to indicate that the string is a virtual subclass, and that the internals should not be touched directly? (This would require changing some of the macros; at the time of PEP 393 Martin ruled it YAGNI ... but is this something that might reasonably be reconsidered, if someone did the work? Which I am considering, but not committing to.)
-jJ
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
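
Victor's remark about C code being specialized for each Unicode kind can be illustrated with a small model. This is a minimal sketch, assuming the PEP 393 rule that a string is stored with 1, 2 or 4 bytes per character depending on its highest code point; `storage_kind` and `pack_fixed_width` are hypothetical helper names, not CPython functions.

```python
import struct

# struct format code for each storage kind (bytes per character).
FORMAT_BY_KIND = {1: "B", 2: "H", 4: "I"}


def storage_kind(text):
    """Return the bytes per character needed to store text: 1, 2 or 4."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def pack_fixed_width(text):
    """Pack text into a fixed-width little-endian buffer of its kind."""
    code = FORMAT_BY_KIND[storage_kind(text)]
    return struct.pack("<%d%s" % (len(text), code), *map(ord, text))


for sample in ("abc", "a\u20ac", "a\U0001F600"):
    print(repr(sample), storage_kind(sample), len(pack_fixed_width(sample)))
```

Each kind needs its own code path for every operation that touches the buffer, so a new storage form would add one more branch to each of those paths.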

On 19 July 2017 at 09:40, Victor Stinner <victor.stinner@gmail.com> wrote:
Supporting a new kind of string storage would require a lot of effort. There is a lot of C code specialized for each Unicode kind.
If I understand the requested flag correctly, it would be to request one of the following:
1. *Never* use any of CPython's fast paths, and instead be permanently slow; or
2. Indicate that it's a "lazily rendered" subclass that should hold off on calling PyUnicode_READY for as long as possible, but still do so when necessary (akin to creating strings via the old Py_UNICODE APIs and then calling PyUnicode_READY on them)
Neither of those is exactly straightforward, but I think it has the potential to tie in well with a Rust concept that Armin Ronacher recently pointed out, which is that in addition to their native String type, they also define a *separate* CString type as part of their C FFI layer: https://doc.rust-lang.org/std/ffi/struct.CString.html
The Rust example does prompt me to ask whether this might be better modeled as a "PlatformString" data type (essentially a str subclass with an extra void * entry for a pointer to the native object), while the operator.index() precedent prompts me to ask whether or not this might be better handled with a "__platformstr__" protocol, but the basic *idea* of having a clearly defined way of modeling platform-native text strings at least somewhat independently of the core Python data types seems reasonable to me.
(If we do go with the "flag bit" option, then it may actually be possible to steal the existing "Py_UNICODE *" pointer at the same time - that way an externally defined string would automatically be handled the same way as any other unready string, and "Py_UNICODE *" would just be a particular example of a platform string type)
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
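
The operator.index() precedent and the "__platformstr__" idea can be put side by side. This is a minimal sketch, assuming a lookup modeled on how `__index__` is resolved; the `platformstr` function and the `NativeText` and `Offset` classes are hypothetical, and no such protocol exists in Python.

```python
import operator


class Offset:
    """Integer-like object accepted wherever an index is required."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def platformstr(obj):
    """Hypothetical text analogue of operator.index()."""
    if isinstance(obj, str):
        return obj
    # Look the method up on the type, as is done for other special methods.
    method = getattr(type(obj), "__platformstr__", None)
    if method is None:
        raise TypeError("%s object is not string-like" % type(obj).__name__)
    result = method(obj)
    if not isinstance(result, str):
        raise TypeError("__platformstr__ returned non-str")
    return result


class NativeText:
    """Stand-in for a wrapper around a platform-native string object."""

    def __init__(self, native):
        self.native = native  # bytes stand in for the native object here

    def __platformstr__(self):
        return self.native.decode("utf-8")


print("abcdef"[operator.index(Offset(2))])
print(platformstr(NativeText(b"hello")))
```

With a protocol like this the wrapper would not need to inherit from str at all, so it could keep the native object without filling in any str internals.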

2017-07-19 4:34 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
2. Indicate that it's a "lazily rendered" subclass that should hold off on calling PyUnicode_READY for as long as possible, but still do so when necessary (akin to creating strings via the old Py_UNICODE APIs and then calling PyUnicode_READY on them)
Py_UNICODE is deprecated and should go away in the long term. Serhiy Storchaka started to deprecate APIs using Py_UNICODE. We call PyUnicode_READY() *everywhere* to convert a "legacy string" to the new compact format *as soon as possible*. So I don't think that you should abuse this machinery :-(
Victor

On 19 Jul 2017, at 00:35, Jim J. Jewett <jimjjewett@gmail.com> wrote:
Ronald Oussoren came up with a concrete use case for wanting the interpreter to consider something a string, even if it isn't implemented with the default data structure.
In https://mail.python.org/pipermail/python-ideas/2017-July/046407.html he writes:
The reason I need to subclass str: in PyObjC I use a subclass of str to represent Objective-C strings (NSString/NSMutableString), and I need to keep track of the original value; mostly because there are some Objective-C APIs that use object identity. The worst part is that fully initialising the PyUnicodeObject fields often isn’t necessary as a lot of Objective-C strings aren’t used as strings in Python code.
The PyUnicodeObject (via its leading PyASCIIObject member) currently uses 7 flag bits including 2 for kind. Would it be worth adding an 8th bit to indicate that the string is a virtual subclass, and that the internals should not be touched directly? (This would require changing some of the macros; at the time of PEP 393 Martin ruled it YAGNI ... but is this something that might reasonably be reconsidered, if someone did the work? Which I am considering, but not committing to.)
The reason I subclass str is primarily that it isn’t possible to be accepted as string-like by the C API otherwise (that is, PyArg_Parse and the like require a PyUnicode_Type instance when the caller asks for a string). Adding a string equivalent of __index__ would most likely be a solution for my use case[1].
Without such a hook it would be nice to be able to postpone moving to the PyUnicode_IS_READY state as long as possible, with a hook to provide the character buffer when the transition happens. That would make it possible to avoid duplicating the string buffer until it is truly needed.
Ronald
[1] Ignoring backward compatibility concerns on my side and without having fully thought through the consequences.
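
The deferred transition described above can be sketched in pure Python. This is a minimal illustration, assuming a render hook supplied by the bridge; `LazyNativeString` and its attributes are hypothetical names and do not reflect how PyObjC is implemented.

```python
class LazyNativeString:
    """Keep the native object and build the Python str only on first use."""

    def __init__(self, native, render):
        self.native = native    # original object, kept for identity-based APIs
        self._render = render   # hook that supplies the characters on demand
        self._text = None       # cached Python str, absent until needed
        self.render_count = 0   # number of times the buffer was copied

    def __str__(self):
        if self._text is None:
            # The one point where the native buffer is duplicated.
            self._text = self._render(self.native)
            self.render_count += 1
        return self._text


native = bytearray(b"caf\xc3\xa9")  # stands in for an NSString
wrapped = LazyNativeString(native, lambda obj: bytes(obj).decode("utf-8"))

print(wrapped.render_count)  # nothing has been copied yet
print(str(wrapped), str(wrapped), wrapped.render_count)
```

A string that is only passed back to a native API never triggers the hook, so its buffer is never duplicated.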
participants (4)
- Jim J. Jewett
- Nick Coghlan
- Ronald Oussoren
- Victor Stinner