[Python-Dev] Bridging strings from Python to other languages
Bill Bumgarner
bbum@codefab.com
Tue, 4 Feb 2003 23:55:42 -0500
[This is a continuation of the thread that Just mentions below --
"NSString & mutability". I finally had a chance to write enough code
to figure out where the walls were that I kept bloodying my head
against. I believe I also came up with a more constructive way to
think about this whole problem. More below -- the end of the 'bridging
strings' section contains a series of what I believe are issues with
the current Python implementation that should/may/could be addressed in
a future version]
On Tuesday, Feb 4, 2003, at 22:13 US/Eastern, Just wrote on python-dev:
> (The use case is this. The PyObjC project marries Objective-C with
> Python. This is cool as it gives us direct access to almost all of
> Cocoa, the native OSX GUI interface. However, Cocoa defines its own
> string type and for reasons that are waaay beyond the scope of this
> post (check the archives of the pyobjc-dev list if you're really
> really interested; see a recent thread called "NSString & mutability")
> it appears a bad idea to _convert_ these strings to Python unicode
> strings.
> So we need to wrap them. Yet they should work as much like unicode
> strings as possible...)
Let me rephrase the problem in slightly different terms. This will be
long winded-- skip down to the 'bridging strings' section if you don't
want to go through the initial discussion of the challenges of bridging
two runtimes...
In creating a bridge between Python and other languages-- in this case,
Objective-C-- the general goal is to provide seamless connectivity
between the two runtime environments. That is, you want to have a
proxy to objects or structures found in the 'alien' runtime available
in the 'native' runtime in a fashion that makes the proxy convenient to
use. Generally, this means that the proxy should act as much like a
'native' object as possible, up to the point where it starts to
obfuscate the behavior of the 'alien' runtime. While a decent bridging and proxying
mechanism can make "crossing the bridge" easy to do, one can never
avoid the fact that there really is a bridge and on the other side
there really is an 'alien' runtime.
Now-- there are a number of different ways to proxy objects/structures
between the two runtimes:
- pure proxy: the object/structure to be bridged is represented by a
proxy that handles all requests for information or invocation of
functions/methods by converting the request/invocation into a form that
can be understood on the other side of the bridge.
Example -- the following creates a python native proxy to the alien
Objective-C NSMutableArray instance (ignore that it really creates an
NSCFArray-- that is an internal-to-Foundation implementation detail
that is irrelevant). The expression 'a.count()' actually causes the
'count' Objective-C method to be invoked through the proxy a:
>>> from Foundation import *
>>> a = NSMutableArray.array()
>>> type(a)
<objective-c class NSCFArray at 0x466d0>
>>> a.count()
0
>>> len(a)
0
The len(a) is just a demonstration of how far the proxying can go by
defining the appropriate special methods (here, __len__) on the proxy.
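To make the pure proxy idea concrete without needing PyObjC, here is a minimal sketch in plain Python. AlienArray and Proxy are hypothetical stand-ins invented for illustration; this is not how PyObjC itself is implemented, only the general forwarding pattern.

```python
class AlienArray:
    """Hypothetical stand-in for an object living in the 'alien' runtime."""

    def __init__(self):
        self._items = []

    def count(self):
        return len(self._items)

    def addObject_(self, obj):
        self._items.append(obj)


class Proxy:
    """Forwards attribute lookups to the wrapped alien object."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Only called when normal lookup fails, so _target itself is found
        # in the instance dict and never reaches this method.
        return getattr(self._target, name)

    def __len__(self):
        # Special methods are looked up on the type, not via __getattr__,
        # so len() support has to be defined explicitly on the proxy.
        return self._target.count()


a = Proxy(AlienArray())
a.addObject_("x")
print(a.count(), len(a))
```

Both a.count() and len(a) end up asking the wrapped object, which is the behavior shown in the interpreter session above.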
- pure conversion: the object/structure to be bridged is converted to
the native type as it crosses the bridge.
Example -- the NSString is currently bridged to the Python String class
such that string instances are converted to their native types as they
cross the bridge [at least, this is the case in CVS -- I now have a
proxy class that can wrap a Python PyString/PyUnicode instance and
present it as a standard NSString instance on the ObjC side. Avoids
lots of unnecessary data copying when going from Python->Objective-C,
but it needs a bunch of cleanup before I commit.]:
In the following, I create a new Objective-C NSString instance and
assign it to 's'. What results is a copy of the contents of the
NSString instance shoved into a normal Python string.
>>> s = NSString.stringWithString_("Foobar")
>>> type(s)
<type 'str'>
- mixed conversion/proxy: this is a suboptimal case. It generally
converts to a native type in one direction, but potentially not in the
other or not fully.
Example -- NSNumber is currently in this category. It should change
eventually, but there are issues to deal with:
>>> a = NSArray.arrayWithObject_(1)
>>> a[0]
<NSCFNumber objective-c instance 0x6704b0>
>>> a[0] + 1
2
---
One of the key challenges is that proxying effectively causes two
references to any given object to exist; the native object reference
and the 'alien' reference through the proxy. Care must be taken to
ensure that a single reference on either side of the bridge is enough
to preserve both components of the hunk of data while also ensuring
that the existence of a proxy without references does not prevent the
item from being collected [may sound confusing: consider the situation
where a Python class is subclassed in the alien environment or
vice-versa -- you effectively end up with instances that have part of
their implementation in one runtime and the other part in the other
runtime. It can lead to issues.].
In general, these kinds of issues can be worked through by leveraging
mechanisms such as weak references. By providing a callback on the
finalization of an object, it is possible to ensure that the
alien-to-python component of the instance is destroyed, as well.
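A rough sketch of the finalization callback idea, using present-day Python's weakref.finalize. PythonSide, release_alien_half and the handle string are hypothetical names made up for this example; a real bridge would call into the alien runtime instead of appending to a list.

```python
import gc
import weakref


class PythonSide:
    """Hypothetical Python object that has a counterpart across the bridge."""


released = []  # records which alien-side handles were released


def release_alien_half(handle):
    # Stand-in for telling the alien runtime to drop its half of the object.
    released.append(handle)


obj = PythonSide()
# weakref.finalize arranges for the callback to run once obj has been
# collected, without itself keeping obj alive.
weakref.finalize(obj, release_alien_half, "alien-handle-1")

del obj
gc.collect()  # CPython's refcounting already collected obj; harmless elsewhere
print(released)
```

The important property is that registering the callback does not add a strong reference, so the proxy's bookkeeping never prevents collection.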
A final challenge is that sometimes an object's type or contents are
completely irrelevant to a piece of code. It is the object reference
itself that is meaningful. In these situations, if an object is passed
across the bridge and back, what should come back really should be what
went across in the first place-- if not, the contents may have been
preserved, but the object's original identity has been lost.
Sometimes an object is just an object.
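One way to picture the round trip, as a sketch only: if the alien side holds a wrapper around a Python object, crossing back should unwrap it rather than build something new. PythonObjectProxy, to_alien and to_python are hypothetical names, not part of any real bridge.

```python
class PythonObjectProxy:
    """Hypothetical alien-side wrapper around a Python object."""

    def __init__(self, wrapped):
        self.wrapped = wrapped


def to_alien(obj):
    # Crossing the bridge outward: wrap, do not copy.
    return PythonObjectProxy(obj)


def to_python(alien_obj):
    # Crossing back: if the alien object is one of our wrappers, hand back
    # the original Python object instead of building a new one.
    if isinstance(alien_obj, PythonObjectProxy):
        return alien_obj.wrapped
    return alien_obj


token = object()  # an object whose only meaning is its identity
round_tripped = to_python(to_alien(token))
print(round_tripped is token)
```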
---
Strings provide a particular set of challenges in that no two runtime
environments present exactly the same set of features in their string
handling API, yet every runtime has some kind of a string API and,
invariably, that API is very much at the core of the runtime. The
addition of Unicode to every string API over the last decade+ has not
made things any simpler.
In python, strings are immutable and can encapsulate non-unicode data.
A separate unicode-- also immutable-- type is provided to encapsulate
unicode data, but the standard string type can also encapsulate unicode
data in certain circumstances [at least, it appears that PyString will
happily consume and represent UTF8].
In Objective-C [and other languages], there is a single String class
that can encapsulate both ASCII and unicode data in many different
encodings. Furthermore, there is a subclass of String that provides
additional mutability API -- an instance of the mutable string class
can have its contents changed by the developer while the identity of
the object remains the same (unlike python where appending "b" to "a"
results in a new string "ab").
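The difference is easy to see from the Python side. In the sketch below, bytearray is used purely as a stand-in for a mutable string class (that pairing is my assumption for illustration; it is not an NSMutableString).

```python
original = "a"
a = original
a = a + "b"       # builds a new str object; the original "a" is untouched
print(a is original, original)

buf = bytearray(b"a")    # a mutable sequence, used here only as a contrast
buf_id_before = id(buf)
buf += b"b"              # modifies the same object in place
print(id(buf) == buf_id_before, buf)
```

With the immutable type the name is rebound to a different object; with the mutable one the object keeps its identity while its contents change.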
To further complicate matters, most typed languages support the concept
of 'upcasting' -- that is, of casting a particular instance to actually
be an instance of a superclass. For Objective-C, it can mean that a
method that is declared as returning an immutable string or array
actually returns a mutable string or array instance -- as long as the
developer pays attention to the compiler warnings and doesn't do any
stupid casting of their own, everything is fine. Java offers similar
casting "features".
- bridging strings -
So, how to bridge strings in such an environment? In all cases, we
can [fortunately] assume that strings pass across the bridge in one of
a few choke points in the code -- that there is always a location to
add a little bit of logic with which to help bridge the string [or any
other random object].
The goal is to bridge strings in a fashion such that (not really in
order of importance):
(1) only one hunk of memory is used to contain the data within the
string
(2) conversion is kept to a minimum, if present at all, because
strings will be passed back-and-forth across the bridge very frequently
(3) identity is maintained; pass a string with id() 7570720 from
Python into the alien runtime and subsequently from the alien runtime
back into python and the same string instance with id() 7570720 really
should come back
(4) 'alien' string specific API can still be used; the
Objective-C NSString provides a very rich API, including localization
features that are not available in pure python.
For Python->Objective-C, bridging strings has proven to be fairly easy.
(1), (2), and (4) are quite straightforward. (3) is not done yet.
For Objective-C->Python, bridging strings is not so easy. The
difficulty is compounded by certain features of the Python
string/unicode APIs.
(1) is pretty easy -- the challenge is to figure out which API to call
on the Python side such that the resulting Python object does not copy
and re-encode the data. If that is unavoidable, the cost of encoding
or conversion (2) should be minimized [hopefully with a cache so that
the cost of conversion is paid once, then never again for immutable string
instances]. There is also the ongoing challenge of determining when to
use the PyString vs. PyUnicode APIs; it seems that unicode objects are
not welcome everywhere that string objects are?
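The "pay once" cache could look roughly like the following. This is only a sketch under assumptions: AlienString, its utf8_bytes attribute and to_python_str are hypothetical, and the cache is only safe because the alien string is assumed immutable.

```python
import weakref


class AlienString:
    """Hypothetical immutable string owned by the alien runtime."""

    def __init__(self, utf8_bytes):
        self.utf8_bytes = utf8_bytes


stats = {"conversions": 0}
# Weak keys: a cached entry disappears when the alien string goes away,
# so the cache itself does not leak.
_cache = weakref.WeakKeyDictionary()


def to_python_str(alien):
    try:
        return _cache[alien]
    except KeyError:
        stats["conversions"] += 1  # the expensive step, paid once
        result = alien.utf8_bytes.decode("utf-8")
        _cache[alien] = result
        return result


s = AlienString("Foobar".encode("utf-8"))
first = to_python_str(s)
second = to_python_str(s)
print(first, first is second, stats["conversions"])
```

Note that the weak key is the alien-side object, not the Python string; the converted string is held strongly as the value.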
(4) is actually quite easy and has been available for some time through
the use of unbound methods. However, the current implementation in
CVS will always cause the python string to be converted to an NSString,
the method invoked, and then the result-- if any and if a string-- is
converted back to a python string.
(3) is not so easy-- at least, not from what I have determined so far.
Most of the issues seem to be due to limitations in Python (which is
really just another way of saying "I don't know enough to approach this
problem from the right direction"):
- can't use weakref because one can't have a weak reference to a
string or unicode object. This means that a callback when a string
ref is finalized on the python side is not possible. It also means
that creating a hash between ObjC string instances and Python string
instances can't be done without using strong references, thereby
creating the potential for leaking memory.
- can't subclass string (but can unicode) to provide a class that
acts exactly like a regular string while containing a reference to the
foreign string object. There doesn't appear to be anywhere to hide a
hunk of data in the string instance, either.
- can't use the character buffer APIs because a character buffer
cannot be used consistently throughout the python APIs in the same
places as a string. Using str() to turn a char buffer into a string
violates (1) [and doesn't make much sense anyway].
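The weakref limitation in the first item is easy to check. The snippet below is run against a present-day CPython rather than the 2003 interpreter; can_weakref and Plain are helper names invented for the example.

```python
import weakref


def can_weakref(obj):
    """Return True if obj supports weak references."""
    try:
        weakref.ref(obj)
    except TypeError:
        return False
    return True


class Plain:
    """Ordinary class instance, weak-referenceable by default."""


str_ok = can_weakref("Foobar")
instance_ok = can_weakref(Plain())
print(str_ok, instance_ok)
```

Without a weak reference there is no finalization callback for the string, so any table that maps alien strings to Python strings has to hold strong references.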
End result -- it is very difficult to preserve the association between
an alien string instance and a PyString instance consistently. Even if
PyString instances provide very thorough and consistent hashing
behavior where two strings with the same contents always hash the same,
the same cannot be said of all alien environments. Even when it is
true, there are cases where the developer may be relying on the
identity of the object to not change outside of their control.
Mutable strings obviously present issues of their own, but they are not
particularly relevant to discussion on python-dev outside of how future
development might make the support of such common idiosyncrasies a bit
more straightforward. Ideally, one could have an object on the python
side that looks/feels/smells like a string instance, but whose contents
may change. This creates any number of exciting problems. To further
compound problems, anything that is declared as returning an NSString
*may* return an NSMutableString at whim. It doesn't happen often, but
when it does, if the handling of mutable vs. immutable strings is too
radically different, it'll cause code to blow up in highly unexpected
and very difficult to debug ways.
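A small illustration of why a mutable string hiding behind an immutable declaration hurts a conversion-based bridge. Again bytearray is only a stand-in for the alien mutable string, which is my assumption for the sake of a runnable example.

```python
alien = bytearray(b"Foo")          # stand-in for a mutable alien string
snapshot = alien.decode("ascii")   # pure conversion: copy into a Python str

alien += b"bar"                    # the alien side mutates its string later

# The Python copy no longer reflects what the alien object contains.
print(snapshot, alien.decode("ascii"))
```

A converted copy silently goes stale, while a proxy would see the change but then stops behaving like an immutable Python string.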
Rambling on....
b.bum