[Python-Dev] Bridging strings from Python to other languages

Bill Bumgarner bbum@codefab.com
Tue, 4 Feb 2003 23:55:42 -0500


[This is a continuation of the thread that Just mentions below -- 
"NSString & mutability".  I finally had a chance to write enough code 
to figure out where the walls were that I kept bloodying my head 
against.  I believe I also came up with a more constructive way to 
think about this whole problem. More below -- the end of the 'bridging 
strings' section contains a series of what I believe are issues with 
the current Python implementation that should/may/could be addressed in 
a future version]

On Tuesday, Feb 4, 2003, at 22:13 US/Eastern, Just wrote on python-dev:
> (The use case is this. The PyObjC project marries Objective-C with
> Python. This is cool as it gives us direct access to almost all of
> Cocoa, the native OSX GUI interface. However, Cocoa defines its own
> string type and for reasons that are waaay beyond the scope of this post
> (check the archives of the pyobjc-dev list if you're really really
> interested; see a recent thread called "NSString & mutability") it
> appears a bad idea to _convert_ these strings to Python unicode strings.
> So we need to wrap them. Yet they should work as much like unicode
> strings as possible...)

Let me rephrase the problem in slightly different terms.   This will be 
long-winded -- skip down to the 'bridging strings' section if you don't 
want to go through the initial discussion of the challenges of bridging 
two runtimes...

In creating a bridge between Python and other languages-- in this case, 
Objective-C-- the general goal is to provide seamless connectivity 
between the two runtime environments.   That is, you want to have a 
proxy to objects or structures found in the 'alien' runtime available 
in the 'native' runtime in a fashion that makes the proxy convenient to 
use.   Generally, this means that the proxy should act as much like a 
native object as possible, up to the point where doing so starts to 
obscure the behavior of the 'alien' runtime.   While a decent bridging 
and proxying 
mechanism can make "crossing the bridge" easy to do, one can never 
avoid the fact that there really is a bridge and on the other side 
there really is an 'alien' runtime.

Now-- there are a number of different ways to proxy objects/structures 
between the two runtimes:

- pure proxy:   the object/structure to be bridged is represented by a 
proxy that handles all requests for information or invocation of 
functions/methods by converting the request/invocation into a form that 
can be understood on the other side of the bridge.

Example -- the following creates a python native proxy to the alien 
Objective-C NSMutableArray instance (ignore that it really creates an 
NSCFArray-- that is an internal-to-Foundation implementation detail 
that is irrelevant).  The expression 'a.count()' actually causes the 
'count' Objective-C method to be invoked through the proxy a:

 >>> from Foundation import *
 >>> a = NSMutableArray.array()
 >>> type(a)
<objective-c class NSCFArray at 0x466d0>
 >>> a.count()
0
 >>> len(a)
0

The len(a) is just a demonstration of how far the proxying can go by 
defining the appropriate special methods (here, __len__) on the proxy.

- pure conversion:   the object/structure to be bridged is converted to 
the native type as it crosses the bridge.

Example -- NSString is currently bridged to the Python string type 
such that string instances are converted to their native types as they 
cross the bridge [at least, this is the case in CVS -- I now have a 
proxy class that can wrap a Python PyString/PyUnicode instance and 
present it as a standard NSString instance on the ObjC side.  It avoids 
lots of unnecessary data copying when going from Python->Objective-C, 
but it needs a bunch of cleanup before I commit.]

In the following, I create a new Objective-C NSString instance and 
assign it to 's'.  What results is a copy of the contents of the 
NSString instance shoved into a normal Python string.

 >>> s = NSString.stringWithString_("Foobar")
 >>> type(s)
<type 'str'>

- mixed conversion/proxy:   this is a suboptimal case.   It generally 
converts to a native type in one direction, but potentially not in the 
other or not fully.

Example -- NSNumber is currently in this category.   It should change 
eventually, but there are issues to deal with:

 >>> a = NSArray.arrayWithObject_(1)
 >>> a[0]
<NSCFNumber objective-c instance 0x6704b0>
 >>> a[0] + 1
2
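
The difference between the pure proxy and pure conversion strategies 
can be sketched without PyObjC at all.  The following is only an 
illustration: AlienArray, PureProxy and pure_conversion are made-up 
names standing in for the alien object and the bridge, and nothing 
here reflects how PyObjC is actually implemented.

```python
class AlienArray:
    """Hypothetical stand-in for an object living in the 'alien' runtime."""

    def __init__(self):
        self._items = []

    def count(self):
        return len(self._items)

    def addObject_(self, obj):
        self._items.append(obj)


class PureProxy:
    """Forwards every request to the alien object; no data is copied."""

    def __init__(self, alien):
        self._alien = alien

    def __getattr__(self, name):
        # Only called for names that are not found on the proxy itself.
        return getattr(self._alien, name)

    def __len__(self):
        # Map a native protocol (len) onto the alien method (count).
        return self._alien.count()


def pure_conversion(alien):
    """Copy the contents into a native list; the link to the alien is lost."""
    return list(alien._items)


alien = AlienArray()
proxy = PureProxy(alien)
converted = pure_conversion(alien)

proxy.addObject_("x")  # travels through the proxy to the alien object
```

After the last line the proxy and the alien object agree that there is 
one element, while the converted list is still empty because it was a 
snapshot taken when it crossed the bridge.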

---

One of the key challenges is that proxying effectively causes two 
references to any given object to exist:  the native object reference 
and the 'alien' reference through the proxy.   Care must be taken to 
ensure that a single reference on either side of the bridge is enough 
to keep both halves of the object alive, while also ensuring that a 
proxy to which nothing refers does not prevent the object from being 
collected.  [This may sound confusing:  consider the situation where a 
Python class is subclassed in the alien environment or vice-versa -- 
you effectively end up with instances that have part of their 
implementation in one runtime and the other part in the other runtime, 
and the two parts must be kept alive, and released, together.]

In general, these kinds of issues can be worked through by leveraging 
mechanisms such as weak references.  By providing a callback on the 
finalization of an object, it is possible to ensure that the 
alien-to-python component of the instance is destroyed, as well.

A final challenge is that sometimes an object's type or contents are 
completely irrelevant to a piece of code.   It is the object reference 
itself that is meaningful.  In these situations, if an object is passed 
across the bridge and back, what should come back really should be what 
went across in the first place-- if not, the contents may have been 
preserved, but the object's original identity has been lost.

Sometimes an object is just an object.
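
The round-trip requirement can be sketched as follows.  This assumes 
the bridge can recognise its own wrappers on the way back; 
AlienWrapper, to_alien and to_python are hypothetical names used only 
for illustration.

```python
class AlienWrapper:
    """Hypothetical alien-side wrapper around a Python object."""

    def __init__(self, py_obj):
        self.py_obj = py_obj


def to_alien(py_obj):
    return AlienWrapper(py_obj)


def to_python(alien_obj):
    # If the alien object is merely a wrapper around something that started
    # life in Python, hand back the original rather than a fresh copy.
    if isinstance(alien_obj, AlienWrapper):
        return alien_obj.py_obj
    raise TypeError("no conversion defined in this sketch")


token = object()                       # only its identity matters
round_tripped = to_python(to_alien(token))

original = ["a", "b"]
copied = list(original)                # what a pure conversion would return
```

Here round_tripped is the very same object as token, whereas copied 
compares equal to original but is a different object.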

---

Strings provide a particular set of challenges in that no two runtime 
environments present exactly the same set of features in their string 
handling API, yet every runtime has some kind of a string API and, 
invariably, that API is very much at the core of the runtime.   The 
addition of Unicode to every string API over the last decade+ has not 
made things any simpler.

In python, strings are immutable and can encapsulate non-unicode data.  
A separate unicode type, also immutable, is provided to encapsulate 
unicode data, but the standard string type can also carry encoded 
unicode data in certain circumstances [at least, it appears that 
PyString will happily consume and represent UTF-8 encoded bytes].

In Objective-C [and other languages], there is a single String class 
that can encapsulate both ASCII and unicode data in many different 
encodings.   Furthermore, there is a subclass of String that provides 
additional mutability API -- an instance of the mutable string class 
can have its contents changed by the developer while the identity of 
the object remains the same (unlike python where appending "b" to "a" 
results in a new string "ab").
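
The identity difference is easy to see from the Python side.  In this 
sketch a bytearray stands in for a mutable string; that choice is an 
assumption made purely for illustration and says nothing about how 
NSMutableString is or should be bridged.

```python
old = "a"
new = old + "b"         # builds a new str object; `old` is left untouched

buf = bytearray(b"a")   # stand-in for a mutable string
same = buf              # second reference to the same object
buf += b"b"             # in-place append: same object, new contents
```

After this runs, new is a different object from old, while same and 
buf still refer to one object whose contents have changed.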

To further complicate matters, most typed languages support the concept 
of 'upcasting' -- that is, of casting a particular instance to actually 
be an instance of a superclass.   For Objective-C, it can mean that a 
method that is declared as returning an immutable string or array 
actually returns a mutable string or array instance -- as long as the 
developer pays attention to the compiler warnings and doesn't do any 
stupid casting of their own, everything is fine.   Java offers similar 
casting "features".

- bridging strings -

So, how to bridge strings in such an environment?   In all cases, we 
can [fortunately] assume that strings pass across the bridge in one of 
a few choke points in the code -- that there is always a location to 
add a little bit of logic with which to help bridge the string [or any 
other random object].

The goal is to bridge strings in a fashion such that (not really in 
order of importance):

     (1) only one hunk of memory is used to contain the data within the 
string

     (2) conversion is kept to a minimum, if present at all, because 
strings will be passed back-and-forth across the bridge very frequently

     (3) identity is maintained;   pass a string with id() 7570720 from 
Python into the alien runtime and subsequently from the alien runtime 
back into python and the same string instance with id() 7570720 really 
should come back

     (4) 'alien' string specific API can still be used;   the 
Objective-C NSString provides a very rich API, including localization 
features that are not available in pure python.

For Python->Objective-C, bridging strings has proven to be fairly easy.  
(1), (2), and (4) are quite straightforward.   (3) is not done yet.

For Objective-C->Python, bridging strings is not so easy.  The 
difficulty is compounded by certain features of the Python 
string/unicode APIs.

(1) is pretty easy -- the challenge is to figure out which API to call 
on the Python side such that the resulting Python object does not copy 
and re-encode the data.   If that is unavoidable, the cost of encoding 
or conversion (2) should be minimized [hopefully with a cache so that 
cost of conversion is paid once, then never again for immutable string 
instances].  There is also the ongoing challenge of determining when to 
use the PyString vs. PyUnicode APIs;  it seems that unicode objects are 
not welcome everywhere that string objects are?
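
A rough sketch of the "pay for conversion once" cache, assuming the 
alien string is immutable and exposes its contents as UTF-16 bytes.  
AlienString, utf16_bytes and to_python_str are hypothetical names; 
the real NSString API is not used here.

```python
class AlienString:
    """Hypothetical immutable alien string holding UTF-16 encoded bytes."""

    def __init__(self, text):
        self._utf16 = text.encode("utf-16-le")

    def utf16_bytes(self):
        return self._utf16


_cache = {}       # id(alien) -> (alien, converted str)
conversions = 0   # counts how often the expensive path is taken


def to_python_str(alien):
    global conversions
    entry = _cache.get(id(alien))
    if entry is not None:
        return entry[1]
    conversions += 1
    text = alien.utf16_bytes().decode("utf-16-le")
    # Holding `alien` in the entry keeps its id() from being reused, but it
    # is a strong reference: the entry lives until it is explicitly evicted.
    _cache[id(alien)] = (alien, text)
    return text


s = AlienString("Foobar")
first = to_python_str(s)
second = to_python_str(s)
```

The second call returns the same Python object as the first without 
converting again.  The price is the strong reference held by the 
cache, which is exactly the leak risk discussed further down.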

(4) is actually quite easy and has been available for some time through 
the use of unbound methods.   However, the current implementation in 
CVS will always cause the python string to be converted to an NSString, 
the method invoked, and then the result-- if any and if a string-- is 
converted back to a python string.

(3) is not so easy-- at least, not from what I have determined so far.  
Most of the issues seem to be due to limitations in Python (which is 
really just another way of saying "I don't know enough to approach this 
problem from the right direction"):

     - can't use weakref because one can't have a weak reference to a 
string or unicode object.   This means that a callback when a string 
ref is finalized on the python side is not possible.   It also means 
that creating a hash between ObjC string instances and Python string 
instances can't be done without using strong references, thereby 
creating the potential for leaking memory.

     - can't subclass string (but can unicode) to provide a class that 
acts exactly like a regular string while containing a reference to the 
foreign string object.  There doesn't appear to be anywhere to hide a 
hunk of data in the string instance, either.

     - can't use the character buffer APIs because a character buffer 
cannot be used consistently throughout the python APIs in the same 
places as a string.  Using str() to turn a char buffer into a string 
violates (1) [and doesn't make much sense anyway].
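
The weak reference limitation can be checked directly.  The sketch 
below was written against a current CPython 3 interpreter, which is an 
assumption: the interpreter discussed in this post may behave 
differently, in particular regarding subclassing.  BridgedStr and 
can_weakref are hypothetical names.

```python
import weakref


class Plain:
    """Ordinary class: instances can be weakly referenced."""


class BridgedStr(str):
    """Hypothetical str subclass carrying a reference to the alien string."""

    def __new__(cls, text, alien=None):
        self = super().__new__(cls, text)
        self.alien = alien
        return self


def can_weakref(obj):
    try:
        weakref.ref(obj)
    except TypeError:
        return False
    return True


plain_ok = can_weakref(Plain())
str_ok = can_weakref("Foobar")
sub_ok = can_weakref(BridgedStr("Foobar"))

# Ordinary string operations hand back a plain str, dropping the link.
derived = BridgedStr("Foo") + "bar"
```

On that interpreter a plain string refuses weak references while the 
subclass instance accepts them, but any result derived from the 
subclass instance is a plain str again, so the association with the 
alien object is easily lost.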

End result -- it is very difficult to preserve the association between 
an alien string instance and a PyString instance consistently.  Even if 
PyString instances provide very thorough and consistent hashing 
behavior where two strings with the same contents always hash the same, 
the same cannot be said of all alien environments.   Even when it is 
true, there are cases where the developer may be relying on the 
identity of the object to not change outside of their control.

Mutable strings obviously present issues of their own, but they are not 
particularly relevant to discussion on python-dev outside of how future 
development might make the support of such common idiosyncrasies a bit 
more straightforward. Ideally, one could have an object on the python 
side that looks/feels/smells like a string instance, but whose contents 
may change.  This creates any number of exciting problems.  To further 
compound problems, anything that is declared as returning an NSString 
*may* return an NSMutableString at whim.  It doesn't happen often, but 
when it does, if the handling of mutable vs. immutable strings is too 
radically different, it'll cause code to blow up in highly unexpected 
and very hard-to-debug ways.
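
One of the "exciting problems" can be sketched with a string-like 
object whose contents change after it has been used as a dictionary 
key.  MutableText is a hypothetical class written for this 
illustration only; it is not a proposal for how a bridged mutable 
string should behave.

```python
class MutableText:
    """Hypothetical string-like proxy whose contents may change in place."""

    def __init__(self, text):
        self._chars = list(text)

    def append(self, more):
        self._chars.extend(more)

    def __str__(self):
        return "".join(self._chars)

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        # Hashes like the equivalent str, so it can stand in for one.
        return hash(str(self))


key = MutableText("spam")
table = {key: 1}
found_before = "spam" in table    # same hash, compares equal

key.append("!")                   # contents change, identity does not

found_by_old_value = "spam" in table    # hash matches, equality now fails
found_by_new_value = "spam!" in table   # equality would match, hash does not
```

The entry is still in the dictionary, but it can no longer be found 
by either the old or the new contents, which is the kind of failure 
that surfaces far away from the code that mutated the string.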

Rambling on....
b.bum