[C++-sig] boost::python::str and Python's str and unicode types
divinekid at gmail.com
Tue Aug 4 17:23:55 CEST 2009
On Tue, Aug 4, 2009 at 4:37 PM, Robert
Smallshire<Robert.Smallshire at roxar.com> wrote:
> On Tue, Jul 28, 2009 at 10:11 PM, Robert
> Smallshire<Robert.Smallshire at roxar.com> wrote:
>> I have modified my local build of boost.python to include a
>> boost::python::unicode class, together with appropriate conversions from
>> wchar_t, const wchar_t* and std::wstring...
> During testing we have encountered issues with the difference in size of wchar_t and Py_UNICODE.
> Windows : sizeof(wchar_t) == sizeof(Py_UNICODE) == 2
> Linux : sizeof(wchar_t) == 4 != sizeof(Py_UNICODE) == 2
> assuming a UCS-2 build of Python which is the default. If Python is built with UCS-4 support then I believe Py_UNICODE and wchar_t will become compatible on Linux, but I'm not sure what the implications are for compatibility of Unicode string pickles, for example, between UCS-2 and UCS-4 builds of Python.
> Unfortunately, extract<const wchar_t*> seems to be problematic to implement in a portable manner because of these size differences. I have identified the following options:
> 1) Don't support extract<const wchar_t*> at all. There are no portability problems, but we have reduced functionality and break the symmetry between boost::python::str and boost::python::unicode behaviour.
> 2) Only support extract<const wchar_t*> on platforms where sizeof(wchar_t) == sizeof(Py_UNICODE) where the PyUnicode_AsUnicode function can be used to return a pointer to Python's internal buffer. This has the API usability advantage of being symmetrical with how extract<const char*> works in boost.python today on platforms that support it. However, this makes writing portable code for clients awkward. This is what my current implementation does, and it's broken on Linux.
> 3) Implement extract<const wchar_t*> such that it always copies the data from the Py_UNICODE buffer into a new wchar_t buffer using PyUnicode_AsWideChar under the hood. The caller is then responsible for managing the lifetime of the buffer using delete[] or boost::shared_array. This is how extract<std::wstring> is implemented, which works without difficulty. However, this breaks the symmetry with extract<const char*> in a non-obvious way that would need to be prominently documented. I suggest this approach would be likely to lead to quite leaky usage of the API by unwary clients, especially when porting code to Unicode strings.
> 4) #ifdef between (2) and (3) above depending on whether sizeof(wchar_t) == sizeof(Py_UNICODE). Combines all the bad characteristics of the above.
> There may, of course, be other options.
> If the data needs to be copied into a new buffer of wchar_t, the lifetime of which needs to be managed by the client, that pretty much describes the raison d'être of std::wstring, so my current preference is for option (1). If we did this, we'd still be able to construct boost::python::unicode instances from const wchar_t*, but would only be able to extract them as std::wstring. I'm open to persuasion about the right way forward...
> Thanks in advance for any comments or suggestions, and also to the people who have expressed interest in these patches off list.
> Rob Smallshire
> Roxar Software Solutions
There could be an option similar to your 3) that still keeps the
memory managed on the Python side. The trick is that the
PyUnicodeObject struct has a "PyObject *defenc" field, which can be
seen as an internal object attached to and managed by the PyUnicode
object. Some Python APIs use this field to cache a UTF-8 encoded
PyString of the PyUnicode object, for example PyUnicode_AsString
(_PyUnicode_AsString in Python 3, where it became an internal API).
This cached object is therefore owned by the PyUnicode object and is
destroyed when the PyUnicode object is destroyed.
So we could hijack this field to inject an object that stores a
wchar_t* buffer and is still managed by Python. This could be
implemented by subclassing PyString with an additional field. But, eh,
this sounds a bit crazy.
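To make the ownership idea concrete, here is a rough sketch in plain
C++ with no Python headers involved. UnicodeOwner is a hypothetical
stand-in for a PyUnicode object on a UCS-2 build, and cache_ plays the
role of the object injected into defenc. It is only an analogy for the
lifetime rule, not how Python or boost.python implement anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical owner type, standing in for a PyUnicode object on a
// UCS-2 build. It is not part of Python or boost.python.
class UnicodeOwner
{
public:
    UnicodeOwner(const std::uint16_t* data, std::size_t length)
        : units_(data, data + length), cache_valid_(false)
    {
        assert(data != nullptr || length == 0);
    }

    // Returns a pointer into a buffer owned by this object. The wide
    // copy is built on first use and reused afterwards, so the pointer
    // stays valid until the owner is destroyed.
    const wchar_t* as_wide() const
    {
        if (!cache_valid_) {
            // Each 16-bit unit is converted to one wchar_t, whether
            // wchar_t is 2 or 4 bytes wide.
            cache_.assign(units_.begin(), units_.end());
            cache_valid_ = true;
        }
        return cache_.c_str();
    }

private:
    std::vector<std::uint16_t> units_;  // the "native" 16-bit storage
    mutable std::wstring cache_;        // role of the injected object
    mutable bool cache_valid_;
};
```

The caller never frees anything here, which is the whole attraction
compared with option 3).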
Anyway, for pointers like const wchar_t *, boost::python requires an
"lvalue converter", but if the converter creates a new object, it is
no longer really an lvalue converter. That would be a bit strange.
So I think that for now a unicode implementation without a const
wchar_t * converter is OK, as in your option 1). We can implement the
converter in the future.
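For illustration, the copy that a std::wstring-only extraction relies
on could look roughly like the sketch below. It uses no Python headers:
code_unit16 is a hypothetical stand-in for Py_UNICODE on a UCS-2 build,
and copy_to_wstring is a made-up name, not an existing boost.python or
Python function. It assumes the input holds no surrogate pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for Py_UNICODE on a UCS-2 build of Python.
typedef std::uint16_t code_unit16;

// Copy a buffer of 16-bit code units into a std::wstring. The returned
// string owns its memory, so the caller has no buffer to free.
std::wstring copy_to_wstring(const code_unit16* data, std::size_t length)
{
    assert(data != nullptr || length == 0);

    std::wstring result;
    result.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        // One code unit becomes one wchar_t, whether wchar_t is
        // 2 bytes (Windows) or 4 bytes (Linux).
        result.push_back(static_cast<wchar_t>(data[i]));
    }
    return result;
}
```

Since the result is a value, the sizeof(wchar_t) mismatch on Linux
stops mattering to the client code.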
Just my thoughts.
School of Computing,
National University of Singapore.