[C++-sig] boost::python::str and Python's str and unicode types

Tue Aug 4 10:37:28 CEST 2009

On Tue, Jul 28, 2009 at 10:11 PM, Robert
Smallshire<Robert.Smallshire at roxar.com> wrote:

> I have modified my local build of boost.python to include a
> boost::python::unicode class, together with appropriate conversions from
> wchar_t, const wchar_t* and std::wstring...

During testing we have encountered issues with the difference in size of wchar_t and Py_UNICODE.

Windows : sizeof(wchar_t) == sizeof(Py_UNICODE) == 2
Linux   : sizeof(wchar_t) == 4 != sizeof(Py_UNICODE) == 2

assuming a UCS-2 build of Python, which is the default. If Python is built with UCS-4 support, then I believe Py_UNICODE and wchar_t will become compatible on Linux, but I'm not sure what the implications are for compatibility between UCS-2 and UCS-4 builds of Python, for example when exchanging Unicode string pickles.
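
To make the size comparison concrete, here is a minimal sketch of the check as a compile-time predicate. The typedefs py_unicode_ucs2 and py_unicode_ucs4 and the function same_width_as_wchar are hypothetical stand-ins I have made up for illustration; they are not part of Python or boost.python, and real code would test sizeof(Py_UNICODE) from Python.h instead.

```cpp
#include <cstdint>

// ASSUMPTION: hypothetical stand-ins for Py_UNICODE, not the real typedef.
// A UCS-2 build of Python uses 2-byte code units, a UCS-4 build 4-byte ones.
typedef std::uint16_t py_unicode_ucs2;
typedef std::uint32_t py_unicode_ucs4;

// True when a buffer of CodeUnit has the same element width as a wchar_t
// buffer, which is the precondition for handing out a pointer to the
// internal buffer without copying.
template <typename CodeUnit>
constexpr bool same_width_as_wchar()
{
    return sizeof(CodeUnit) == sizeof(wchar_t);
}
```

With a 2-byte wchar_t (Windows) the predicate holds for the UCS-2 stand-in; with a 4-byte wchar_t (typical Linux) it holds only for the UCS-4 stand-in.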

Unfortunately, extract<const wchar_t*> seems to be problematic to implement in a portable manner because of these size differences.  I have identified the following options:

1) Don't support extract<const wchar_t*> at all. There are no portability problems, but we lose functionality and break the symmetry between the behaviour of boost::python::str and boost::python::unicode.

2) Only support extract<const wchar_t*> on platforms where sizeof(wchar_t) == sizeof(Py_UNICODE), so that the PyUnicode_AsUnicode function can be used to return a pointer to Python's internal buffer. This has the API usability advantage of being symmetrical with how extract<const char*> works in boost.python today on platforms that support it. However, this makes writing portable code awkward for clients. This is what my current implementation does, and it's broken on Linux.

3) Implement extract<const wchar_t*> such that it always copies the data from the Py_UNICODE buffer into a new wchar_t buffer using PyUnicode_AsWideChar under the hood. The caller is then responsible for managing the lifetime of the buffer using delete [] or boost::shared_array. This is how extract<std::wstring> is implemented, which works without difficulty. However, this breaks the symmetry with extract<const char*> in a non-obvious way that would need to be prominently documented. I suspect this approach would be likely to lead to quite leaky usage of the API by unwary clients, especially when porting code to Unicode strings.

4) #ifdef between (2) and (3) above depending on whether sizeof(wchar_t) == sizeof(Py_UNICODE).  Combines all the bad characteristics of the above.

There may, of course, be other options.

If the data needs to be copied into a new buffer of wchar_t, the lifetime of which needs to be managed by the client, that pretty much describes the raison d'être of std::wstring, so my current preference is for option (1). If we did this, we'd still be able to construct boost::python::unicode instances from const wchar_t*, but would only be able to extract them as std::wstring. I'm open to persuasion about the right way forward...
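
As a rough illustration of why std::wstring fits the copying case, here is a minimal sketch of widening a buffer of 2-byte code units into a std::wstring that owns the copy. The typedef py_unicode_ucs2 and the helper widen_code_units are hypothetical names of my own, standing in for Py_UNICODE and for whatever the real conversion would call; the actual patch uses PyUnicode_AsWideChar rather than a hand-written loop. The sketch converts unit by unit and makes no attempt to combine surrogate pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// ASSUMPTION: hypothetical stand-in for Py_UNICODE on a UCS-2 build.
typedef std::uint16_t py_unicode_ucs2;

// Hypothetical helper: copy `length` code units into a std::wstring.
// The returned string owns its buffer, so the caller has nothing to free.
// Each unit is widened individually; surrogate pairs are NOT combined.
std::wstring widen_code_units(const py_unicode_ucs2* units, std::size_t length)
{
    assert(units != nullptr || length == 0);
    std::wstring result;
    result.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        result.push_back(static_cast<wchar_t>(units[i]));
    }
    return result;
}
```

Because the result is returned by value, a client can call c_str() on it for as long as the std::wstring is alive, with no delete [] to forget.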

Thanks in advance for any comments or suggestions, and also to the people who have expressed interest in these patches off list.

Regards,

Rob Smallshire
Roxar Software Solutions
