[C++-sig] Some thoughts on py3k support

Thu Mar 19 01:05:24 CET 2009

On 18 Mar 2009 at 2:07, Haoyu Bai wrote:

> First thing come into my mind is the build system. For Python 3, we
> would have a separate build target, eg. having libboost_python.so
> built for Python 2.x and libboost_python3.so for Python 3. This would
> match the current situation that most Linux distro packaged Python 3
> in a way that lives together with Python 2.x. There's two build system
> for Boost now - cmake and bjam. Personally I want to start with cmake,
> and finally both of the two will up to date for Python 3.

I'd prefer a slightly different naming convention for the eventual 
case when Boost.Python itself bumps a version. How about this:

libboost_python_py2.so
libboost_python_py3.so

I'd also be happy with (remembering that current Boost.Python is v2):

libboost_python2_py2.so
libboost_python2_py3.so

> I have read some piece of Boost.Python code in these, it is
> understandable for me. And I'd say the usage of template
> metaprogramming is really smart! Thanks to the high level abstraction,
> there would be just a little code interfaced to Python C-API directly.
> So there would not be so much works.

Hmm, we'll see. It's much trickier than you might think.

> However there are something we need to take care of. One of them is,
> in Python 3, string is unicode (and the old string class is called
> bytes in Python 3). So if we have a C function
> 
> char const* hello(); // returns a "Hello"
> 
> According to the current behavior of Boost.Python converters, the
> wrapped function in Python 3 will return a b"Hello" (which is a bytes
> object but not a string). So code like this will broken:
> 
> if "Hello" == hello(): ...
> 
> Because string object "Hello" is not equal to bytes object b"Hello"
> returned by hello(). We may change the behavior of converter to return
> a unicode string in Python 3, that would keep most of existing code
> compatible. Anyway there will be code really need a single byte string
> returned, a new converter can be explicitly specified for this.

One shouldn't be doing such a comparison anyway IMHO, though the 
idiom of equivalence between C++ immutable strings and python 
immutable strings is long-standing. Also, we need to fix booleans not 
working quite properly in BPL.
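
To make the quoted mismatch concrete, here is a minimal sketch in plain 
Python 3, with no Boost.Python involved. The function hello_bytes() is a 
hypothetical stand-in for a wrapped hello() whose converter produces bytes, 
and decoding as UTF-8 is an assumption for illustration:

```python
def hello_bytes():
    # Hypothetical stand-in for the wrapped C function returning char const*
    # when the converter produces a bytes object.
    return b"Hello"

# In Python 3, str and bytes never compare equal, even for pure ASCII.
mismatch = ("Hello" == hello_bytes())

# Decoding first (UTF-8 assumed here) gives a str that compares as expected.
match = ("Hello" == hello_bytes().decode("utf-8"))

print(mismatch, match)
```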

> There are more issues similar to this. I'll figure out more and write
> a detailed proposal as soon as possible.

I have my own opinion on unicode and judging by the other posts, I'll 
be disagreeing with just about everyone else.

Firstly, I'd like to state that Python v3 has ditched the old string 
system for very good reasons and this change more than any other has 
created source incompatibilities in most code. One cannot expect much 
difference in Boost.Python - code *should* need to be explicitly 
ported.

Much like the open() function with text (and reusing its machinery), 
I propose that you need to specify the *default* encoding for immutable 
const char *, though in most cases it would default from the locale 
(LANG, LC_ALL or LC_CTYPE) to UTF-8. That's right: const char * will be 
UTF-8 by default, though it's overridable. This default encoding should 
be a per-Python-interpreter setting (i.e. it uses whatever open() uses), 
though it can be temporarily overridden using a specifier template.
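
A rough sketch of that lookup order on the Python side, under stated 
assumptions: locale.getpreferredencoding(False) is the locale-derived value 
that open() has consulted by default in text mode in Python 3 releases to 
date, and decode_c_string() is a hypothetical helper, not part of 
Boost.Python. Its encoding argument plays the role of the specifier template:

```python
import locale

def default_encoding():
    # The locale-derived encoding that Python 3's open() falls back to in
    # text mode when no encoding argument is given.
    return locale.getpreferredencoding(False)

def decode_c_string(raw, encoding=None):
    # Hypothetical helper: decode the bytes behind a const char * using the
    # per-interpreter default unless the caller overrides it explicitly.
    return raw.decode(encoding or default_encoding())

# ASCII-only data decodes the same under any ASCII-compatible default.
greeting = decode_c_string(b"Hello")

# An explicit override, playing the role of the proposed specifier template.
cafe = decode_c_string(b"caf\xc3\xa9", encoding="utf-8")
```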

const unsigned char * looks better to me for immutable byte data - I 
agree that some compilers have an option to treat plain char as 
unsigned, but it is rarely used and it's an unwise option in most cases.

std::vector<unsigned char> is definitely the fellow for mutable byte 
data. std::vector<char> smells too much like a mutable string.
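
On the Python side, this immutable/mutable split has a natural counterpart 
in bytes and bytearray. The sketch below is plain Python 3. Mapping 
const unsigned char * to bytes and std::vector<unsigned char> to bytearray 
is an assumption for illustration, not a description of existing 
Boost.Python converters:

```python
# Assumed counterpart of const unsigned char * (immutable byte data).
immutable = bytes([0x48, 0x69])

# Assumed counterpart of std::vector<unsigned char> (mutable byte data).
mutable = bytearray([0x48, 0x69])

# In-place modification is allowed on a bytearray.
mutable[0] = 0x68

try:
    # bytes does not support item assignment, so this raises TypeError.
    immutable[0] = 0x68
    immutable_rejected_write = False
except TypeError:
    immutable_rejected_write = True
```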

I appreciate that const char * defaulting to UTF-8 might be 
controversial - after all, ISO C++ has remained very agnostic about 
the matter (see 
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm for a 
proposed slight change). I have rationales though:

1. While unicode in C++ source anywhere other than in string literals 
is a very bad idea, both MSVC and GCC support having the entire 
source file in UTF-8 text and unicode characters in string literals 
being incorporated as-is. This method of incorporating non-European 
letters into string literals is too easy to avoid using ;). It's 
certainly very common in Asia and until the u8 literal specifier is 
added to C++ there isn't an easy workaround. It wouldn't be very 
open-minded of us to assume the world can still easily make do with 
7-bit ASCII.

2. A *lot* of library code, e.g. GUI library code, most of Linux or 
indeed the GNU C library, has moved to UTF-8 for its string literals. 
Interfacing strings between BPL and this library code is made much 
easier and more natural if BPL takes UTF-8 as its default.

3. "char *" in C and C++ means "a string" in the minds of most 
programmers. Having it be a set of bytes might be standards correct 
but let's be clear, C style strings have always had escape sequences 
so they have never been entirely pure byte sequences. Making this 
immutable bytes instead will make BPL unnatural to use - expect 
questions here on c++-sig :)

4. Chances are that world + dog is going to move to UTF-8 eventually 
anyway, and that means all C++ source code. Might as well make that 
the scenario that requires the least typing.
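
One property behind these rationales can be checked directly: UTF-8 is a 
superset of 7-bit ASCII, so existing ASCII-only literals are unaffected by a 
UTF-8 default, while non-ASCII literals only round-trip if both sides agree 
on UTF-8. A small sketch in plain Python 3, where the byte strings stand in 
for what a compiler would embed from a UTF-8 source file (an assumption for 
illustration):

```python
ascii_literal = b"Hello"
# Pure ASCII bytes decode identically as ASCII and as UTF-8.
same = ascii_literal.decode("ascii") == ascii_literal.decode("utf-8")

# "日本語" as it would be stored in a UTF-8 encoded source file.
utf8_literal = "日本語".encode("utf-8")
round_trip = utf8_literal.decode("utf-8")

# Decoding the same bytes as Latin-1 does not fail, but yields mojibake.
mojibake = utf8_literal.decode("latin-1")
```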

Anyway, I expect few will agree with me, but that's my opinion.

Cheers,
Niall