[C++-sig] Some thoughts on py3k support
Niall Douglas
s_sourceforge at nedprod.com
Thu Mar 19 01:05:24 CET 2009
On 18 Mar 2009 at 2:07, Haoyu Bai wrote:
> The first thing that comes to mind is the build system. For Python 3, we
> would have a separate build target, e.g. libboost_python.so
> built for Python 2.x and libboost_python3.so for Python 3. This would
> match the current situation, in which most Linux distros package Python 3
> so that it lives alongside Python 2.x. There are two build systems
> for Boost now: cmake and bjam. Personally I want to start with cmake,
> and eventually bring both up to date for Python 3.
I'd prefer a slightly different naming convention for the eventual
case when Boost.Python itself bumps a version. How about this:
libboost_python_py2.so
libboost_python_py3.so
I'd also be happy with (remembering that current Boost.Python is v2):
libboost_python2_py2.so
libboost_python2_py3.so
> I have read some of the Boost.Python code, and it is
> understandable to me. And I'd say the usage of template
> metaprogramming is really smart! Thanks to the high-level abstraction,
> only a little code interfaces with the Python C-API directly,
> so there should not be that much work.
Hmm, we'll see. It's much trickier than you might think.
> However, there are some things we need to take care of. One of them is
> that, in Python 3, a string is unicode (and the old string class is called
> bytes in Python 3). So if we have a C function
>
> char const* hello(); // returns "Hello"
>
> According to the current behavior of the Boost.Python converters, the
> wrapped function in Python 3 will return b"Hello" (which is a bytes
> object, not a string). So code like this will break:
>
> if "Hello" == hello(): ...
>
> Because the string object "Hello" is not equal to the bytes object b"Hello"
> returned by hello(). We may change the behavior of the converter to return
> a unicode string in Python 3, which would keep most existing code
> compatible. Some code will still really need a single-byte string
> returned; a new converter can be explicitly specified for that.
One shouldn't be doing such a comparison anyway IMHO, though the
idiom of equivalence between C++ immutable strings and python
immutable strings is long-standing. Also, we need to fix booleans not
working quite properly in BPL.
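For reference, the str/bytes mismatch described in the quoted text can be reproduced in plain Python 3 without Boost.Python at all. This is only a sketch: the hello() below is a stand-in written in Python, and it assumes the converter hands the C string back as a bytes object, as the quoted text describes.

```python
# Stand-in for the wrapped C++ function `char const* hello()`.
# Assumption: the converter returns the raw C string as bytes.
def hello():
    return b"Hello"

raw = hello()

# In Python 3 a str never compares equal to a bytes object.
same_as_str = ("Hello" == raw)

# Decoding explicitly makes the comparison meaningful again.
same_after_decode = ("Hello" == raw.decode("utf-8"))

print(same_as_str, same_after_decode)  # prints: False True
```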
> There are more issues similar to this. I'll figure out more and write
> a detailed proposal as soon as possible.
I have my own opinion on unicode and judging by the other posts, I'll
be disagreeing with just about everyone else.
Firstly, I'd like to state that Python v3 has ditched the old string
system for very good reasons and this change more than any other has
created source incompatibilities in most code. One cannot expect much
difference in Boost.Python - code *should* need to be explicitly
ported.
Much like the open() function with text (and reusing its machinery),
I propose that the *default* encoding for immutable const char * be
specifiable, though in most cases it would be taken from LC_LANG and
come out as UTF-8. That's right, const char * will be UTF-8 by default,
though it's overridable. This default encoding should be a
per-Python-interpreter setting (i.e. it uses whatever open() uses),
though it can be temporarily overridden using a specifier template.
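A rough sketch of the "default encoding, but overridable" idea, written on the Python side only. The helper names default_text_encoding() and to_text() are hypothetical and not part of Boost.Python; the sketch also assumes that open() in text mode falls back to locale.getpreferredencoding(False), which is what the Python 3 documentation describes for most versions and configurations.

```python
import locale

def default_text_encoding():
    """Interpreter-wide default: the locale-derived encoding that
    open() is assumed to use when no encoding is passed."""
    return locale.getpreferredencoding(False)

def to_text(raw, encoding=None):
    """Decode a C-style byte string (hypothetical helper).
    Uses the interpreter-wide default unless the caller overrides it."""
    return raw.decode(encoding or default_text_encoding())

# Default path: pure ASCII decodes the same under any ASCII-compatible default.
greeting = to_text(b"Hello")

# Override path: the caller knows this particular string is Latin-1.
overridden = to_text("caf\u00e9".encode("latin-1"), encoding="latin-1")
```

The override argument plays the role of the specifier template mentioned above: the default is global, but a single call site can opt out of it.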
const unsigned char * looks better to me for immutable byte data - I
agree that some compilers have the option for char * == unsigned char
*, but these are rare and it's an unwise option in most cases.
std::vector<unsigned char> is definitely the fellow for mutable byte
data. std::vector<char> smells too much like a mutable string.
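On the Python 3 side the natural counterparts would presumably be bytes (immutable) and bytearray (mutable). The mapping in the comments below is my assumption for illustration, not an existing Boost.Python converter:

```python
# Hypothetical mapping, not an existing Boost.Python converter:
#   const unsigned char *       -> bytes      (immutable)
#   std::vector<unsigned char>  -> bytearray  (mutable)
immutable = bytes([0x48, 0x69])   # b"Hi"
mutable = bytearray(immutable)    # independent, editable copy

mutable[0] = 0x4A                 # in-place edit works on a bytearray

try:
    immutable[0] = 0x4A           # bytes does not support item assignment
    rejected = False
except TypeError:
    rejected = True
```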
I appreciate that const char * defaulting to UTF-8 might be
controversial - after all, ISO C++ has remained very agnostic about
the matter (see
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
for a proposed slight change). I have rationales though:
1. While unicode in C++ source anywhere other than in string literals
is a very bad idea, both MSVC and GCC support having the entire
source file in UTF-8 text and unicode characters in string literals
being incorporated as-is. This method of incorporating non-European
letters into string literals is too easy to avoid using ;). It's
certainly very common in Asia and until the u8 literal specifier is
added to C++ there isn't an easy workaround. It wouldn't be very
open-minded of us to assume the world can still easily make do with
ASCII-7.
2. A *lot* of library code e.g. GUI library code, most of Linux or
indeed the GNU C library, has moved to UTF-8 for its string literals.
Interfacing strings between BPL and this library code is made much
easier and more natural if it takes UTF-8 as default.
3. "char *" in C and C++ means "a string" in the minds of most
programmers. Having it be a set of bytes might be standards correct
but let's be clear, C style strings have always had escape sequences
so they have never been entirely pure byte sequences. Making this
immutable bytes instead will make BPL unnatural to use - expect
questions here on c++-sig :)
4. Chances are that world + dog is going to move to UTF-8 eventually
anyway and that means all C++ source code. Might as well make that
the least typing required scenario.
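To make the rationales above a little more concrete, here is a small Python 3 sketch of why UTF-8 is a low-risk default: ASCII-only literals produce exactly the same bytes under both encodings, while non-European text only survives the round trip under UTF-8. The sample strings are arbitrary examples of mine.

```python
# ASCII-only text is byte-for-byte identical in ASCII and UTF-8,
# so existing ASCII literals are unaffected by a UTF-8 default.
ascii_same = ("Hello".encode("utf-8") == "Hello".encode("ascii"))

# A non-European literal, written with escapes so that this example
# does not depend on the source file's own encoding.
literal = "\u65e5\u672c\u8a9e"
utf8_bytes = literal.encode("utf-8")      # three bytes per character here
round_trip = utf8_bytes.decode("utf-8")

# The same bytes cannot be interpreted as 7-bit ASCII.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```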
Anyway, I expect few will agree with me, but that's my opinion.
Cheers,
Niall