[C++-sig] Some thoughts on py3k support

Thu Mar 19 14:53:35 CET 2009

On Thu, Mar 19, 2009 at 8:05 AM, Niall Douglas
<s_sourceforge at nedprod.com> wrote:
> On 18 Mar 2009 at 2:07, Haoyu Bai wrote:
>
> I'd prefer a slightly different naming convention for the eventual
> case when Boost.Python itself bumps a version. How about this:
>
> libboost_python_py2.so
> libboost_python_py3.so
>
> I'd also be happy with (remembering that current Boost.Python is v2):
>
> libboost_python2_py2.so
> libboost_python2_py3.so
>

This naming style is a bit clearer, but it would break users' build
scripts, though that is not a big problem. But once the whole Python
community has moved to py3k and 2.x has become history, should we change
the name back from libboost_python_py3.so to libboost_python.so?

>
> I have my own opinion on unicode and judging by the other posts, I'll
> be disagreeing with just about everyone else.
>
> Firstly, I'd like to state that Python v3 has ditched the old string
> system for very good reasons and this change more than any other has
> created source incompatibilities in most code. One cannot expect much
> difference in Boost.Python - code *should* need to be explicitly
> ported.
>
> Much like the open() function with text (and reusing its machinery),
> I propose you need to specify the *default* encoding for immutable
> const char * though it defaults from LC_LANG in most cases to UTF-8 -
> that's right, const char * will be UTF-8 by default though it's
> overridable. This default encoding should be a per-python interpreter
> setting (i.e. it uses whatever open() uses) though it can be
> temporarily overridden using a specifier template.
>
> const unsigned char * looks better to me for immutable byte data - I
> agree that some compilers have the option for char * == unsigned char
> *, but these are rare and it's an unwise option in most cases.
>
> std::vector<unsigned char> is definitely the fellow for mutable byte
> data. std::vector<char> smells too much like a mutable string.
>
> I appreciate that const char * defaulting to UTF-8 might be
> controversial - after all, ISO C++ has remained very agnostic about
> the matter (see
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
> for a proposed slight change). I have rationales though:
>
> 1. While unicode in C++ source anywhere other than in string literals
> is a very bad idea, both MSVC and GCC support having the entire
> source file in UTF-8 text and unicode characters in string literals
> being incorporated as-is. This method of incorporating non-European
> letters into string literals is too easy to avoid using ;). It's
> certainly very common in Asia and until the u8 literal specifier is
> added to C++ there isn't an easy workaround. It wouldn't be very open
> minded of us to assume the world can still easily make do with
> ASCII-7.
>
> 2. A *lot* of library code e.g. GUI library code, most of Linux or
> indeed the GNU C library, has moved to UTF-8 for its string literals.
> Interfacing strings between BPL and this library code is made much
> easier and natural if it takes UTF-8 as default.
>
> 3. "char *" in C and C++ means "a string" in the minds of most
> programmers. Having it be a set of bytes might be standards correct
> but let's be clear, C style strings have always had escape sequences
> so they have never been entirely pure byte sequences. Making this
> immutable bytes instead will make BPL unnatural to use - expect
> questions here on c++-sig :)
>
> 4. Chances are that world + dog is going to move to UTF-8 eventually
> anyway and that means all C++ source code. Might as well make that
> the least typing required scenario.
>
> Anyway, I expect few will agree with me, but that's my opinion.
>
> Cheers,
> Niall
>

Do you mean that, for converting between char * and Python strings, we
would use the same encoding as the Python interpreter's default
encoding, which can be obtained with sys.getdefaultencoding() in Python?
Or would we let users choose the default encoding for their extension
module via the Boost.Python API? I'd say either of these is very
flexible. Nice idea!

Also, I think using the same default encoding as Python's
sys.getdefaultencoding() is a bit better, since it keeps things unified
across the whole Python environment. And it is configurable at Python's
startup via sys.setdefaultencoding().
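
To make the two candidate defaults concrete, here is a rough Python 3
sketch. It only shows where each encoding would come from on the Python
side. char_ptr_to_str() is a made-up stand-in for the const char * to
str conversion, not an existing Boost.Python function. Note that under
Python 3, sys.getdefaultencoding() reports 'utf-8' regardless of the
locale, while open() consults the locale, so the two options can give
different answers on the same machine.

```python
import locale
import sys


def candidate_default_encodings():
    """Return the two encodings discussed as possible defaults."""
    return {
        # Used by bytes.decode() / str.encode() when no encoding is given.
        "sys_default": sys.getdefaultencoding(),
        # Locale-derived encoding that open() falls back to for text files.
        "open_default": locale.getpreferredencoding(False),
    }


def char_ptr_to_str(raw, encoding=None):
    """Hypothetical stand-in for a const char * -> str conversion.

    `raw` plays the role of the C string's bytes; `encoding` plays the
    role of a per-call override of the default.
    """
    if encoding is None:
        encoding = sys.getdefaultencoding()
    return raw.decode(encoding)


encodings = candidate_default_encodings()
text = char_ptr_to_str(b"caf\xc3\xa9")            # UTF-8 bytes, default encoding
latin = char_ptr_to_str(b"caf\xe9", "latin-1")    # explicit override
```

Both calls give the same four-character string; only the byte
representation on the C++ side differs, which is why the choice of
default matters.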

I feel that the differences between char *, unsigned char *, their
const versions and the std::vector versions of them would be a bit
complicated and confusing. We could document them clearly, but things
would still be complicated. Any thoughts?
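
For the documentation, a small table might already help. Below is a
sketch of how I read the proposal, written as Python so the target types
are explicit. The mapping itself, the PROPOSED_MAPPING name and the
to_python() helper are only my assumptions for illustration (in
particular, using bytearray for the std::vector case is my guess);
nothing here is an existing Boost.Python rule.

```python
# Hypothetical mapping from C++ argument types to Python 3 types,
# as I understand the proposal. Not an existing Boost.Python rule.
PROPOSED_MAPPING = {
    "const char *": str,                      # text, decoded with an encoding
    "const unsigned char *": bytes,           # immutable byte data
    "std::vector<unsigned char>": bytearray,  # mutable byte data
}


def to_python(cpp_type, raw, encoding="utf-8"):
    """Convert raw bytes to the Python type proposed for `cpp_type`."""
    target = PROPOSED_MAPPING[cpp_type]
    if target is str:
        # Only the text case needs an encoding.
        return raw.decode(encoding)
    # bytes(raw) and bytearray(raw) both copy the data unchanged.
    return target(raw)


as_text = to_python("const char *", b"abc")
as_bytes = to_python("const unsigned char *", b"\x00\x01\x02")
as_buffer = to_python("std::vector<unsigned char>", b"\x00\x01\x02")
as_buffer[0] = 0xFF  # allowed: bytearray is mutable, bytes is not
```

Seen this way there are only three cases on the Python side, which may
be easier to explain than the full set of C++ spellings.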

Thanks!

-- Haoyu Bai