[IPython-dev] Buffers

Tue Jul 27 16:13:36 EDT 2010

On Tue, Jul 27, 2010 at 12:23 PM, Brian Granger <ellisonbg at gmail.com> wrote:
> This is definitely an issue.  Also, someone could set their own custom
> unicode encoding by hand and that would mess this up as well.
>
>>
>> If it is a problem, then there are some options:
>>
>> - disallow communication between ucs 2/4 pythons.
>
> But this doesn't account for other encoding/decoding setups.

Note that when I mention ucs2/4, I mean the *internal* python
storage of all unicode objects.  That is: ucs2/4 is how the buffer
underlying a unicode string is laid out in memory.  There are no other
internal storage setups for Python: this is strictly a compile-time
flag and can only be either ucs2 or ucs4.

You can see the value by typing:

In [1]: import sys

In [2]: sys.maxunicode
Out[2]: 1114111

That's ucs-4, and that number (0x10FFFF) is the highest code point in
the current unicode standard.  If you get 65535 (2^16 - 1) instead, it
means you have a ucs2 build, and python can only store code points in
the BMP directly (basic multilingual plane, which holds nearly all
living languages but not some math symbols, musical symbols and some
extended Asian characters); anything outside the BMP has to be
represented as a surrogate pair.

Does that make sense?
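
As a minimal sketch of the check above (the helper name unicode_build
is made up for illustration, it is not part of Python or IPython), the
build type can be read off sys.maxunicode like this:

```python
import sys

def unicode_build(maxunicode=None):
    """Classify a Python build by its largest supported code point.

    Hypothetical helper: defaults to this interpreter's sys.maxunicode.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:       # 65535: narrow (ucs2) build
        return "ucs2"
    if maxunicode == 0x10FFFF:     # 1114111: wide (ucs4) build
        return "ucs4"
    raise ValueError("unexpected sys.maxunicode: %r" % (maxunicode,))

print(unicode_build())
```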

Note that additionally, it's exceedingly rare for anyone to set up a
custom encoding for unicode.  It's hard to do right, requires plumbing
in the codecs module, and I think Python supports out of the box
enough encodings that I can't imagine why anyone would write a new
encoding.  But regardless, if a string has been encoded then it's OK:
now it's bytes, and there's no problem.
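
To illustrate the "now it's bytes" point, here is a small sketch using
a character outside the BMP.  The byte sequence is fixed by the utf-8
definition itself, so it should not depend on how the interpreter
stores the string internally:

```python
# MUSICAL SYMBOL G CLEF (U+1D11E) lives outside the BMP.
clef = u"\U0001D11E"

# Once encoded, this is plain bytes: four of them in utf-8.
wire = clef.encode("utf-8")
assert wire == b"\xf0\x9d\x84\x9e"

# Decoding on the receiving side recovers the original string.
assert wire.decode("utf-8") == clef
```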

>> - detect a mismatch and encode/decode all unicode strings to utf-8 on
>> send/receive, but allow raw buffer sending if there's no mismatch.
>
> This will be tough though if users set their own encoding.

No, the issue with users having something other than utf-8 is
orthogonal to this.  The idea would be: if the two ends of the
transmission have conflicting ucs internals, then all unicode strings
are sent as utf-8.  If a user sends an encoded string, then that's
just a bunch of bytes and it doesn't matter how they encoded it, since
they will be responsible for decoding it on the other end.

But I still don't like this approach because the ucs2/4 mismatch is a
pair-wise problem, and for a multi-node setup managing this pair-wise
switching of protocols can be a nightmare.  And let's not even get
started on what pub/sub sockets would do with this...
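
A rough sketch of why the pair-wise switching gets messy.  Everything
here is hypothetical: the wire_format rule, the node names, and the
assumption that each node somehow already knows its peers'
sys.maxunicode.  Even with three nodes, the format depends on which
pair is talking:

```python
from itertools import combinations

UCS2_MAX = 0xFFFF      # sys.maxunicode on a ucs2 build
UCS4_MAX = 0x10FFFF    # sys.maxunicode on a ucs4 build

def wire_format(local_max, remote_max):
    # Hypothetical rule: raw buffers only when both builds match.
    return "raw" if local_max == remote_max else "utf-8"

# Hypothetical cluster: node name -> that node's sys.maxunicode.
nodes = {"a": UCS4_MAX, "b": UCS4_MAX, "c": UCS2_MAX}

# One decision per pair of nodes, so the table grows as n*(n-1)/2.
plan = {
    (x, y): wire_format(nodes[x], nodes[y])
    for x, y in combinations(sorted(nodes), 2)
}
for pair, fmt in sorted(plan.items()):
    print(pair, fmt)
```

A publisher sending one message to all of its subscribers has no
single "pair" to consult, which is where this scheme stops working.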

>> - *always* encode/decode.
>>
>
> I think this is the option that I prefer (having users do this in their
> application code).

Yes, now that I think of pub/sub sockets, I don't think we have a
choice.  It's a bit unfortunate that Python recently decided *not* to
standardize on a storage scheme:

http://mail.python.org/pipermail/python-dev/2008-July/080886.html

because it means forever paying the price of encoding/decoding in this context.
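
For concreteness, a minimal sketch of the "always encode/decode"
option.  The names to_wire, from_wire and publish are made up, and
plain lists stand in for real sockets; the point is only that the
sender encodes once and every receiver decodes, whatever its build:

```python
def to_wire(text):
    """Encode a unicode string to utf-8 bytes before sending."""
    return text.encode("utf-8")

def from_wire(data):
    """Decode utf-8 bytes back to a unicode string on receipt."""
    return data.decode("utf-8")

def publish(text, inboxes):
    # Encode once; every subscriber gets the same bytes.
    payload = to_wire(text)
    for inbox in inboxes:
        inbox.append(payload)

# Three hypothetical subscribers, each with a list as its inbox.
inboxes = [[], [], []]
message = u"caf\xe9 \U0001D11E"
publish(message, inboxes)

received = [from_wire(inbox[0]) for inbox in inboxes]
assert received == [message] * 3
```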

Cheers,

f

ps - as you can tell, I've been finally doing my homework on unicode,
in preparation for an eventual 3.x transition :)