[IPython-dev] Buffers
MinRK
benjaminrk at gmail.com
Tue Jul 27 18:16:04 EDT 2010
Okay, so it sounds like we should never interpret unicode objects as simple
strings, if I am understanding the arguments correctly.
I certainly don't think that sending anything providing the buffer interface
should raise an exception, though. It should be up to the user to know
whether the buffer will be legible on the other side.
The situation I'm concerned about is that json gives you unicode strings,
whether that was the input or not.
import json
s1 = 'word'
j = json.dumps(s1)
s2 = json.loads(j)
# s2 == u'word', even though s1 was a plain str
Now, if you have that logic internally, and you are sending messages based
on messages you received, then unless you wrap _every single thing_ you pass
to send in str(), you are calling things like send(u'word'). I really
don't think that should raise an error, but trunk surely does.
The other option is to always interpret unicode objects like everything
else, always sending their buffer and trusting that the receiving end will
call decode (which may require that the message be copied at least one
extra time). This would also mean that if A sends something packed by json
to B, B unpacks it, and it included a str to be sent on to C, then B has a
unicode-wrapped version of it (not a str). If B then sends it on to C, C
will get a string that will _not_ be the same as the one A packed and sent
to B. I think this is terrible, since it seems like such an obvious (already
done) fix in zmq.
I think that the vast majority of the time you are faced with unicode
strings, they are in fact simple str instances that got wrapped, and we
should expect that and deal with it.
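That expectation can be sketched as a tiny send-side shim. This is a
hypothetical helper of my own, not actual pyzmq code, written in Python 3
syntax, with encode('utf-16-le') standing in for sending the raw buffer of
a UCS2 build:

```python
def prepare_send(msg, default_codec='ascii'):
    """Hypothetical send-side rule: a unicode string that is really
    just a wrapped simple str goes out as plain bytes in the sender's
    default encoding; anything else falls back to the raw buffer
    (approximated here by UTF-16-LE, as on a UCS2 build)."""
    if isinstance(msg, bytes):
        return msg  # already plain bytes: send as-is
    try:
        # the common case: json handed us a wrapped simple str
        return msg.encode(default_codec)
    except UnicodeEncodeError:
        # not representable in the default codec: send the buffer
        return msg.encode('utf-16-le')
```

So prepare_send(u'word') gives back the same bytes as sending 'word'
directly, and only genuinely non-ASCII text takes the buffer path.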
I decided to run some tests, since I currently have a UCS2 machine (OS X
10.6.4) and a UCS4 machine (Ubuntu 10.04).
They are both running my `patches' zmq branch right now, and I'm having no
problems.
case 1: sys.defaultencoding = utf8 on mac, ascii on ubuntu.
a.send(u'who') # valid ascii, valid utf-8, ascii string sent
b.recv()
# 'who'
u=u'whoπ'
# u'who\xcf\x80'
a.send(u'whoπ') # valid ascii, valid utf-8, utf string sent
b.recv().decode('utf-8')
# u'who\xcf\x80'
case 2: sys.defaultencoding = ascii,ascii
a.send(u'who') # valid ascii, string sent
b.recv()
# 'who'
u=u'whoπ'
u
# u'who\xcf\x80'
a.send(u'whoπ') # invalid ascii, buffer sent
s = b.recv()
# 'w\x00h\x00o\x00\xcf\x00\x80\x00'
s.decode('utf-8')
# UnicodeError (invalid utf-8)
s.decode('utf16')
# u'who\xcf\x80'
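For what it's worth, case 2's bytes can be reproduced on a modern Python
(this is my reconstruction, not the original session): typing π at a UTF-8
terminal into a Python 2 prompt with no source encoding produced
u'who\xcf\x80' (the two UTF-8 bytes of π taken as separate code points),
and on a UCS2 build the object's in-memory buffer is its UTF-16-LE
encoding:

```python
# what the UCS2 Python 2 actually stored for u'whoπ' typed at a
# UTF-8 terminal: the two bytes of π taken as separate code points
u = 'who\xcf\x80'

raw = u.encode('utf-16-le')  # ~ the in-memory buffer zmq sent
# raw == b'w\x00h\x00o\x00\xcf\x00\x80\x00', as in the transcript

try:
    raw.decode('utf-8')      # fails, as in the transcript
except UnicodeDecodeError:
    pass

assert raw.decode('utf-16-le') == u  # round-trips, as observed
```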
It seems that the _buffer_ of a unicode object is always utf16.
I also did it with utf-8 on both sides, and threw in some latin-1, and there
was no difference between those and case 1.
I can't find the problem here.
As far as I can tell, a unicode object is either:
a) a valid string for the sender, in which case the string is sent in the
sender's default encoding, and on the receiver
    sock.recv().decode(sender.defaultcodec)
gets the object back; or
b) not a valid string for the sender, in which case the utf16 buffer is
sent, and on the receiver
    sock.recv().decode('utf16')
always seems to work.
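Rules (a) and (b) fold into one receive-side helper. Again a hypothetical
sketch of my own, not pyzmq API, in Python 3 syntax, with 'utf-16-le'
playing the role of 'utf16' above:

```python
def reconstruct(raw, sender_codec):
    """Hypothetical receive-side rule: try the sender's default codec
    first; if the payload was the raw unicode buffer instead, it
    decodes as UTF-16."""
    try:
        return raw.decode(sender_codec)   # rule (a)
    except UnicodeDecodeError:
        return raw.decode('utf-16-le')    # rule (b)
```

One caveat: a utf16 buffer whose bytes happen to be valid in
sender_codec (e.g. pure-ASCII text, whose buffer is ASCII plus NULs)
would be taken down branch (a), so this only works because senders use
branch (b) exclusively for strings that str() would choke on.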
I even tried various instances of specifying the encoding as latin-1, etc.,
and sending math symbols (√, ∫) in various directions, and invariably the
only thing I needed to know on the receiver was the default encoding on the
sender. Everything was reconstructed properly with either
s.decode(sender.defaultcodec) or s.decode('utf16'), depending solely on
whether str(u) would raise on the sender.
Are there specific symbols and/or directions where I should see a problem?
Based on my reading, I figured that math symbols, if anything, would be the
ones to fail, but they certainly don't in either direction.
-MinRK
On Tue, Jul 27, 2010 at 13:13, Fernando Perez <fperez.net at gmail.com> wrote:
> On Tue, Jul 27, 2010 at 12:23 PM, Brian Granger <ellisonbg at gmail.com>
> wrote:
> > This is definitely an issue. Also, someone could set their own custom
> > unicode encoding by hand and that would mess this up as well.
> >
> >>
> >> If it is a problem, then there are some options:
> >>
> >> - disallow communication between ucs 2/4 pythons.
> >
> > But this doesn't account for other encoding/decoding setups.
>
> Note that when I mention ucs2/4, that refers to the *internal* python
> storage of all unicode objects. That is: ucs2/4 is how the buffer,
> under the hood for a unicode string, is written in memory. There are
> no other encoding/decoding setups for Python, this is strictly a
> compile-time flag and can only be either ucs2 or ucs4.
>
> You can see the value by typing:
>
> In [1]: sys.maxunicode
> Out[1]: 1114111
>
> That's ucs-4, and that number is the whole of the current unicode
> standard. If you get instead 2^16, it means you have a ucs2 build,
> and python can only encode strings in the BMP (basic multilingual
> plane, where all living languages are stored but not math symbols,
> musical symbols and some extended Asian characters).
>
> Does that make sense?
>
> Note that additionally, it's exceedingly rare for anyone to set up a
> custom encoding for unicode. It's hard to do right, requires plumbing
> in the codecs module, and I think Python supports out of the box
> enough encodings that I can't imagine why anyone would write a new
> encoding. But regardless, if a string has been encoded then it's OK:
> now it's bytes, and there's no problem.
>
> >> - detect a mismatch and encode/decode all unicode strings to utf-8 on
> >> send/receive, but allow raw buffer sending if there's no mismatch.
> >
> > This will be tough though if users set their own encoding.
>
> No, the issue with users having something other than utf-8 is
> orthogonal to this. The idea would be: - if both ends of the
> transmission have conflicting ucs internals, then all unicode strings
> are sent as utf-8. If a user sends an encoded string, then that's
> just a bunch of bytes and it doesn't matter how they encoded it, since
> they will be responsible for decoding it on the other end.
>
> But I still don't like this approach because the ucs2/4 mismatch is a
> pair-wise problem, and for a multi-node setup managing this pair-wise
> switching of protocols can be a nightmare. And let's not even get
> started on what pub/sub sockets would do with this...
>
> >> - *always* encode/decode.
> >>
> >
> > I think this is the option that I prefer (having users do this in their
> > application code).
>
> Yes, now that I think of pub/sub sockets, I don't think we have a
> choice. It's a bit unfortunate that Python recently decided *not* to
> standardize on a storage scheme:
>
> http://mail.python.org/pipermail/python-dev/2008-July/080886.html
>
> because it means forever paying the price of encoding/decoding in this
> context.
>
> Cheers,
>
> f
>
> ps - as you can tell, I've been finally doing my homework on unicode,
> in preparation for an eventual 3.x transition :)
>