[Twisted-Python] PEP3131: non-ascii identifiers
This does not seem to be supported by Python yet. Should that be enabled at all? If one process with PY3 sends such identifiers to a separate process with PY2, that will fail. I am not sure if that would be a problem, whoever uses this must make sure PY3 is used everywhere. If this should be forbidden, I will add a test to test_pb for this. And of course somebody should document that somewhere. There more PEP3131 is used, the more users will fall into this trap. If this should be enabled (which I think is not difficult, at least for pb): At least the patch below will be needed (only for PY3), maybe it is already sufficient. Given that nativeString and networkString are always used (done that for pb). networkString may then return bytes with the high bit set But since networkString is called in many places I want to ask and make sure that it may really be changed this way. https://twistedmatrix.com/documents/14.0.0/core/specifications/banana.html does not speak against it, so I wonder why networkString has that limitation to 7bit. concrete banana-encoded example, from modified test_pb: (the method name is getSimpleä) test_pb still passes with patched nativeString/networkString (but I only have one test for this so far, test_refcount). b'\x07\x80\x07\x82message\x01\x81\x03\x82foo\x0b\x82getSimple\xc3\xa4\x01\x81\x01\x80\x05\x82tuple\x01\x80\n\x82dictionary' diff --git twisted/python/compat.py twisted/python/compat.py index 6f76c39..6919cf6 100644 --- twisted/python/compat.py +++ twisted/python/compat.py @@ -348,10 +348,9 @@ def nativeString(s): raise TypeError("%r is neither bytes nor unicode" % s) if _PY3: if isinstance(s, bytes): - return s.decode("ascii") + return s.decode("utf-8") else: - # Ensure we're limited to ASCII subset: - s.encode("ascii") + return s else: if isinstance(s, unicode): return s.encode("ascii") @@ -428,7 +427,7 @@ if _PY3: def networkString(s): if not isinstance(s, unicode): raise TypeError("Can only convert text to bytes on Python 3, I got %r" % (s,)) - return s.encode('ascii') + return s.encode('utf-8') def networkChar(integer): """ -- Wolfgang
On 09:19 am, wolfgang.kde@rohdewald.de wrote:
This does not seem to be supported by Python yet.
Should that be enabled at all? If one process with PY3 sends such identifiers to a separate process with PY2, that will fail. I am not sure if that would be a problem, whoever uses this must make sure PY3 is used everywhere.
This is why we will *not* change the PB wire protocol as part of the porting work. The wire protocol will remain the same whether you are using Python 2 or Python 3 to run your program or. This is the point of a protocol, after all. It is to let two programs communicate with each other.
If this should be forbidden, I will add a test to test_pb for this. And of course somebody should document that somewhere. There more PEP3131 is used, the more users will fall into this trap.
I'm not exactly sure what you mean here. Using unicode where only bytes are allowed is probably already forbidden throughout PB.
If this should be enabled (which I think is not difficult, at least for pb):
At least the patch below will be needed (only for PY3), maybe it is already sufficient. Given that nativeString and networkString are always used (done that for pb).
networkString may then return bytes with the high bit set
Definitely not.
But since networkString is called in many places I want to ask and make sure that it may really be changed this way.
https://twistedmatrix.com/documents/14.0.0/core/specifications/banana.html does not speak against it, so I wonder why networkString has that limitation to 7bit.
That is the sole purpose of `networkString`. It is a work-around for the inconvenient fact that Python changed the meaning of the string literal syntax from bytes to unicode.
concrete banana-encoded example, from modified test_pb: (the method name is getSimpleä) test_pb still passes with patched nativeString/networkString (but I only have one test for this so far, test_refcount).
b'\x07\x80\x07\x82message\x01\x81\x03\x82foo\x0b\x82getSimple\xc3\xa4\x01\x81\x01\x80\x05\x82tuple\x01\x80\n\x82dictionary'
diff --git twisted/python/compat.py twisted/python/compat.py index 6f76c39..6919cf6 100644 --- twisted/python/compat.py +++ twisted/python/compat.py @@ -348,10 +348,9 @@ def nativeString(s): raise TypeError("%r is neither bytes nor unicode" % s) if _PY3: if isinstance(s, bytes): - return s.decode("ascii") + return s.decode("utf-8") else: - # Ensure we're limited to ASCII subset: - s.encode("ascii") + return s else: if isinstance(s, unicode): return s.encode("ascii") @@ -428,7 +427,7 @@ if _PY3: def networkString(s): if not isinstance(s, unicode): raise TypeError("Can only convert text to bytes on Python 3, I got %r" % (s,)) - return s.encode('ascii') + return s.encode('utf-8')
def networkChar(integer): """
This change definitely won't be acceptable. It completely removes the feature `networkString` exists to provide: verifying that strings that might be either unicode or bytes can still be implicitly combined into bytes. Can you point out the specific places where you think PB needs to start using UTF-8 instead of ASCII? Those are the places that need to be fixed, not the underlying porting helpers they happen to use. Jean-Paul
Am Montag, 8. September 2014, 10:54:44 schrieb exarkun@twistedmatrix.com:
Can you point out the specific places where you think PB needs to start using UTF-8 instead of ASCII? Those are the places that need to be fixed, not the underlying porting helpers they happen to use.
My question is whether PEP3131 should be supported or not. After all, it is an integral part of Python3. I have no opinion about this, I just want to point it out as a potential problem. So you say it should not be supported, and I have no problem with that. Just to be clear about it: In Python3 you can legally write def ÜnicödeMethod€Name(self): or see https://mail.python.org/pipermail/python-3000/2007-June/008172.html for a nice Russian example. but PB cannot transfer those identifiers. Which is not what the user would expect. This should be clearly communicated somewhere in the docs, IMHO. -- Wolfgang
On 11:44 am, wolfgang.kde@rohdewald.de wrote:
Am Montag, 8. September 2014, 10:54:44 schrieb exarkun@twistedmatrix.com:
Can you point out the specific places where you think PB needs to start using UTF-8 instead of ASCII? Those are the places that need to be fixed, not the underlying porting helpers they happen to use.
My question is whether PEP3131 should be supported or not. After all, it is an integral part of Python3. I have no opinion about this, I just want to point it out as a potential problem.
So you say it should not be supported, and I have no problem with that.
Just to be clear about it:
In Python3 you can legally write
def ÜnicödeMethod€Name(self):
or see https://mail.python.org/pipermail/python-3000/2007-June/008172.html for a nice Russian example.
but PB cannot transfer those identifiers. Which is not what the user would expect.
PB supports unicode perfectly well and has for many years. This is why I asked which specific part of PB has a problem. I don't see why you think unicode is a problem here. This doesn't mean there isn't a problem - after all, you've been looking at the code and trying to make it work on Python 3 and I haven't - but you haven't made it clear what that problem is. A failing unit test or a minimal example (<http://sscce.org/>) would communicate this most clearly, but perhaps you can just mention a specific API and give an incomplete example of how it will fail when it runs up against the changes defined by PEP 3131. Jean-Paul
This should be clearly communicated somewhere in the docs, IMHO.
-- Wolfgang
Am Montag, 8. September 2014, 12:04:46 schrieb exarkun@twistedmatrix.com:
PB supports unicode perfectly well and has for many years. This is why I asked which specific part of PB has a problem.
PB transfers names of methods and classes as bytes, not as unicode. Which is logical since PY2 does not support unicode identifiers, and bytes is already a native PY2 string. Unicode is only used for content. It not yet always clear to me what is content and what is a formal string like method names or the *_atom strings which must be bytes, this needs more testing. I guess I should patch banana.py such that it dumps all it encodes or decodes into one file, so I can compare output from PY2/PY3 tests. I was assuming that suddenly transferring method names as unicode would really be a break of wire protocol stability, or do you think otherwise? If you think this is acceptable, I will check if the existing twisted code can handle getting those as unicode without source code changes. Not sure. Just tested this with Python2.6, and I am surprised that it works:
getattr(A,u'x') <unbound method A.x>
Supporting PEP3131 would only introduce a backward-incompatibility. Of course you are right that this is not part of porting. Right now I have a long list of small unsorted git commits, I will have to do a lot of reshuffling and cleaning before I will ask you how to get it into the official codebase. Not all of those commits are strictly porting, some just clean the code, making the porting commits simpler.
A failing unit test or a minimal example (<http://sscce.org/>) would communicate this most clearly, but perhaps you can just mention a specific API and give an incomplete example of how it will fail when it runs up against the changes defined by PEP 3131.
see my first mail in this thread: take test_pb.py, rename getSimple to getSimpleä, run the test. -- Wolfgang
On 01:07 pm, wolfgang.kde@rohdewald.de wrote:
Am Montag, 8. September 2014, 12:04:46 schrieb exarkun@twistedmatrix.com:
PB supports unicode perfectly well and has for many years. This is why I asked which specific part of PB has a problem.
PB transfers names of methods and classes as bytes, not as unicode.
PB actually transfers *everything* as bytes. ;) Bytes are the only thing you can send over a socket. If you have anything else - integers, unicode, whatever - you have to encode them as bytes first. As far as I know, the PB protocol is not specified apart from the implementation in Twisted (and this being the only implementation (I am intentionally disregarding TwistedJava) it must serve as the specification, I think). What you meant above, I think, is that PB represents method names as bytes at the banana layer. That is, when you want to call a remote method, you indicate its name by supplying bytes to the banana encoding layer - not unicode (which is good because banana doesn't actually support unicode at all, that's a jelly feature). This does indeed mean we don't simply want to start sending unicode to refer to methods by name - because we can't! At least, not unless we extend banana to support a new type which we probably don't want to do - that would be another incompatible protocol change and so not allowed (since it could break interoperability between different implementations of PB). So, it is necessary to continue to represent method names using bytes. This is fairly easily done. On Python 3, encode any unicode strings which represent method names (using a well-known encoding, probably UTF-8) when making the call and decode them in the same way when dispatching those calls. This can almost be done at the application level: # Some Python 3 code def remote_ä(self): pass ref.callRemote(u"ä".encode("utf-8")) except that Python 3 has actually changed to enforce the type of the second argument of getattr - if it is not a unicode string then a TypeError is raised - so it's not possible to make the decoding step happen (which one might otherwise have done using `__getattr__` or by adding the encoded name of the method to the class dictionary). So if it is going to be supported in Twisted's PB API then that support probably needs to be in Twisted's PB implementation. The same general idea applies, though. Just move the encoding into the implementation of `callRemote`: def callRemote(self, _name, ...): ... _name.encode("utf-8") ... And add a corresponding decode to the other side (probably in `_recvMessage`). This would make the Python 3 PB API be "method names are unicode strings" which makes sense considering the decisions that were made for Python 3. Note that it does not change the wire protocol - method names are still bytes at the banana level. Or does it? These bytes were previously always an ASCII subset. Is expanding out of the ASCII range an incompatible change? What could this break? Let's say you have Python 2 talking to Python 2. It's already possible to construct a method call like this: class Foo(Referenceable): def remote_foo(self): pass setattr(Foo, u"remote_fooä".encode("utf-8"), Foo.remote_foo) ... ref.callRemote(u"remote_fooä".encode("utf-8")) This actually works. Python 2 doesn't much care about how you name your class attributes. PB doesn't care that the high bit is set in one or more of the bytes in the method name. It all just works. So let's say you have Python 2 talking to Python 3 instead. In Python 3, you can't do that setattr() call (the language and runtime disallow it). But you can have `def remote_fooä(self)` instead. If PB on Python 3 decodes the method name before dispatching it (using UTF-8) then again things work. And if you reverse the situation and PB on Python 3 encodes the method name before sending it, then Python 2 is still happen because it can operate on that UTF-8 encoded byte string. Finally, if Python 3 talks to Python 3 then it also works because the sending side encodes and the receiving side decodes. So we get to make a judgement call here, I think. Without a specification there's no objectively correct answer. So, because the current implementation is actually perfectly compatible with non-ASCII bytes - even though the intent is clearly that you would never have those - combined with the point that I made above, that there are no other PB implementations, I suspect it's fine to expand beyond ASCII here because it won't actually break any real world programs. The only case that I can think of that actually would be a problem is the case where someone is already sending non-ASCII, non-UTF-8 method names around. These might decode wrong or might fail to decode at all. I don't think this is likely enough to worry about - but maybe someone who is doing this will speak up and prove me wrong. ;) And to repeat myself a bit, none of this should change the Python 2 PB API. It should continue accepting bytes - because that's what it always accepted. Separately, we could introduce a new feature to support unicode on Python 2. This would be done in the usual way for introducing new features into any Twisted APIs (and there aren't reasonable backwards compatibility considerations here as far as I can tell). The point would be to make it a little more convenient for Python 2 applications to interact with other PB applications that have decided to use unicode method names. Lastly, on another topic, I am subscribed to the mailing list - you don't have to cc me on your replies. Jean-Paul
Which is logical since PY2 does not support unicode identifiers, and bytes is already a native PY2 string. Unicode is only used for content. It not yet always clear to me what is content and what is a formal string like method names or the *_atom strings which must be bytes, this needs more testing.
I guess I should patch banana.py such that it dumps all it encodes or decodes into one file, so I can compare output from PY2/PY3 tests.
I was assuming that suddenly transferring method names as unicode would really be a break of wire protocol stability, or do you think otherwise? If you think this is acceptable, I will check if the existing twisted code can handle getting those as unicode without source code changes. Not sure. Just tested this with Python2.6, and I am surprised that it works:
getattr(A,u'x') <unbound method A.x>
Supporting PEP3131 would only introduce a backward-incompatibility.
Of course you are right that this is not part of porting.
Right now I have a long list of small unsorted git commits, I will have to do a lot of reshuffling and cleaning before I will ask you how to get it into the official codebase. Not all of those commits are strictly porting, some just clean the code, making the porting commits simpler.
A failing unit test or a minimal example (<http://sscce.org/>) would communicate this most clearly, but perhaps you can just mention a specific API and give an incomplete example of how it will fail when it runs up against the changes defined by PEP 3131.
see my first mail in this thread: take test_pb.py, rename getSimple to getSimpleä, run the test.
-- Wolfgang
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
participants (2)
-
exarkun@twistedmatrix.com
-
Wolfgang Rohdewald