[Twisted-Python] Sending unicode strings

Hello, I was undoubtedly surprised when I found out that I cannot pass unicode strings to Twisted. In 1.3 twisted.internet.abstract looks like this:

    def write(self, data):
        assert isinstance(data, str), "Data must be a string."
        if not self.connected:
            return

and in 2.0:

    def write(self, data):
        if isinstance(data, unicode): # no, really, I mean it
            raise TypeError("Data must be not be unicode")

Why do you mean it? Why I can't send unicode through twisted? It's ridiculous that I have to convert UTF8 strings to ISO on the client side and then once again from ISO to UTF8 on the server side, so I suppose you've got really good excuse.
--
Michal Chruszcz -=- Never seen

On Sat, Apr 23, 2005, Michal Chruszcz wrote:
Why do you mean it? Why I can't send unicode through twisted? It's ridiculous that I have to convert UTF8 strings to ISO on the client side and then once again from ISO to UTF8 on the server side, so I suppose you've got really good excuse.
Insofar as you'd call "design decisions" "excuses". Why is it ridiculous that you need to do this? It's not always going to be Twisted on the other end after all (at least, Twisted can't assume that at the level we're talking about), and there are umpteen million encodings that can be used for Unicode, so encoding all Unicode objects into bytes using the UTF8 encoding seems to be making a hell of an assumption. Your protocol is the right place to make that decision/assumption/assertion, write(...) is not.
http://twistedmatrix.com/projects/core/documentation/howto/faq.html#auto25
-Mary

On Apr 23, 2005, at 1:34 AM, Michal Chruszcz wrote:
I was undoubtedly surprised when I found out that I cannot pass unicode strings to Twisted. In 1.3 twisted.internet.abstract looks like this:

    def write(self, data):
        assert isinstance(data, str), "Data must be a string."
        if not self.connected:
            return

and in 2.0:

    def write(self, data):
        if isinstance(data, unicode): # no, really, I mean it
            raise TypeError("Data must be not be unicode")

Why do you mean it? Why I can't send unicode through twisted? It's ridiculous that I have to convert UTF8 strings to ISO on the client side and then once again from ISO to UTF8 on the server side, so I suppose you've got really good excuse.
You must ALWAYS encode or decode unicode at I/O boundaries in any programming language/framework. Unicode has no encoding. You must choose one at the I/O boundary, but that choice is up to you. I suggest you read up on the hows and whys of Unicode, because apparently you missed something.

Specifically, Twisted's transports are for writing BYTES (not text). Unicode is strictly a bunch of characters that have no inherent byte representation. The unicode type has nothing at all to do with UTF-8, I'm not sure why you decided they were related. Technically the unicode type is represented internally with either UCS-2 or UCS-4 depending on Python's configuration options.

The same is true for file objects in Python. Though writing to them will automatically coerce to/from some default encoding (sys.getdefaultencoding()), usually ASCII, which hurts more than helps. Twisted takes the high road and explicitly provides no automagic conversion for unicode objects. If it did, your program would probably crash at random places if users of your application typed in non-ascii characters, because you didn't think enough about unicode before deciding to use it. Now you are required to have a modicum of understanding about what you're doing when you use unicode, so it is far less likely that you will write code that has such silly bugs.

In more sane environments than Python, you will NOT have a single type that can represent both data and text at the same time (Python's str is evil). Additionally, it is often the case that text types in more sane environments don't have a single internal representation (so you don't have to pay the N-byte penalty, or encoding/decoding costs at I/O boundaries for text you never really manipulate, etc.).
-bob
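
[Editorial illustration] The point that text has no inherent byte representation can be seen in a few lines. This is a minimal sketch in Python 3, where `str` plays the role of the thread's `unicode` type and `bytes` plays the role of the thread's `str`; the sample text and variable names are chosen only for illustration.

```python
# Python 3: str is text (the thread's `unicode`), bytes is data (the thread's `str`).
text = "Micha\u0142"  # "Michał": five ASCII letters plus one non-ASCII letter

# The same characters become different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")
latin2_bytes = text.encode("iso-8859-2")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)
print(latin2_bytes)
print(utf16_bytes)

# ASCII cannot represent the last character, so a silent ASCII
# conversion at the I/O boundary would fail for this input.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print("ASCII encoding failed:", ascii_failed)
```

Because the receiver only sees bytes, both ends have to agree on the encoding; nothing in the byte stream itself says which one was used.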

Michal Chruszcz wrote:
Why do you mean it? Why I can't send unicode through twisted? It's ridiculous that I have to convert UTF8 strings to ISO on the client side
... to convert Unicode strings to UTF-8 on the client side ...
and then once again from ISO to UTF8 on the server side, so I suppose
...from UTF-8 to Unicode on the server side...
Personally, I think ass-u-ming Unicode is encoded as UTF-8 would have been sane, but I can understand that not everyone agrees; e.g. Java wants UCS-16 if I remember correctly. And not serializing to UTF-8 by default catches errors that would otherwise cause mysterious things to happen. Maybe add a transport wrapper class that has

    def write(self, data):
        if isinstance(data, unicode):
            data = self.serializeUnicode(data)
        self.wrapped.write(data)

of course, that means the Protocol receiving this needs to convert from the used serialization format to Unicode.
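
[Editorial illustration] The wrapper idea above could be fleshed out roughly as follows. This is a sketch in Python 3 (so the check is against `str` rather than `unicode`), not Twisted code: `RecordingTransport` is a hypothetical stand-in for a real transport, and the `encoding` parameter and the class name `UnicodeSerializingTransport` are assumptions added for the example. Only `write`, `serializeUnicode` and `wrapped` come from the message.

```python
class RecordingTransport:
    """Hypothetical stand-in for a real transport; it only collects bytes."""

    def __init__(self):
        self.written = []

    def write(self, data):
        # Mirror the behaviour discussed in the thread: refuse text outright.
        if not isinstance(data, bytes):
            raise TypeError("Data must be bytes")
        self.written.append(data)


class UnicodeSerializingTransport:
    """Wraps a transport and serializes text before passing it on."""

    def __init__(self, wrapped, encoding="utf-8"):
        self.wrapped = wrapped
        self.encoding = encoding  # assumption: the caller picks the encoding

    def serializeUnicode(self, data):
        return data.encode(self.encoding)

    def write(self, data):
        if isinstance(data, str):
            data = self.serializeUnicode(data)
        self.wrapped.write(data)


transport = RecordingTransport()
wrapper = UnicodeSerializingTransport(transport)
wrapper.write("Micha\u0142")  # text is encoded on the way out
wrapper.write(b"\r\n")        # bytes pass through untouched
print(transport.written)
```

The receiving side still has to decode with the same encoding, which is the caveat the message ends on.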

On Apr 23, 2005, at 2:17 AM, Tommi Virtanen wrote:
Michal Chruszcz wrote:
Why do you mean it? Why I can't send unicode through twisted? It's ridiculous that I have to convert UTF8 strings to ISO on the client side
... to convert Unicode strings to UTF-8 on the client side ...
and then once again from ISO to UTF8 on the server side, so I suppose
...from UTF-8 to Unicode on the server side...
Personally, I think ass-u-ming Unicode is encoded as UTF-8 would have been sane, but I can understand that not everyone agrees; e.g. Java wants UCS-16 if I remember correctly. And not serializing to UTF-8 by default catches errors that would otherwise cause mysterious things to happen.
The most mysterious of things is that with such ass-umptions you put unicode in and you get str out. This is especially bad because your default encoding is not utf-8. So, your program explodes, people die, and you have to clean up the mess later. -bob

Tommi Virtanen wrote:
Personally, I think ass-u-ming Unicode is encoded as UTF-8 would have been sane, but I can understand that not everyone agrees; e.g. Java wants UCS-16 if I remember correctly. And not serializing to UTF-8 by default catches errors that would otherwise cause mysterious things to happen.
Most of the time, you should know the encoding. Instead of forcing the protocol to do the work, why not just have a way of setting the expected encoding for write() and similar methods? If the encoding is not set (ie, None), then raise the exception. Otherwise, use the specified encoding. This would have the added readability advantage in that unicode encoding -- uhh code -- wouldn't have to be sprinkled throughout the protocol classes -- only in places where the encoding is actually set -- in HTTP's headers for example. -Ken
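
[Editorial illustration] One way to read this proposal is sketched below in Python 3. Nothing here is an existing Twisted API: `EncodingAwareWriter` and its `write_bytes` callback are hypothetical names, used only to show the "raise unless an encoding has been set" behaviour being suggested.

```python
class EncodingAwareWriter:
    """Hypothetical writer: text is accepted only once an encoding is set."""

    def __init__(self, write_bytes, encoding=None):
        self._write_bytes = write_bytes  # callable that accepts bytes
        self.encoding = encoding         # None means "no text allowed yet"

    def write(self, data):
        if isinstance(data, str):
            if self.encoding is None:
                raise TypeError("No encoding set; cannot write text")
            data = data.encode(self.encoding)
        self._write_bytes(data)


sent = []
writer = EncodingAwareWriter(sent.append)

# With no encoding configured, text is rejected, as in the proposal.
try:
    writer.write("caf\u00e9")
    rejected = False
except TypeError:
    rejected = True

# Once the protocol learns the encoding (e.g. from a header), text is accepted.
writer.encoding = "iso-8859-1"
writer.write("caf\u00e9")
print(rejected, sent)
```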

On Apr 25, 2005, at 12:01 PM, Ken Kinder wrote:
Tommi Virtanen wrote:
Personally, I think ass-u-ming Unicode is encoded as UTF-8 would have been sane, but I can understand that not everyone agrees; e.g. Java wants UCS-16 if I remember correctly. And not serializing to UTF-8 by default catches errors that would otherwise cause mysterious things to happen.
Most of the time, you should know the encoding. Instead of forcing the protocol to do the work, why not just have a way of setting the expected encoding for write() and similar methods? If the encoding is not set (ie, None), then raise the exception. Otherwise, use the specified encoding. This would have the added readability advantage in that unicode encoding -- uhh code -- wouldn't have to be sprinkled throughout the protocol classes -- only in places where the encoding is actually set -- in HTTP's headers for example.
    import codecs

    class MyProtocol(....):
        def __init__(self, encoding='ascii'):
            self.textwriter = codecs.getwriter(encoding)(self.transport)

        def write_text(self, s):
            self.textwriter.write(s)

        def write(self, s):
            self.transport.write(s)

This way write_text will verify that you are only sending valid strings in the chosen encoding. If you call write_text() with a str then it will be decoded using sys.getdefaultencoding() and then encoded using the chosen encoding, so it really does guarantee that all strings sent with write_text are valid (at this level). You should really keep separate what you're doing with raw bytes (write) and what you're doing with text (write_text) as they are different beasts. There is no need to sprinkle this everywhere, just make it a mix-in or whatever and use as appropriate.
-bob
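
[Editorial illustration] A self-contained variant of the mix-in idea, written for Python 3 and runnable without Twisted. It is a sketch under assumptions: `RecordingTransport`, `TextWritingMixin` and `attachTransport` are hypothetical names. Since a Twisted protocol normally receives its transport when the connection is made rather than at construction time, the sketch builds the codec writer in a separate step once the transport is known.

```python
import codecs


class RecordingTransport:
    """Hypothetical stand-in for a real transport; it only collects bytes."""

    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)


class TextWritingMixin:
    """Keeps text output (write_text) separate from raw bytes (write)."""

    encoding = "ascii"

    def attachTransport(self, transport):
        # Hypothetical hook: call this once the transport exists.
        self.transport = transport
        # codecs.getwriter returns a StreamWriter class for the encoding;
        # instantiating it wraps any object that has a write() method.
        self.textwriter = codecs.getwriter(self.encoding)(transport)

    def write_text(self, s):
        self.textwriter.write(s)

    def write(self, s):
        self.transport.write(s)


class Utf8Sender(TextWritingMixin):
    encoding = "utf-8"


sender = Utf8Sender()
sender.attachTransport(RecordingTransport())
sender.write_text("Micha\u0142")  # encoded as UTF-8 before reaching the transport
sender.write(b"\r\n")             # raw bytes bypass the codec writer
print(sender.transport.written)

# With the default 'ascii' encoding, invalid text is caught at write_text().
strict = TextWritingMixin()
strict.attachTransport(RecordingTransport())
try:
    strict.write_text("Micha\u0142")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
print("rejected by ascii writer:", ascii_rejected)
```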
participants (5)
- Bob Ippolito
- Ken Kinder
- Mary Gardiner
- Michal Chruszcz
- Tommi Virtanen