On Apr 25, 2005, at 12:01 PM, Ken Kinder wrote:
Tommi Virtanen wrote:
Personally, I think ass-u-ming Unicode is encoded as UTF-8 would have been sane, but I can understand that not everyone agrees; e.g. Java wants UTF-16, if I remember correctly. And not serializing to UTF-8 by default catches errors that would otherwise cause mysterious things to happen.
Most of the time, you should know the encoding. Instead of forcing the protocol to do the work, why not just have a way of setting the expected encoding for write() and similar methods? If the encoding is not set (i.e., None), then raise an exception. Otherwise, use the specified encoding. This would have the added readability advantage that unicode encoding -- uhh code -- wouldn't have to be sprinkled throughout the protocol classes -- only in places where the encoding is actually set -- in HTTP's headers, for example.
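A rough sketch of what that proposal might look like, written for present-day Python 3, where text is str and raw data is bytes. The EncodingWriter name and the use of io.BytesIO as a stand-in transport are assumptions made for illustration; none of this is an existing Twisted API.

```python
import io


class EncodingWriter:
    """Hypothetical wrapper around a transport-like object with write(bytes)."""

    def __init__(self, transport, encoding=None):
        self.transport = transport
        self.encoding = encoding  # None means "nobody has set it yet"

    def write(self, data):
        if isinstance(data, bytes):
            # Raw bytes pass through untouched.
            self.transport.write(data)
        elif self.encoding is None:
            # Text with no encoding set: fail loudly instead of guessing.
            raise TypeError("cannot write text: no encoding has been set")
        else:
            self.transport.write(data.encode(self.encoding))


buf = io.BytesIO()  # stand-in for a real transport (assumption)
out = EncodingWriter(buf)

try:
    out.write("caf\u00e9")
    refused = False
except TypeError:
    refused = True

out.encoding = "utf-8"  # e.g. set once the HTTP headers are known
out.write("caf\u00e9")
```

The first write is refused because no encoding has been set; the second one goes out as UTF-8 bytes. The only place that has to know about encodings is the one line that assigns out.encoding.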
    import codecs

    class MyProtocol(....):
        def __init__(self, encoding='ascii'):
            self.textwriter = codecs.getwriter(encoding)(self.transport)

        def write_text(self, s):
            self.textwriter.write(s)

        def write(self, s):
            self.transport.write(s)

This way write_text will verify that you are only sending valid strings in the chosen encoding. If you call write_text() with a str then it will be decoded using sys.getdefaultencoding() and then encoded using the chosen encoding, so it really does guarantee that all strings sent with write_text are valid (at this level).

You should really keep separate what you're doing with raw bytes (write) and what you're doing with text (write_text) as they are different beasts. There is no need to sprinkle this everywhere, just make it a mix-in or whatever and use as appropriate.

-bob
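For anyone who wants to try the mix-in idea in isolation, here is a minimal self-contained sketch in Python 3. TextWriterMixin and DemoProtocol are hypothetical names, io.BytesIO stands in for a real transport, and the codec writer is built lazily on the assumption that self.transport may not be assigned yet when the object is constructed. Note that the str-decoding behaviour described above is specific to Python 2; in Python 3 write_text simply takes text and write takes bytes.

```python
import codecs
import io


class TextWriterMixin:
    """Hypothetical mix-in; the host class must provide self.transport."""

    encoding = "ascii"

    def write(self, data):
        # Raw bytes go straight to the transport.
        self.transport.write(data)

    def write_text(self, text):
        # Build the codec writer on first use, once self.transport exists.
        writer = getattr(self, "_textwriter", None)
        if writer is None:
            writer = codecs.getwriter(self.encoding)(self.transport)
            self._textwriter = writer
        # Raises UnicodeEncodeError if text does not fit the encoding.
        writer.write(text)


class DemoProtocol(TextWriterMixin):
    def __init__(self, transport, encoding="ascii"):
        self.transport = transport
        self.encoding = encoding


proto = DemoProtocol(io.BytesIO(), encoding="ascii")
proto.write(b"\x00\x01")   # bytes: no checking at all
proto.write_text("hello")  # text: encoded as ASCII

try:
    proto.write_text("caf\u00e9")  # not representable in ASCII
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The non-ASCII string is rejected before anything reaches the transport, so the bytes actually sent are only the raw prefix followed by the encoded "hello".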