On Sun, Jul 01, 2012 at 04:27:25PM +1000, Nick Coghlan wrote:
> Rewinding back to the reasons the question is being asked, the reason this
> information is useful at the Python level is the same reason it is useful
> at the C level: it matters for finding the most efficient means of
> representing the text as bytes (which can then have further implications
> for the kind of quoting used, etc). The interesting breakpoints can
> actually be expressed in terms of the number of bits in the highest code
> point:
>
>     7  - encode as ASCII (or latin-1 or utf-8)
>     8  - encode as latin-1
>     8+ - encode as utf-8
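The breakpoints above can be sketched as a small helper. This is purely illustrative; no max-code-point API exists on str, so it scans the string with max() and ord() (always O(n)):

```python
def best_encoding(text):
    """Map the highest code point in `text` to the smallest encoding,
    following the 7-bit / 8-bit / wider breakpoints described above.
    Hypothetical helper, for illustration only."""
    if not text:
        return "ascii"          # an empty string trivially fits in ASCII
    highest = max(map(ord, text))
    if highest < 0x80:          # fits in 7 bits
        return "ascii"
    elif highest < 0x100:       # fits in 8 bits
        return "latin-1"
    else:                       # anything wider
        return "utf-8"
```

For example, best_encoding("café") returns "latin-1", while any string containing a code point above U+00FF falls through to "utf-8".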
I'm of two minds here. On the one hand, I question the wisdom of encouraging the use of anything but UTF-8. It's unfortunate enough that there are still cases where people have to use older encodings, without encouraging people to use Latin1 or ASCII in order to save a handful of bytes in a 20K email. On the other hand, there are use-cases for non-UTF-8 encodings, and people will want to check whether or not a string is encodable in various encodings. Why make that harder/slower/less convenient than it need be?
> Specifically, it's a payload microoptimisation for the latin-1 case - the
> latin-1 string will be shorter than the corresponding utf-8 string
Just to be clear here, you're referring to byte strings, yes?
> (how much shorter depends on the number of non-ASCII characters). I
> believe it also makes an additional difference in the email case by
> changing the kind of quoting that is used to something with lower overhead
> that can't handle utf-8.
>
> The "try it and see" approach suffers a potentially high speed penalty if
> the non-latin-1 characters appear late in the string:
>
>     try:
>         # Likely no need to try ASCII, since there's no
>         # efficiency gain over latin-1
>         payload = message.encode("latin-1")
>     except UnicodeEncodeError:
>         payload = message.encode("utf-8")
> Using max() and ord() to check in advance doesn't help, since that *locks
> in* the O(n) penalty.
>
> The reason I think a max_code_point() method is a good potential solution
> is that it can be advertised as O(n) worst case, but potentially O(1) if
> the implementation caches the answer internally.
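One way that caching could work, as a sketch only (CPython strings are immutable C objects, so a Python wrapper class stands in here; the max_code_point() name is the hypothetical API under discussion, not a real method):

```python
class Text:
    """Toy wrapper showing how max_code_point() could be O(n) worst case
    but O(1) on repeat calls, by caching the answer internally."""

    def __init__(self, s):
        self._s = s
        self._max_cp = None         # not yet computed

    def max_code_point(self):
        if self._max_cp is None:    # first call pays the O(n) scan
            self._max_cp = max(map(ord, self._s)) if self._s else 0
        return self._max_cp         # later calls return the cached value
```

Since strings are immutable, the cached value can never go stale, which is what makes the amortised O(1) claim safe.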
> The downside is that the caller is then responsible for interpreting that
> value (i.e. mapping a max code point to an encoding). The other downside
> is that it doesn't do anything to help those who are stuck with legacy
> encodings. Although maybe that doesn't matter, since they will just do the
> "try it and see" approach.
>
> Another alternative would be a __max__ and __min__ protocol that allowed
> efficient answers for the max() and min() builtins. The latter would have
> the advantage of allowing other containers (like range objects) to provide
> efficient implementations.
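No such protocol exists in Python, but the dispatch could look something like this sketch, with a toy range-like container supplying a constant-time answer (smart_max, Range and __max__ are all hypothetical names):

```python
def smart_max(obj):
    """max() variant that consults a (hypothetical) __max__ hook first."""
    hook = getattr(type(obj), "__max__", None)
    if hook is not None:
        return hook(obj)        # container knows its own maximum: O(1)
    return max(obj)             # fall back to the usual O(n) scan

class Range:
    """Toy range-like container with a constant-time __max__."""
    def __init__(self, start, stop):
        if start >= stop:
            raise ValueError("empty range has no maximum")
        self.start, self.stop = start, stop

    def __iter__(self):
        return iter(range(self.start, self.stop))

    def __max__(self):
        return self.stop - 1    # no iteration needed
```

Looking the hook up on the type (rather than the instance) mirrors how Python resolves other special methods like __len__ and __iter__.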
+1 on that, although I think that should be a separate issue.

-- 
Steven