
On Sun, Jul 1, 2012 at 3:48 PM, Terry Reedy <tjreedy@udel.edu> wrote:
encodable would indirectly expose max_code_point since it would only be really useful and likely used when max_code_point was available and applicable. In other words, s.encodable('latin1') is equivalent to s.max_code_point == 255.
If isbmp *is* useful, I don't think it can be duplicated with .encodable. Python seems not to have a ucs-2 codec.
Rewinding back to the reasons the question is being asked, the reason this information is useful at the Python level is the same reason it is useful at the C level: it matters for finding the most efficient means of representing the text as bytes (which can then have further implications for the kind of quoting used, etc). The interesting breakpoints can actually be expressed in terms of the number of bits in the highest code point:

7 - encode as ASCII (or latin-1 or utf-8)
8 - encode as latin-1
8+ - encode as utf-8

Specifically, it's a payload microoptimisation for the latin-1 case - the latin-1 string will be shorter than the corresponding utf-8 string (how much shorter depends on the number of non-ASCII characters). I believe it also makes an additional difference in the email case by changing the kind of quoting that is used to something with lower overhead that can't handle utf-8.

The "try it and see" approach suffers a potentially high speed penalty if the non-latin-1 characters appear late in the string:

    try:
        # Likely no need to try ASCII, since there's no efficiency gain over latin-1
        payload = message.encode("latin-1")
    except UnicodeEncodeError:
        payload = message.encode("utf-8")

Using max() and ord() to check in advance doesn't help, since that *locks in* the O(n) penalty.

The reason I think a max_code_point() method is a good potential solution is that it can be advertised as O(n) worst case, but potentially O(1) if the implementation caches the answer internally.

Another alternative would be a __max__ and __min__ protocol that allowed efficient answers for the max() and min() builtins. The latter would have the advantage of allowing other containers (like range objects) to provide efficient implementations.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
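
For illustration only, the caching idea discussed above could be sketched in pure Python roughly as follows. The Text wrapper class and its encode_compactly() helper are hypothetical names invented for this sketch; str has no max_code_point() method, and a real implementation would presumably live inside the string object itself rather than in a wrapper.

```python
class Text:
    """Hypothetical wrapper that caches the highest code point of a string."""

    def __init__(self, value):
        self.value = value
        self._max_code_point = None  # computed lazily, then cached

    def max_code_point(self):
        # O(n) on the first call, O(1) on every later call.
        if self._max_code_point is None:
            self._max_code_point = max(map(ord, self.value), default=0)
        return self._max_code_point

    def encode_compactly(self):
        # Breakpoints discussed above: a highest code point that fits in
        # 8 bits can be sent as latin-1; anything larger needs utf-8.
        if self.max_code_point() <= 0xFF:
            return "latin-1", self.value.encode("latin-1")
        return "utf-8", self.value.encode("utf-8")


message = Text("caf\u00e9")
charset, payload = message.encode_compactly()
print(charset, len(payload))
```

Unlike the try/except version, the cost of the scan here is paid at most once per string, no matter how many times the encoding decision is made or where the first non-latin-1 character sits.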