[Python-ideas] isascii()/islatin1()/isbmp()

Steven D'Aprano steve at pearwood.info
Mon Jul 2 10:52:24 CEST 2012


On Sun, Jul 01, 2012 at 04:27:25PM +1000, Nick Coghlan wrote:

> Rewinding back to the reasons the question is being asked, the reason
> this information is useful at the Python level is the same reason it
> is useful at the C level: it matters for finding the most efficient
> means of representing the text as bytes (which can then have further
> implications for the kind of quoting used, etc). The interesting
> breakpoints can actually be expressed in terms of the number of bits
> in the highest code point:
> 7 - encode as ASCII (or latin-1 or utf-8)
> 8 - encode as latin-1
> 9+ - encode as utf-8

I'm of two minds here. 

On the one hand, I question the wisdom of encouraging the use of 
anything but UTF-8. It's unfortunate enough that there are still cases 
where people have to use older encodings, without encouraging people to 
use Latin-1 or ASCII in order to save a handful of bytes in a 20K email.

On the other hand, there are use-cases for non-UTF-8 encodings, and 
people will want to check whether or not a string is encodable in 
various encodings. Why make that harder/slower/less convenient than it 
need be?
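
For what it's worth, the only spelling for that check today is the 
"try it and see" round-trip. A minimal sketch of the idiom (the helper 
name is mine, not an existing API):

    def is_encodable(text, encoding):
        # "Try it and see": attempt the encode and catch the failure.
        # Hypothetical helper, not an existing str method.
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            return False
        return True

    is_encodable("café", "ascii")    # -> False
    is_encodable("café", "latin-1")  # -> True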


> Specifically, it's a payload micro-optimisation for the latin-1 case -
> the latin-1 string will be shorter than the corresponding utf-8 string

Just to be clear here, you're referring to byte strings, yes?
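
Assuming so, the size difference is easy to demonstrate:

    >>> s = "déjà vu"            # two non-ASCII characters
    >>> len(s.encode("latin-1")) # one byte per character
    7
    >>> len(s.encode("utf-8"))   # é and à take two bytes each
    9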


> (how much shorter depends on the number of non-ASCII characters). I
> believe it also makes an additional difference in the email case by
> changing the kind of quoting that is used to something with lower
> overhead that can't handle utf-8.
> 
> The "try it and see" approach suffers a potentially high speed penalty
> if the non-latin-1 characters appear late in the string:
> 
>     try:
>         # Likely no need to try ASCII, since there's no
>         # efficiency gain over latin-1
>         payload = message.encode("latin-1")
>     except UnicodeEncodeError:
>         payload = message.encode("utf-8")
> 
> Using max() and ord() to check in advance doesn't help, since that
> *locks in* the O(n) penalty.
> 
> The reason I think a max_code_point() method is a good potential
> solution is that it can be advertised as O(n) worst case, but
> potentially O(1) if the implementation caches the answer internally.

The downside is that the caller is then responsible for interpreting 
that value (i.e. mapping a max code point to an encoding).
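
That is, every caller ends up writing something like this (a sketch 
only, since max_code_point() is the proposed method, not something 
that exists today):

    def narrowest_encoding(message):
        # Map the proposed max_code_point() result to a codec name.
        cp = message.max_code_point()  # hypothetical method
        if cp < 128:
            return "ascii"
        elif cp < 256:
            return "latin-1"
        else:
            return "utf-8"

    payload = message.encode(narrowest_encoding(message))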

The other downside is that it doesn't do anything to help those who are 
stuck with legacy encodings. Although maybe that doesn't matter, since 
they will just do the "try it and see" approach.


> Another alternative would be a __max__ and __min__ protocol that
> allowed efficient answers for the max() and min() builtins. The latter
> would have the advantage of allowing other containers (like range
> objects) to provide efficient implementations.

+1 on that, although I think that should be a separate issue.
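
To sketch what I'm agreeing with (purely speculative, since max() has 
no such hook today), a container that knows its own bounds could then 
answer in O(1):

    class IntRange:
        # Speculative sketch of a __max__ protocol.
        def __init__(self, start, stop):
            self.start, self.stop = start, stop
        def __iter__(self):
            return iter(range(self.start, self.stop))
        def __max__(self):
            if self.stop <= self.start:
                raise ValueError("max() of empty range")
            return self.stop - 1   # O(1), no iteration needed

    def protocol_max(obj):
        # How a protocol-aware max() builtin might dispatch.
        special = getattr(type(obj), "__max__", None)
        if special is not None:
            return special(obj)
        return max(obj)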


-- 
Steven
