[Python-ideas] isascii()/islatin1()/isbmp()

Sun Jul 1 08:27:25 CEST 2012

On Sun, Jul 1, 2012 at 3:48 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> encodable would indirectly expose max_code_point since it would only be
> really useful and likely used when max_code_point was available and
> applicable. In other words, s.encodable('latin1') is equivalent to
> s.max_code_point == 255.
>
> if isbmp *is* useful, I don't think it can be duplicated with .encodable.
> Python seems not to have a ucs-2 codec.

Rewinding to why the question is being asked: this information is
useful at the Python level for the same reason it is useful at the C
level. It matters for finding the most efficient means of representing
the text as bytes (which can then have further implications for the
kind of quoting used, etc). The interesting breakpoints can actually
be expressed in terms of the number of bits in the highest code point:
7 or fewer - encode as ASCII (or latin-1 or utf-8)
8 - encode as latin-1
9 or more - encode as utf-8
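
As a rough sketch of those breakpoints in today's Python (the helper
name pick_encoding is made up for illustration, and the scan it does
is the plain O(n) max()/ord() one, not a cached answer):

```python
def pick_encoding(text: str) -> str:
    """Return a codec name based on the bit width of the highest code point.

    Hypothetical helper for illustration; not an existing str method.
    """
    # O(n) scan over the string; default=0 covers the empty string.
    highest = max(map(ord, text), default=0)
    bits = highest.bit_length()
    if bits <= 7:
        return "ascii"      # would also encode as latin-1 or utf-8
    if bits == 8:
        return "latin-1"
    return "utf-8"
```

For instance, pick_encoding("caf\xe9") selects "latin-1", while any
string containing "\u20ac" falls through to "utf-8".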

Specifically, it's a payload microoptimisation for the latin-1 case -
the latin-1 string will be shorter than the corresponding utf-8 string
(how much shorter depends on the number of non-ASCII characters). I
believe it also makes an additional difference in the email case by
changing the kind of quoting that is used to something with lower
overhead that can't handle utf-8.
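
To put a number on the size difference: every code point in the range
U+0080 to U+00FF takes one byte in latin-1 and two bytes in utf-8, so
the saving is one byte per non-ASCII character. A small sketch (the
function name utf8_overhead is invented here, and it assumes the text
is latin-1 encodable):

```python
def utf8_overhead(text: str) -> int:
    """Extra bytes utf-8 needs compared to latin-1 for the same text.

    Assumes every character is at or below U+00FF; otherwise the
    latin-1 encode raises UnicodeEncodeError.
    """
    return len(text.encode("utf-8")) - len(text.encode("latin-1"))


# One extra byte per non-ASCII character:
assert utf8_overhead("plain") == 0
assert utf8_overhead("caf\xe9") == 1
assert utf8_overhead("na\xefve caf\xe9") == 2
```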

The "try it and see" approach suffers a potentially high speed penalty
if the non-latin-1 characters appear late in the string:

    try:
        # Likely no need to try ASCII, since there's no efficiency
        # gain over latin-1
        payload = message.encode("latin-1")
    except UnicodeEncodeError:
        payload = message.encode("utf-8")

Using max() and ord() to check in advance doesn't help, since that
*locks in* the O(n) penalty.

The reason I think a max_code_point() method is a good potential
solution is that it can be advertised as O(n) worst case, but
potentially O(1) if the implementation caches the answer internally.
Another alternative would be a __max__ and __min__ protocol that
allowed efficient answers for the max() and min() builtins. The latter
would have the advantage of allowing other containers (like range
objects) to provide efficient implementations.
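
Neither idea exists today, so the following is only a sketch of how
the two could look from pure Python. CachedText, Span and fast_max are
all hypothetical names, and __max__ is not a protocol that the max()
builtin actually consults:

```python
class CachedText:
    """Hypothetical str wrapper that caches its highest code point."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._max_cp = None  # filled in lazily on first request

    def max_code_point(self) -> int:
        # O(n) the first time, O(1) on every later call.
        if self._max_cp is None:
            self._max_cp = max(map(ord, self.text), default=0)
        return self._max_cp


class Span:
    """Hypothetical ascending integer range: start <= value < stop."""

    def __init__(self, start: int, stop: int) -> None:
        if stop <= start:
            raise ValueError("Span must be non-empty")
        self.start = start
        self.stop = stop

    def __iter__(self):
        return iter(range(self.start, self.stop))

    def __max__(self) -> int:
        # Known without iterating: the last value in the span.
        return self.stop - 1


def fast_max(obj):
    """Stand-in for max() that honours a hypothetical __max__ hook."""
    hook = getattr(type(obj), "__max__", None)
    if hook is not None:
        return hook(obj)
    return max(obj)  # ordinary O(n) fallback
```

With that in place, fast_max(Span(0, 10**12)) answers immediately,
while fast_max([3, 1, 2]) still falls back to the normal scan.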

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia