[Python-ideas] isascii()/islatin1()/isbmp()
Steven D'Aprano
steve at pearwood.info
Mon Jul 2 10:52:24 CEST 2012
On Sun, Jul 01, 2012 at 04:27:25PM +1000, Nick Coghlan wrote:
> Rewinding back to the reasons the question is being asked, the reason
> this information is useful at the Python level is the same reason it
> is useful at the C level: it matters for finding the most efficient
> means of
> representing the text as bytes (which can then have further
> implications for the kind of quoting used, etc). The interesting
> breakpoints can actually be expressed in terms of the number of bits
> in the highest code point:
> 7 - encode as ASCII (or latin-1 or utf-8)
> 8 - encode as latin-1
> 8+ - encode as utf-8
I'm of two minds here.
On the one hand, I question the wisdom of encouraging the use of
anything but UTF-8. It's unfortunate enough that there are still cases
where people have to use older encodings, without encouraging people to
use Latin1 or ASCII in order to save a handful of bytes in a 20K email.
On the other hand, there are use-cases for non-UTF-8 encodings, and
people will want to check whether or not a string is encodable in
various encodings. Why make that harder/slower/less convenient than it
need be?
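
For illustration, mapping the highest code point in a string to the
narrowest suitable encoding (following the 7 / 8 / 8+ bit breakpoints
quoted above) might look something like the sketch below. The helper
name narrowest_encoding is made up for the example, and the max()/ord()
scan is exactly the O(n) cost discussed further down:

    def narrowest_encoding(text):
        # O(n) scan: find the highest code point in the string.
        highest = max(ord(c) for c in text) if text else 0
        if highest < 0x80:       # fits in 7 bits
            return "ascii"       # also valid latin-1 and utf-8
        elif highest < 0x100:    # fits in 8 bits
            return "latin-1"
        else:                    # needs more than 8 bits
            return "utf-8"

    payload = message.encode(narrowest_encoding(message))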
> Specifically, it's a payload microoptimisation for the latin-1 case -
> the latin-1 string will be shorter than the corresponding utf-8 string
Just to be clear here, you're referring to byte strings, yes?
> (how much shorter depends on the number of non-ASCII characters). I
> believe it also makes an additional difference in the email case by
> changing the kind of quoting that is used to something with lower
> overhead that can't handle utf-8.
>
> The "try it and see" approach suffers a potentially high speed penalty
> if the non-latin-1 characters appear late in the string:
>
> try:
>     # Likely no need to try ASCII, since there's no efficiency
>     # gain over latin-1
>     payload = message.encode("latin-1")
> except UnicodeEncodeError:
>     payload = message.encode("utf-8")
>
> Using max() and ord() to check in advance doesn't help, since that
> *locks in* the O(n) penalty.
>
> The reason I think a max_code_point() method is a good potential
> solution is that it can be advertised as O(n) worst case, but
> potentially O(1) if the implementation caches the answer internally.
The downside is that the caller is then responsible for interpreting
that value (i.e. mapping a max code point to an encoding).
The other downside is that it doesn't do anything to help those who are
stuck with legacy encodings. Although maybe that doesn't matter, since
they will just do the "try it and see" approach.
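
To make that concrete, here is roughly the interpretation step every
caller would have to repeat if a max_code_point() method existed. The
method is only a proposal in this thread, so the sketch below fakes it
with an O(n) scan; a real implementation could cache the answer:

    def max_code_point(text):
        # Stand-in for the proposed str.max_code_point() method.
        return max(ord(c) for c in text) if text else 0

    def pick_encoding(text):
        # The caller-side mapping from a code point to an encoding name.
        if max_code_point(text) < 0x100:
            return "latin-1"   # covers the ASCII case as well
        return "utf-8"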
> Another alternative would be a __max__ and __min__ protocol that
> allowed efficient answers for the max() and min() builtins. The latter
> would have the advantage of allowing other containers (like range
> objects) to provide efficient implementations.
+1 on that, although I think that should be a separate issue.
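
For what it's worth, a container that knows its own extremes could
answer such a protocol in O(1). Nothing like this exists today (the
builtin max() does not consult any __max__ hook), so this is purely a
sketch of what the proposal might enable:

    class Range:
        # Toy stand-in for range(), to show an O(1) __max__.
        def __init__(self, start, stop):
            self.start, self.stop = start, stop

        def __iter__(self):
            return iter(range(self.start, self.stop))

        def __max__(self):
            # O(1): the largest element is known without iterating.
            if self.stop <= self.start:
                raise ValueError("max() arg is an empty sequence")
            return self.stop - 1

    def smart_max(obj):
        # How the builtin might dispatch under the proposed protocol.
        special = getattr(type(obj), "__max__", None)
        if special is not None:
            return special(obj)
        return max(obj)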
--
Steven