[Python-ideas] Adding str.isascii() ?

Tue Jan 30 00:12:52 EST 2018

On 30 January 2018 at 06:54, Chris Barker <chris.barker at noaa.gov> wrote:
> On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>>
>> tcl/tk and JavaScript only support UCS-2 (16 bit) Unicode strings.
>> Dealing with the Supplementary Unicode Planes has the same problems
>> that older "narrow" builds of Python suffered from: single code points
>> were counted as len(2) instead of len(1), slicing could be wrong, etc.
>>
>> There are still many applications which assume Latin-1 data. For
>> instance, I use a media player which displays mojibake when passed
>> anything outside of Latin-1.
>>
>> Sometimes it is useful to know in advance when text you pass to another
>> application is going to run into problems because of the other
>> application's limitations.
>
>
> I'm confused -- isn't the way to do this to encode your text into the
> encoding the other application accepts?
>
> If you really want to know in advance, is it so hard to run it through an
> encode/decode sandwich?
>
> Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it
> not there? Shouldn't it be? If only for this reason?

If you're wanting to check whether or not something lies entirely
within the BMP, check for:

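    2*len(text) == len(text.encode("utf-16-le")) # True iff text is UCS-2

(Use "utf-16-le" or "utf-16-be" rather than plain "utf-16": the latter
prepends a 2-byte BOM, so the two lengths would never match.)

If there's an astral code point in there, then the encoded version
will need 4 bytes (a surrogate pair) for that element, so the result
will end up being longer than it would for UCS-2 data.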
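
As a rough sketch of that check wrapped in a function (the name is_bmp
is made up for illustration, and the explicit "utf-16-le" codec is used
so that no BOM is counted):

```python
def is_bmp(text: str) -> bool:
    """Hypothetical helper: True when no code point needs a surrogate pair."""
    # "utf-16-le" writes 2 bytes per BMP code point and 4 bytes per astral
    # code point, with no byte order mark in front.
    return 2 * len(text) == len(text.encode("utf-16-le"))


print(is_bmp("h\u00e9llo"))      # True: U+00E9 is inside the BMP
print(is_bmp("a\U0001F600"))     # False: U+1F600 is an astral code point
```

Note that a string containing lone surrogates would raise
UnicodeEncodeError here rather than return a result.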

You can also check for pure ASCII in much the same way:

    len(text) == len(text.encode("utf-8")) # True iff text is 7-bit ASCII
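
The same idea as a small function, again only a sketch with an
illustrative name (is_ascii_by_length is not an existing API):

```python
def is_ascii_by_length(text: str) -> bool:
    """Hypothetical helper: True when every code point is below U+0080."""
    # UTF-8 uses exactly one byte per code point only for 7-bit ASCII;
    # anything else takes 2 to 4 bytes, so the encoded form gets longer.
    return len(text) == len(text.encode("utf-8"))


print(is_ascii_by_length("plain text"))   # True
print(is_ascii_by_length("caf\u00e9"))    # False: U+00E9 takes 2 bytes
```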

So this is partly an optimisation question:

- folks want to avoid allocating a bytes object just to throw it away
- folks want to avoid running the equivalent of "max(map(ord, text))"
- folks know that CPython (at least) tracks this kind of info
internally to manage its own storage allocations
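
For comparison, the "max(map(ord, text))" route from the second point
might look something like this. It avoids the throwaway bytes object but
still walks the whole string in Python code. The function and constant
names are invented for the example:

```python
ASCII_LIMIT = 0x80       # first code point outside 7-bit ASCII
LATIN1_LIMIT = 0x100     # first code point outside Latin-1
BMP_LIMIT = 0x10000      # first code point outside the BMP


def max_code_point(text: str) -> int:
    # default=0 keeps the empty string from raising ValueError
    return max(map(ord, text), default=0)


def fits_in(text: str, limit: int) -> bool:
    """Hypothetical helper: True when every code point is below limit."""
    return max_code_point(text) < limit


print(fits_in("caf\u00e9", ASCII_LIMIT))    # False
print(fits_in("caf\u00e9", LATIN1_LIMIT))   # True
print(fits_in("\U0001F600", BMP_LIMIT))     # False
```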

But it's also a readability question: "is_ascii()" and
"is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2
(or the basic multilingual plane) *are*, whereas the current ways of
checking for them require knowing how they *behave*.
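
For encodings that do exist as codecs (ASCII, Latin-1), the "just try
to encode it" approach raised earlier in the thread can be sketched as
below. The helper name can_encode is hypothetical, and this still pays
for the temporary bytes object:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True when text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("caf\u00e9", "latin-1"))   # True
print(can_encode("caf\u00e9", "ascii"))     # False
print(can_encode("\u20ac", "latin-1"))      # False: U+20AC is not in Latin-1
```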

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia