[Python-ideas] Adding str.isascii() ?
Nick Coghlan
ncoghlan at gmail.com
Tue Jan 30 00:12:52 EST 2018
On 30 January 2018 at 06:54, Chris Barker <chris.barker at noaa.gov> wrote:
> On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>>
>> tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings.
>> Dealing with the Supplementary Unicode Planes has the same problems
>> that older "narrow" builds of Python suffered from: single code points
>> were counted as len(2) instead of len(1), slicing could be wrong, etc.
>>
>> There are still many applications which assume Latin-1 data. For
>> instance, I use a media player which displays mojibake when passed
>> anything outside of Latin-1.
>>
>> Sometimes it is useful to know in advance when text you pass to another
>> application is going to run into problems because of the other
>> application's limitations.
>
>
> I'm confused -- isn't the way to do this to encode your text into the
> encoding the other application accepts ?
>
> If you really want to know in advance, is it so hard to run it through an
> encode/decode sandwich?
>
> Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it
> not there? Shouldn't it be? If only for this reason?
If you're wanting to check whether or not something lies entirely
within the BMP, check for:
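2*len(text) == len(text.encode("utf-16-le")) # True iff text is UCS-2
(use "utf-16-le" or "utf-16-be" here: the plain "utf-16" codec prepends a
2-byte BOM, so the lengths would never match)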
If there's an astral code point in there, then the encoded version
will need more than 2 bytes for at least one element, so the result
will end up being longer than it would for UCS-2 data.
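As a rough sketch of that check wrapped in a function (the name is_bmp is made up for illustration, not an existing str method; it assumes the "utf-16-le" codec, which writes no byte-order mark, so BMP text comes out at exactly two bytes per code point):

```python
def is_bmp(text: str) -> bool:
    """Return True if every code point in text lies in the BMP.

    Hypothetical helper, for illustration only. It allocates a
    temporary bytes object just to measure its length. Lone surrogates
    cannot be encoded and raise UnicodeEncodeError.
    """
    # Each BMP code point encodes to 2 bytes; an astral code point
    # needs a surrogate pair, i.e. 4 bytes, so the lengths diverge.
    return 2 * len(text) == len(text.encode("utf-16-le"))


print(is_bmp("caf\u00e9"))    # BMP-only text
print(is_bmp("\U0001F600"))   # one astral code point
```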
You can also check for pure ASCII in much the same way:
len(text) == len(text.encode("utf-8")) # True iff text is 7-bit ASCII
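A minimal sketch of the same idea as a function (again, the name is_ascii is just an illustrative helper): every ASCII code point encodes to a single UTF-8 byte, and anything at U+0080 or above takes two or more bytes, so the lengths only match for pure ASCII.

```python
def is_ascii(text: str) -> bool:
    """Return True if text contains only 7-bit ASCII code points.

    Hypothetical helper, for illustration only. Like the BMP check,
    it builds a throwaway bytes object just to compare lengths.
    """
    # ASCII code points are 1 byte each in UTF-8; everything else is
    # 2 to 4 bytes, which makes the encoded form longer than the text.
    return len(text) == len(text.encode("utf-8"))


print(is_ascii("plain text"))
print(is_ascii("na\u00efve"))
```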
So this is partly an optimisation question:
- folks want to avoid allocating a bytes object just to throw it away
- folks want to avoid running the equivalent of "max(map(ord, text))"
- folks know that CPython (at least) tracks this kind of info
internally to manage its own storage allocations
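For comparison, here is a sketch of the "max(map(ord, text))" route mentioned above. It avoids the throwaway bytes object, but it visits every code point from Python code, which is the cost people want to skip. The function names and the labels it returns are invented for this example; the thresholds are the usual ASCII, Latin-1 and BMP boundaries.

```python
def max_code_point(text: str) -> int:
    """Return the largest code point in text, or 0 for an empty string."""
    return max(map(ord, text), default=0)


def narrowest_range(text: str) -> str:
    """Classify text by the smallest range that holds every code point.

    Hypothetical helper; the returned labels are illustrative only.
    """
    top = max_code_point(text)
    if top < 0x80:
        return "ascii"
    if top < 0x100:
        return "latin-1"
    if top < 0x10000:
        return "bmp"
    return "astral"


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(repr(sample), narrowest_range(sample))
```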
But it's also a readability question: "is_ascii()" and
"is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2
(or the basic multilingual plane) *are*, whereas the current ways of
checking for them require knowing how they *behave*.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia