[Python-ideas] Adding str.isascii() ?

Wed Jan 31 08:44:02 EST 2018

I like the idea of str.isdigit(ascii=True): it would behave as
str.isdigit() and str.isascii() combined. It's easy to implement and
likely to be very efficient. I'm just not sure it's needed that often.
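
To make those semantics concrete, here is a minimal sketch in pure Python.
The helper name isdigit_ascii is hypothetical: str.isdigit() takes no
ascii parameter today, and str.isascii() needs Python 3.7 or later.

```python
def isdigit_ascii(s: str) -> bool:
    """Hypothetical stand-in for the proposed str.isdigit(ascii=True)."""
    # True only if every character is a digit AND the whole string is ASCII.
    return s.isdigit() and s.isascii()


print(isdigit_ascii("2018"))                      # plain ASCII digits
print(isdigit_ascii("\u0662\u0660\u0661\u0668"))  # Arabic-Indic digits: digits, but not ASCII
print(isdigit_ascii(""))                          # empty string: isdigit() is False
```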

At least, I guess some users may be surprised that str.isdigit()
is "Unicode aware" and accepts non-ASCII digits, as int(str) does.
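
A short sketch of that behaviour on CPython 3 follows. The sample string of
Arabic-Indic digits is an arbitrary choice for illustration.

```python
s = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

# str.isdigit() accepts any Unicode digit, not just "0".."9".
print(s.isdigit())

# int() also accepts Unicode decimal digits and converts them.
print(int(s))

# Encoding to ASCII is one way to reject such input explicitly.
try:
    s.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```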

Victor

2018-01-31 12:18 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
> Hm, it seems I was in too much of a hurry to implement it...
>
>>
>> There were discussions about this. See for example
>> https://bugs.python.org/issue18814.
>>
>> In short, there are two considerations that prevented adding this feature:
>>
>> 1. This function can have constant computational complexity in CPython
>> (just check a single bit), but other implementations may provide only
>> linear computational complexity.
>>
>
> Yes.  There is no O(1) guarantee for .isascii().
> But I expect the UTF-8 based string implementation that PyPy will have
> can achieve O(1); just test len(s) == __internal_utf8_len(s)
>
> I think if *some* implementations can achieve O(1), it's beneficial
> to implement.
>
>
>> 2. In many cases, right after getting the answer to this question we encode the
>> string to bytes (or decode bytes to string). Thus the most natural way to
>> determine whether the string is ASCII-only is to try encoding it to ASCII.
>>
>
> Yes.  But ASCII is special.
> Someone may want to check for ASCII before passing a string to int(),
> float(), decimal.Decimal(), etc...
> But I don't think there is a real use case for encodings other than ASCII.
>
>> And adding a new method to a basic type has a high bar.
>>
>
> Agree.
>
>> The code in ipaddress
>>
>>         if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str):
>>             cls._report_invalid_netmask(prefixlen_str)
>>         try:
>>             prefixlen = int(prefixlen_str)
>>         except ValueError:
>>             cls._report_invalid_netmask(prefixlen_str)
>>         if not (0 <= prefixlen <= cls._max_prefixlen):
>>             cls._report_invalid_netmask(prefixlen_str)
>>         return prefixlen
>>
>> can be rewritten as:
>>
>>         if not prefixlen_str.isdigit():
>>             cls._report_invalid_netmask(prefixlen_str)
>>         try:
>>             prefixlen = int(prefixlen_str.encode('ascii'))
>>         except UnicodeEncodeError:
>>             cls._report_invalid_netmask(prefixlen_str)
>>         except ValueError:
>>             cls._report_invalid_netmask(prefixlen_str)
>>         if not (0 <= prefixlen <= cls._max_prefixlen):
>>             cls._report_invalid_netmask(prefixlen_str)
>>         return prefixlen
>>
>
> Yes.  But .isascii() will be much faster than try ...
> .encode('ascii') ... except UnicodeEncodeError
> on most Python implementations.
>
>
>> Another possibility -- adding support for a boolean argument in str.isdigit()
>> and similar predicates that switches them to an ASCII-only mode. Such an option
>> would be very useful for the str.strip(), str.split() and str.splitlines()
>> methods. Currently they split on all Unicode whitespace and line
>> separators, but there is a need to split only on ASCII whitespace and the line
>> separators CR, LF and CRLF. In the case of str.strip() and str.split() you can
>> just pass the string of whitespace characters, but there is no such option
>> for str.splitlines().
>>
>
> It sounds like a good idea.  Maybe a keyword-only argument `ascii=False`?
>
> But if we revert the addition of str.isascii() in Python 3.7, the same
> keyword-only argument should be added to int(), float(),
> decimal.Decimal(), fractions.Fraction(), etc...  That's a bit hard.
>
> So I think adding .isascii() is beneficial even if all str.is***()
> methods had an `ascii=False` flag.