[Python-ideas] Adding str.isascii() ?

INADA Naoki songofacandy at gmail.com
Wed Jan 31 06:18:42 EST 2018


Hm, it seems I was in too much of a hurry to implement it...

>
> There were discussions about this. See for example
> https://bugs.python.org/issue18814.
>
> In short, there are two considerations that prevented adding this feature:
>
> 1. This function can have constant computational complexity in CPython
> (just check a single bit), but other implementations may provide only
> linear computational complexity.
>

Yes.  There is no O(1) guarantee for .isascii().
But I expect the UTF-8 based string implementation that PyPy will have
can achieve O(1); just test len(s) == __internal_utf8_len(s)

I think if *some* implementations can achieve O(1), it's beneficial
to implement.
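
To illustrate the length comparison, here is a pure-Python sketch.  The
helper name is hypothetical, and encoding here is O(n); the point is only
that an implementation which already stores the UTF-8 length could do the
same comparison in O(1).

```python
def is_ascii_by_utf8_len(s: str) -> bool:
    # Hypothetical helper, for illustration only.
    # An ASCII code point encodes to exactly one UTF-8 byte; any other
    # code point encodes to two or more.  So the string is ASCII-only
    # exactly when the code point count equals the UTF-8 byte count.
    # "surrogatepass" keeps lone surrogates from raising during encoding.
    return len(s) == len(s.encode("utf-8", "surrogatepass"))
```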


> 2. In many cases just after getting the answer to this question we encode the
> string to bytes (or decode bytes to string). Thus the most natural way to
> determine if the string is ASCII-only is to try encoding it to ASCII.
>

Yes.  But ASCII is special.
Someone may want to check for ASCII before passing a string to int(),
float(), decimal.Decimal(), etc...
But I don't think there is a real use case for encodings other than ASCII.
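
A minimal sketch of that use case, assuming str.isascii() is available
as proposed (the function name parse_ascii_int is made up for this
example).  int() alone accepts any Unicode decimal digits, so the ASCII
check has to come first.

```python
def parse_ascii_int(text: str) -> int:
    # Hypothetical helper: reject non-ASCII input before int() sees it.
    if not text.isascii():
        raise ValueError(f"non-ASCII input: {text!r}")
    return int(text)


# Fullwidth digits U+FF11..U+FF13 are accepted by plain int().
fullwidth = "\uff11\uff12\uff13"
plain_result = int(fullwidth)
```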

> And adding a new method to the basic type has a high bar.
>

Agreed.

> The code in ipaddress
>
>         if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str):
>             cls._report_invalid_netmask(prefixlen_str)
>         try:
>             prefixlen = int(prefixlen_str)
>         except ValueError:
>             cls._report_invalid_netmask(prefixlen_str)
>         if not (0 <= prefixlen <= cls._max_prefixlen):
>             cls._report_invalid_netmask(prefixlen_str)
>         return prefixlen
>
> can be rewritten as:
>
>         if not prefixlen_str.isdigit():
>             cls._report_invalid_netmask(prefixlen_str)
>         try:
>             prefixlen = int(prefixlen_str.encode('ascii'))
>         except UnicodeEncodeError:
>             cls._report_invalid_netmask(prefixlen_str)
>         except ValueError:
>             cls._report_invalid_netmask(prefixlen_str)
>         if not (0 <= prefixlen <= cls._max_prefixlen):
>             cls._report_invalid_netmask(prefixlen_str)
>         return prefixlen
>

Yes.  But .isascii() will be much faster than try ...
.encode('ascii') ... except UnicodeEncodeError
on most Python implementations.
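
For reference, the two checks side by side, assuming str.isascii()
exists as proposed.  The function names are only for this example, and
no timings are claimed here; the sketch just shows that both give the
same answer while the encode version builds a throwaway bytes object
(or an exception) on every call.

```python
def is_ascii_via_encode(s: str) -> bool:
    # Allocates a bytes object on success, raises on failure.
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def is_ascii_via_method(s: str) -> bool:
    # No intermediate object; the implementation may answer in O(1).
    return s.isascii()
```

Both can be compared with timeit on the implementation of interest.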


> Another possibility -- adding support for a boolean argument in str.isdigit()
> and similar predicates that switches them to an ASCII-only mode. Such an option
> would be very useful for the str.strip(), str.split() and str.splitlines()
> methods. Currently they split using all Unicode whitespace and line
> separators, but there is a need to split only on ASCII whitespace and the line
> separators CR, LF and CRLF. In the case of str.strip() and str.split() you can
> just pass the string of whitespace characters, but there is no such option
> for str.splitlines().
>

It sounds like a good idea.  Maybe a keyword-only argument `ascii=False`?
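
As a rough sketch of what an ASCII-only splitlines could do, here is a
stand-alone helper.  The name splitlines_ascii is hypothetical; no
`ascii` argument exists on str.splitlines() today.

```python
import re

# CRLF must come first so it is consumed as a single separator.
_ASCII_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def splitlines_ascii(s: str) -> list:
    # Hypothetical helper: split only on CR, LF and CRLF.
    parts = _ASCII_LINE_BREAK.split(s)
    # Match str.splitlines(): a trailing line break does not produce a
    # final empty line, and the empty string yields an empty list.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

Unlike str.splitlines(), this leaves separators such as U+2028 inside
the line.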

But if we revert the addition of str.isascii() in Python 3.7, the same
keyword-only argument should be added to int(), float(),
decimal.Decimal(), fractions.Fraction(), etc...  That's a bit hard.

So I think adding .isascii() is beneficial even if all str.is***()
methods have an `ascii=False` flag.

