[Python-ideas] Adding str.isascii() ?

Wed Jan 31 05:49:42 EST 2018

26.01.18 10:42, INADA Naoki wrote:
> Currently, int(), str.isdigit(), str.isalnum(), etc. accept
> non-ASCII strings.
> 
>>>> s = "１２３"
>>>> s
> '１２３'
>>>> s.isdigit()
> True
>>>> print(ascii(s))
> '\uff11\uff12\uff13'
>>>> int(s)
> 123
> 
> But sometimes we want to accept only ASCII strings.  For example,
> the ipaddress module uses:
> 
> _DECIMAL_DIGITS = frozenset('0123456789')
> ...
> if _DECIMAL_DIGITS.issuperset(str):
> 
> ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Lib/ipaddress.py#L491-L494
> 
> If str had an isascii() method, this could be simpler:
> 
> `if s.isascii() and s.isdigit():`
> 
> I want to add it in Python 3.7 if there are no objections.

There were discussions about this. See for example 
https://bugs.python.org/issue18814.

In short, there are two considerations that prevented adding this feature:

1. This function can have constant time complexity in CPython (it only
has to check a single bit), but other implementations may only be able
to provide linear time complexity.

2. In many cases, right after getting the answer to this question we
encode the string to bytes (or decode bytes to a string). Thus the most
natural way to determine whether a string is ASCII-only is to try to
encode it to ASCII.

In addition, the bar for adding a new method to a basic type is high.
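
The second consideration can be illustrated with a minimal sketch. The
helper name is_ascii() is hypothetical and exists only for this example;
it tests for ASCII-only content by attempting the encoding:

```python
def is_ascii(text):
    """Return True if text contains only ASCII characters.

    Hypothetical helper for illustration; it is not a str method.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("123"))                   # ASCII digits
print(is_ascii("\uff11\uff12\uff13"))    # fullwidth digits, not ASCII
```

When the bytes are needed anyway, the result of encode() can be kept
instead of discarded, so the check costs nothing extra.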

The code in ipaddress

         if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str):
             cls._report_invalid_netmask(prefixlen_str)
         try:
             prefixlen = int(prefixlen_str)
         except ValueError:
             cls._report_invalid_netmask(prefixlen_str)
         if not (0 <= prefixlen <= cls._max_prefixlen):
             cls._report_invalid_netmask(prefixlen_str)
         return prefixlen

can be rewritten as:

         if not prefixlen_str.isdigit():
             cls._report_invalid_netmask(prefixlen_str)
         try:
             prefixlen = int(prefixlen_str.encode('ascii'))
         except UnicodeEncodeError:
             cls._report_invalid_netmask(prefixlen_str)
         except ValueError:
             cls._report_invalid_netmask(prefixlen_str)
         if not (0 <= prefixlen <= cls._max_prefixlen):
             cls._report_invalid_netmask(prefixlen_str)
         return prefixlen
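
To try the same idea outside of ipaddress, here is a standalone sketch.
The function name parse_prefixlen(), its max_prefixlen default and the
error messages are assumptions made for this example; the real code
reports errors through cls._report_invalid_netmask() instead:

```python
def parse_prefixlen(prefixlen_str, max_prefixlen=32):
    """Parse a prefix length, accepting ASCII decimal digits only.

    Hypothetical standalone variant; raises ValueError on bad input.
    """
    if not prefixlen_str.isdigit():
        raise ValueError(f"invalid prefix length: {prefixlen_str!r}")
    try:
        # int() also accepts bytes. Encoding first rejects characters
        # such as fullwidth digits, for which isdigit() returns True.
        prefixlen = int(prefixlen_str.encode("ascii"))
    except ValueError:
        # UnicodeEncodeError is a subclass of ValueError, so this one
        # clause covers both failure modes.
        raise ValueError(f"invalid prefix length: {prefixlen_str!r}") from None
    if not 0 <= prefixlen <= max_prefixlen:
        raise ValueError(f"prefix length out of range: {prefixlen_str!r}")
    return prefixlen


print(parse_prefixlen("24"))
```

Because UnicodeEncodeError derives from ValueError, the two except
clauses in the rewritten ipaddress code could also be merged into one.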

Another possibility is to add a boolean argument to str.isdigit() and
similar predicates that switches them to an ASCII-only mode. Such an
option would also be very useful for the str.strip(), str.split() and
str.splitlines() methods. Currently they split on all Unicode
whitespace and line separators, but there is a need to split only on
ASCII whitespace and the line separators CR, LF and CRLF. In the case of
str.strip() and str.split() you can just pass the string of whitespace
characters, but there is no such option for str.splitlines().
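
Until such an option exists, the ASCII-only behaviour can be
approximated by hand. The sketch below is only an illustration: the
helper names ascii_strip() and ascii_splitlines() are hypothetical, and
"ASCII whitespace" is taken to mean string.whitespace:

```python
import re
import string

# Matches only CR, LF and CRLF (CRLF first so it counts as one break).
_ASCII_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def ascii_strip(text):
    """Strip only ASCII whitespace (hypothetical helper)."""
    return text.strip(string.whitespace)


def ascii_splitlines(text):
    """Split only on CR, LF and CRLF (hypothetical helper)."""
    lines = _ASCII_LINE_BREAK.split(text)
    # str.splitlines() yields no trailing empty string when the text
    # is empty or ends with a line break; mirror that behaviour.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


sample = "a\u2028b\r\nc\n"
print(sample.splitlines())       # also splits on U+2028 LINE SEPARATOR
print(ascii_splitlines(sample))  # splits on CRLF and LF only
```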