[Python-ideas] Adding str.isascii() ?
INADA Naoki
songofacandy at gmail.com
Wed Jan 31 06:18:42 EST 2018
Hm, it seems I was in too much of a hurry to implement it...
>
> There were discussions about this. See for example
> https://bugs.python.org/issue18814.
>
> In short, there are two considerations that prevented adding this feature:
>
> 1. This function can have constant computational complexity in CPython
> (just check a single bit), but other implementations may provide only
> linear computational complexity.
>
Yes. There is no O(1) guarantee for .isascii().
But I expect the UTF-8 based string implementation that PyPy will have can
achieve O(1): just test len(s) == __internal_utf8_len(s).
I think if *some* implementations can achieve O(1), it's beneficial
to implement.
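To illustrate the invariant behind that test, here is a minimal sketch in
pure Python. The function name is hypothetical, and encoding the string
here is O(n); the point is only that an implementation which already
stores the UTF-8 byte length could do the same comparison in O(1).

```python
def isascii_via_utf8_len(s: str) -> bool:
    """Return True if every code point in s is below 128.

    A code point takes exactly one byte in UTF-8 only when it is ASCII,
    so the string is ASCII-only iff its UTF-8 length equals len(s).
    NOTE: encode() is O(n); this only demonstrates the invariant that a
    UTF-8 based implementation could check against a cached length.
    """
    # "surrogatepass" keeps lone surrogates from raising; they encode
    # to three bytes, so they are correctly reported as non-ASCII.
    return len(s) == len(s.encode("utf-8", errors="surrogatepass"))
```

On Python 3.7 or later this should agree with str.isascii() for any input.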
> 2. In many cases, just after getting the answer to this question we encode
> the string to bytes (or decode bytes to string). Thus the most natural way
> to determine whether the string is ASCII-only is trying to encode it to ASCII.
>
Yes. But ASCII is special.
Someone may want to check for ASCII before passing a string to int(),
float(), decimal.Decimal(), etc...
But I don't think there is a real use case for encodings other than ASCII.
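As a small sketch of that use case: int() accepts any Unicode decimal
digits, not only 0-9, so a caller who wants ASCII-only input has to check
it separately. The helper name below is hypothetical, and it assumes
Python 3.7+ where str.isascii() exists.

```python
def parse_ascii_int(text: str) -> int:
    """Parse text as an integer, rejecting any non-ASCII input."""
    if not text.isascii():
        raise ValueError(f"non-ASCII input: {text!r}")
    return int(text)


# ARABIC-INDIC DIGITS ONE, TWO, THREE: accepted by int() itself.
arabic_indic = "\u0661\u0662\u0663"
plain_result = int(arabic_indic)  # 123

try:
    parse_ascii_int(arabic_indic)
    rejected = False
except ValueError:
    rejected = True  # the ASCII check refuses the same string
```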
> And adding a new method to the basic type has a high bar.
>
Agree.
> The code in ipaddress
>
> if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str):
>     cls._report_invalid_netmask(prefixlen_str)
> try:
>     prefixlen = int(prefixlen_str)
> except ValueError:
>     cls._report_invalid_netmask(prefixlen_str)
> if not (0 <= prefixlen <= cls._max_prefixlen):
>     cls._report_invalid_netmask(prefixlen_str)
> return prefixlen
>
> can be rewritten as:
>
> if not prefixlen_str.isdigit():
>     cls._report_invalid_netmask(prefixlen_str)
> try:
>     prefixlen = int(prefixlen_str.encode('ascii'))
> except UnicodeEncodeError:
>     cls._report_invalid_netmask(prefixlen_str)
> except ValueError:
>     cls._report_invalid_netmask(prefixlen_str)
> if not (0 <= prefixlen <= cls._max_prefixlen):
>     cls._report_invalid_netmask(prefixlen_str)
> return prefixlen
>
Yes. But .isascii() will be much faster than try ...
.encode('ascii') ... except UnicodeEncodeError
on most Python implementations.
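For comparison, a rough sketch of how the same check might look with
.isascii(). This is a standalone function, not the actual ipaddress code:
the name parse_prefixlen and the max_prefixlen parameter are hypothetical,
and it raises ValueError where ipaddress calls
cls._report_invalid_netmask().

```python
def parse_prefixlen(prefixlen_str: str, max_prefixlen: int = 32) -> int:
    """Parse a prefix length written with ASCII digits only."""
    # isdigit() alone also accepts non-ASCII digits (for example
    # Arabic-Indic digits or SUPERSCRIPT TWO); isascii() narrows
    # the accepted set to 0-9.
    if not (prefixlen_str.isascii() and prefixlen_str.isdigit()):
        raise ValueError(f"invalid netmask: {prefixlen_str!r}")
    prefixlen = int(prefixlen_str)
    if not (0 <= prefixlen <= max_prefixlen):
        raise ValueError(f"invalid netmask: {prefixlen_str!r}")
    return prefixlen
```

With both checks passing, int() cannot fail on the string, so no
try/except is needed around it.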
> Another possibility -- adding support for a boolean argument in str.isdigit()
> and similar predicates that switches them to ASCII-only mode. Such an option
> would be very useful for the str.strip(), str.split() and str.splitlines()
> methods. Currently they split using all Unicode whitespace and line
> separators, but there is a need to split only on ASCII whitespace and the line
> separators CR, LF and CRLF. In the case of str.strip() and str.split() you can
> just pass the string of whitespace characters, but there is no such option
> for str.splitlines().
>
It sounds like a good idea. Maybe a keyword-only argument `ascii=False`?
But if we revert the addition of str.isascii() from Python 3.7, the same
keyword-only argument should be added to int(), float(),
decimal.Decimal(), fractions.Fraction(), etc... That's a bit hard.
So I think adding .isascii() is beneficial even if all str.is***()
methods have an `ascii=False` flag.
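On the str.splitlines() point, until something like that flag exists, a
workaround along these lines is possible. This is only a sketch: the
function name is hypothetical, and it treats just CR, LF and CRLF as line
separators, as described in the quoted message.

```python
import re

# CRLF must come first so it is consumed as a single separator.
_ASCII_NEWLINE = re.compile(r"\r\n|\r|\n")


def ascii_splitlines(text: str) -> list:
    """Split text on CR, LF and CRLF only, ignoring other separators."""
    lines = _ASCII_NEWLINE.split(text)
    # Like str.splitlines(), do not report an empty last line when the
    # text is empty or ends with a line separator.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Unlike str.splitlines(), this leaves characters such as U+2028 (LINE
SEPARATOR) and form feed inside the line.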