[Python-ideas] isascii()/islatin1()/isbmp()
Terry Reedy
tjreedy at udel.edu
Sun Jul 1 04:05:21 CEST 2012
On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
> Why just ASCII, Latin1 and BMP (whatever that is, googling has not come
> up with anything relevant)?
BMP = Unicode Basic Multilingual Plane, the first 2**16 codepoints
http://unicode.org/roadmaps/bmp/
I presume the proposed isbmp would exclude surrogates in 16-bit
implementations, but that was not clearly defined.
> It seems to me that adding these three tests
The temptation for these three tests is that the info is already
available (at least for 2 of them) as an internal
implementation-specific C-level attribute in 3.3(+). No O(n) scan
needed. Someone could make a CPython 3.3-specific module available on PyPI.
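For comparison, here is a rough pure-Python sketch of what the three tests
would compute. Unlike the C-level attribute, it does need an O(n) scan. The
function names come from the subject line; the bodies are my assumption, and
in particular this isbmp does not exclude surrogates, since that point was
never defined.

```python
def isascii(s):
    # True if every code point is in the ASCII range (below 2**7).
    return all(ord(c) < 0x80 for c in s)


def islatin1(s):
    # True if every code point is in the Latin-1 range (below 2**8).
    return all(ord(c) < 0x100 for c in s)


def isbmp(s):
    # True if every code point is in the BMP (below 2**16).
    # Assumption: surrogate code points (U+D800..U+DFFF) are accepted.
    return all(ord(c) < 0x10000 for c in s)
```

All three return True for the empty string, because all() of an empty
iterable is True.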
> will open the doors to a steady stream of requests for new methods
> is<insert encoding name here>.
> I suggest that a better API would be a method that takes the name of an
> encoding (perhaps defaulting to 'ascii') and returns True|False:
>
> string.encodable(encoding='ascii') -> True|False
>
> Return True if string can be encoded using the named encoding, otherwise
> False.
But then one might as well try the encoding and check for an exception.
The point of the proposal is to avoid things like
try:
    body = text.encode('ascii')
    header = 'ascii'  # abbreviating here
except UnicodeEncodeError:
    try:
        body = text.encode('latin1')
        header = 'latin1'
    except UnicodeEncodeError:
        body = text.encode('utf-8')
        header = 'utf-8'
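Without the proposed tests, the nesting can at least be flattened into a
loop. This is only a sketch: encode_narrowest is a hypothetical helper name,
not an existing or proposed API, and it still pays for a failed encode
attempt on each encoding that does not fit.

```python
def encode_narrowest(text, encodings=("ascii", "latin-1", "utf-8")):
    # Hypothetical helper: try each encoding in order and return
    # (encoded bytes, encoding name) for the first one that succeeds.
    for name in encodings:
        try:
            return text.encode(name), name
        except UnicodeEncodeError:
            continue
    # Reached only if every candidate failed, e.g. a lone surrogate
    # cannot be encoded even as UTF-8.
    raise ValueError("no candidate encoding can represent the text")
```

Usage would be something like body, header = encode_narrowest(text).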
> One last pedantic issue: strings aren't ASCII or Latin1, etc., but
> Unicode. There is enough confusion between Unicode text strings and
> bytes without adding methods whose names blur the distinction slightly.
yes!
--
Terry Jan Reedy