Re: [Python-ideas] isascii()/islatin1()/isbmp()

30 Jun 2012

On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
...
Why just ASCII, Latin1 and BMP (whatever that is, googling has not come
up with anything relevant)?
BMP = Unicode Basic Multilingual Plane, the first 2**16 codepoints
http://unicode.org/roadmaps/bmp/

I presume the proposed isbmp would exclude surrogates in 16-bit
implementations, but that was not clearly defined.
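
For concreteness, here is a minimal pure-Python sketch of what the three tests could mean in terms of code point ranges. The function names, the max_codepoint helper, and the exclude_surrogates flag are hypothetical illustrations, not part of the proposal as posted, and this version does a plain O(n) scan.

```python
def max_codepoint(s):
    """Return the largest code point in s, or -1 for the empty string."""
    return max(map(ord, s), default=-1)


def isascii(s):
    # ASCII covers code points 0x00..0x7F.
    return max_codepoint(s) < 0x80


def islatin1(s):
    # Latin-1 covers code points 0x00..0xFF.
    return max_codepoint(s) < 0x100


def isbmp(s, exclude_surrogates=False):
    # The BMP covers code points 0x0000..0xFFFF.
    if max_codepoint(s) > 0xFFFF:
        return False
    if exclude_surrogates:
        # Hypothetical option: treat surrogate code points as non-BMP text.
        return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)
    return True
```

Whether surrogates count is exactly the undefined part, so the sketch leaves it as a switch rather than picking an answer.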
...
It seems to me that adding these three tests
The temptation for these three tests is that the info is already
available (at least for 2 of them) as an internal
implementation-specific C-level attribute in 3.3(+). No O(n) scan
needed. Someone could make a CPython 3.3-specific module available on PyPI.
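
As a rough illustration that the storage width is already known to the interpreter, the following sketch infers it from object sizes. It assumes CPython 3.3 or later; bytes_per_char is a hypothetical helper name, and sys.getsizeof results are an implementation detail, so this is a demonstration rather than a supported API.

```python
import sys


def bytes_per_char(ch):
    """Estimate how many bytes CPython uses per character for text like ch.

    ch is assumed to be a single character. Two freshly built strings that
    differ by one character are compared, so the fixed header size cancels.
    """
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


for sample in ('a', '\xe9', '\u0100', '\U00010000'):
    print(hex(ord(sample)), bytes_per_char(sample))
```

Width alone does not separate ASCII from the rest of Latin-1, since both use one byte per character; that distinction lives in a separate internal flag.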
...
will open the doors to a steady stream of requests for new methods
is<insert encoding name here>.
...
I suggest that a better API would be a method that takes the name of an
encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise
False.
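
A minimal sketch of that suggested API, written as a free function because str cannot be extended from Python code. The name encodable comes from the suggestion above; the implementation is only an assumption about how it might work, and it performs the full encode and throws the result away.

```python
def encodable(string, encoding='ascii'):
    """Return True if string can be encoded with the named encoding."""
    try:
        string.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(encodable('abc'))              # plain ASCII text
print(encodable('\xe9', 'latin1'))   # e-acute fits in Latin-1
print(encodable('\u20ac', 'latin1')) # the euro sign does not
```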
But then one might as well try the encoding and check for an exception.
The point of the proposal is to avoid things like

try:
    body = text.encode('ascii')
    header = 'ascii'  # abbreviating here
except UnicodeEncodeError:
    try:
        body = text.encode('latin1')
        header = 'latin1'
    except UnicodeEncodeError:
        body = text.encode('utf-8')
        header = 'utf-8'
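
The nesting can be flattened with a loop, though each failed attempt still costs an encode. This is a sketch only; encode_narrowest and its encodings parameter are hypothetical names, not an existing API.

```python
def encode_narrowest(text, encodings=('ascii', 'latin1', 'utf-8')):
    """Encode text with the first encoding that works.

    Returns a (body, header) pair. The last encoding is tried without a
    handler, so a failure there propagates just as in the nested version.
    """
    *narrow, last = encodings
    for name in narrow:
        try:
            return text.encode(name), name
        except UnicodeEncodeError:
            pass  # fall through to the next, wider encoding
    return text.encode(last), last


body, header = encode_narrowest('caf\xe9')
print(header, body)
```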
...
One last pedantic issue: strings aren't ASCII or Latin1, etc., but
Unicode. There is enough confusion between Unicode text strings and
bytes without adding methods whose names blur the distinction slightly.
yes!

-- 
Terry Jan Reedy
