[Python-ideas] isascii()/islatin1()/isbmp()

Sun Jul 1 04:05:21 CEST 2012

On 6/30/2012 8:59 PM, Steven D'Aprano wrote:

> Why just ASCII, Latin1 and BMP (whatever that is, googling has not come
> up with anything relevant)?

BMP = Unicode Basic Multilingual Plane, the first 2**16 code points
http://unicode.org/roadmaps/bmp/

I presume the proposed isbmp would exclude surrogates in 16-bit
implementations, but that was not clearly defined.
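
To make the open question concrete, here is a rough pure-Python sketch of
what an isbmp test might look like. The function name comes from the
proposal; the allow_surrogates parameter is my own hypothetical addition,
there only to show the two possible answers to the surrogate question. It
is not part of anything proposed.

```python
def isbmp(s, allow_surrogates=False):
    """Return True if every code point in s is in the BMP (U+0000..U+FFFF).

    allow_surrogates is a hypothetical switch, not part of the proposal:
    by default, code points in the surrogate range U+D800..U+DFFF are
    rejected, which is one possible reading of the undefined behavior.
    """
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Outside the first 2**16 code points.
            return False
        if not allow_surrogates and 0xD800 <= cp <= 0xDFFF:
            # A surrogate code point; excluded under the default reading.
            return False
    return True
```

Note that this is an O(n) scan over the string, which is exactly the cost
the proposal hopes to avoid.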

> It seems to me that adding these three tests

The temptation for these three tests is that the info is already
available (at least for 2 of them) as an internal,
implementation-specific C-level attribute in 3.3(+), so no O(n) scan
is needed. Someone could make a CPython 3.3-specific module available
on PyPI.
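
For comparison, a portable version has to pay for the scan. The following
is only a sketch of that fallback, assuming the three tests are defined
purely by the largest code point in the string (surrogates ignored). The
name code_point_class and its return values are hypothetical.

```python
def code_point_class(s):
    """Classify s by its largest code point, using an O(n) scan.

    Hypothetical helper: returns 'ascii', 'latin1', 'bmp', or 'astral'.
    An empty string is treated as 'ascii'.
    """
    top = max(map(ord, s), default=0)  # this is the O(n) part
    if top < 0x80:
        return 'ascii'
    if top < 0x100:
        return 'latin1'
    if top < 0x10000:
        return 'bmp'
    return 'astral'
```

A CPython-specific module could presumably answer the same question from
the string's internal representation without walking it.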

> will open the doors to a steady stream of requests for new methods
> is<insert encoding name here>.

> I suggest that a better API would be a method that takes the name of an
> encoding (perhaps defaulting to 'ascii') and returns True|False:
>
> string.encodable(encoding='ascii') -> True|False
>
> Return True if string can be encoded using the named encoding, otherwise
> False.

But then one might as well try the encoding and check for an exception.
The point of the proposal is to avoid things like

try:
    body = text.encode('ascii')
    header = 'ascii'  # abbreviating here
except UnicodeEncodeError:
    try:
        body = text.encode('latin1')
        header = 'latin1'
    except UnicodeEncodeError:
        body = text.encode('utf-8')
        header = 'utf-8'
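
With tests along the lines proposed, the nested try/except could become a
flat chain. Here is a sketch of the same selection logic; pick_encoding is
a hypothetical name, and since the proposed methods do not exist it falls
back on a scan for the largest code point.

```python
def pick_encoding(text):
    """Return (body, header) using the narrowest of ascii/latin1/utf-8.

    Hypothetical helper illustrating the flat version of the nested
    try/except above; it scans for the largest code point rather than
    relying on any str method from the proposal.
    """
    top = max(map(ord, text), default=0)
    if top < 0x80:
        header = 'ascii'
    elif top < 0x100:
        header = 'latin1'
    else:
        header = 'utf-8'
    # Encode exactly once, with the encoding chosen above.
    return text.encode(header), header
```

For example, pick_encoding('abc') gives (b'abc', 'ascii'), and a string
containing U+00E9 is encoded as latin1.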

> One last pedantic issue: strings aren't ASCII or Latin1, etc., but
> Unicode. There is enough confusion between Unicode text strings and
> bytes without adding methods whose names blur the distinction slightly.

Yes!

-- 
Terry Jan Reedy