[Python-ideas] isascii()/islatin1()/isbmp()

Sun Jul 1 07:48:14 CEST 2012

On 6/30/2012 11:21 PM, Steven D'Aprano wrote:
> Terry Reedy wrote:
>> On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
>
>>> I suggest that a better API would be a method that takes the name of an
>>> encoding (perhaps defaulting to 'ascii') and returns True|False:
>>>
>>> string.encodable(encoding='ascii') -> True|False
>>>
>>> Return True if string can be encoded using the named encoding, otherwise
>>> False.
>>
>> But then one might as well try the encoding and check for exception.
>> The point of the proposal is to avoid things like
>>
>> try:
>>   body = text.encode('ascii')
>>   header = 'ascii'  #abbreviating here
>> except UnicodeEncodeError:
>>   try:
>>     body = text.encode('latin1')
>>     header = 'latin1'
>>   except UnicodeEncodeError:
>>     body = text.encode('utf-8')
>>     header = 'utf-8'
>
> Right. And re-written with the hypothetical encodable method, you have
> the usual advantage of LBYL that it is slightly more concise:
>
> body = header = None
> for encoding in ('ascii', 'latin1', 'utf-8'):
>      if text.encodable(encoding):
>          body = text.encode(encoding)
>          header = encoding
>          break

But you are doing about half the work twice.

> instead of:
>
> body = header = None
> for encoding in ('ascii', 'latin1', 'utf-8'):
>      try:
>          body = text.encode(encoding)
>          header = encoding
>          break
>      except UnicodeEncodeError:
>          pass

> As far as expressibility goes, it is not much of an advantage. But:
>
> - if there are optimizations that apply to some encodings but not others,
>    the encodable method can take advantage of them without it being a
>    promise of the language;

It would be an optimization limited to a couple of encodings with 
CPython. Using it for cross-version code would be something like the 
trap of depending on the CPython optimization of repeated string 
concatenation.

> - it only adds a single string method (and presumably a single bytes
>    method, decodable) rather than a plethora of methods;

Decodable would always require a scan of the bytes. Might as well just 
decode and look for UnicodeDecodeError.
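
A minimal sketch of the decode-and-catch approach, for illustration only. 
The helper name try_decode is hypothetical and not part of any proposal; 
it wraps bytes.decode so that the caller gets either the decoded text or 
None:

```python
def try_decode(data, encoding="ascii"):
    """Return the decoded text, or None if data is not valid in encoding.

    Hypothetical helper: the scan that a 'decodable' test would need is
    the same scan that decoding performs, so we simply decode once.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ascii_text = try_decode(b"abc")                   # valid ASCII
not_ascii = try_decode(b"caf\xc3\xa9")            # bytes above 0x7F, so None
utf8_text = try_decode(b"caf\xc3\xa9", "utf-8")   # the same bytes are valid UTF-8
```

The caller gets the decoded result from the same pass that would have 
answered the yes/no question.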

> So, I don't care much either way for a LBYL test, but if there is a good
> use case for such a test,

My claim is that there is only a good use case if it is O(1), which 
would only be a few cases on CPython.

> better for it to be a single method taking the
> encoding name rather than a multitude of tests, or exposing an
> implementation-specific value that the coder then has to interpret
> themselves.
>
> -1 on isascii, islatin1, isbmp

I do not see much use for isbmp. Maybe I missed something in the 
original post.

> -1 on exposing max_code_point

Jython and IronPython are stuck with the underlying platform 
implementations, which I believe are like the current semi-utf-16 narrow 
builds. So it would have to be a CPython-only attribute for now. (PyPy 
might consider adopting the new Unicode implementation someday too.)

> +0.5 on encodable

encodable would indirectly expose max_code_point since it would only be 
really useful and likely used when max_code_point was available and 
applicable. In other words, s.encodable('latin1') is equivalent to 
s.max_code_point <= 255.
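
A rough sketch of that correspondence, assuming max_code_point means the 
largest code point actually present in the string. Both functions below 
are hypothetical stand-ins written for this example (neither exists on 
str), and both scan the string, so they only illustrate the logic, not 
the O(1) behavior:

```python
def max_code_point(s):
    # Hypothetical stand-in for the proposed attribute; an O(n) scan here.
    return max(map(ord, s), default=0)


def encodable(s, encoding):
    # Hypothetical stand-in for the proposed method: try it and see.
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = ["", "abc", "caf\xe9", "\u0100", "\U0001F600"]
for s in samples:
    # latin1 covers code points 0..255, ascii covers 0..127.
    assert encodable(s, "latin1") == (max_code_point(s) <= 0xFF)
    assert encodable(s, "ascii") == (max_code_point(s) <= 0x7F)
```

Note that the comparison is "at most 255", since an all-ASCII string is 
also encodable as latin1.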

If isbmp *is* useful, I don't think it can be duplicated with 
.encodable. Python seems not to have a ucs-2 codec.
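
Without such a codec, a BMP test would have to look at the code points 
directly. A minimal sketch, assuming an interpreter where strings are 
indexed by code point (on a narrow build every element would be at most 
0xFFFF and the test would be vacuous); the function name simply mirrors 
the proposed method and is not an existing API:

```python
def isbmp(s):
    """Return True if every code point in s lies in the Basic Multilingual Plane."""
    # Hypothetical function for illustration; a linear scan, not O(1).
    return all(ord(ch) <= 0xFFFF for ch in s)


in_bmp = isbmp("abc\uffff")          # highest BMP code point is still inside
outside_bmp = isbmp("a\U00010000")   # first supplementary code point is outside
```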

-- 
Terry Jan Reedy