[Python-ideas] isascii()/islatin1()/isbmp()
Terry Reedy
tjreedy at udel.edu
Sun Jul 1 07:48:14 CEST 2012
On 6/30/2012 11:21 PM, Steven D'Aprano wrote:
> Terry Reedy wrote:
>> On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
>
>>> I suggest that a better API would be a method that takes the name of an
>>> encoding (perhaps defaulting to 'ascii') and returns True|False:
>>>
>>> string.encodable(encoding='ascii') -> True|False
>>>
>>> Return True if string can be encoded using the named encoding, otherwise
>>> False.
>>
>> But then one might as well try the encoding and check for exception.
>> The point of the proposal is to avoid things like
>>
>> try:
>>     body = text.encode('ascii')
>>     header = 'ascii'  # abbreviating here
>> except UnicodeEncodeError:
>>     try:
>>         body = text.encode('latin1')
>>         header = 'latin1'
>>     except UnicodeEncodeError:
>>         body = text.encode('utf-8')
>>         header = 'utf-8'
>
> Right. And re-written with the hypothetical encodable method, you have
> the usual advantage of LBYL that it is slightly more concise:
>
> body = header = None
> for encoding in ('ascii', 'latin1', 'utf-8'):
>     if text.encodable(encoding):
>         body = text.encode(encoding)
>         header = encoding
>         break

But you are doing about half the work twice.

> instead of:
>
> body = header = None
> for encoding in ('ascii', 'latin1', 'utf-8'):
>     try:
>         body = text.encode(encoding)
>         header = encoding
>         break
>     except UnicodeEncodeError:
>         pass
>
> As far as expressibility goes, it is not much of an advantage. But:
>
> - if there are optimizations that apply to some encodings but not others,
> the encodable method can take advantage of them without it being a
> promise of the language;

It would be an optimization limited to a couple of encodings with
CPython. Using it for cross-version code would be something like the
trap of depending on the CPython optimization of repeated string
concatenation.

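To make the cost concrete, here is a rough sketch of what a portable
fallback for the proposed test would have to look like when no
implementation-specific fast path is available. Both names, encodable()
and choose_encoding(), are made up for this illustration; neither is an
existing str method. Without an O(1) shortcut the test is just a trial
encode, so calling it before text.encode() repeats the work, while the
try/except loop encodes once per candidate.

```python
def encodable(text, encoding='ascii'):
    """Hypothetical helper: True if text can be encoded with encoding.

    With no fast path this has to perform the full encode and then
    throw the result away.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def choose_encoding(text, candidates=('ascii', 'latin1', 'utf-8')):
    """Return (body, header) for the first candidate that can encode text."""
    for encoding in candidates:
        try:
            # One encode per candidate; the successful result is kept.
            return text.encode(encoding), encoding
        except UnicodeEncodeError:
            continue
    raise ValueError('no candidate encoding can represent the text')
```

For example, choose_encoding('abc') picks 'ascii', and a string
containing U+00E9 falls through to 'latin1'.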
> - it only adds a single string method (and presumably a single bytes
> method, decodable) rather than a plethora of methods;

Decodable would always require a scan of the bytes. Might as well just
decode and look for UnicodeDecodeError.

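A minimal sketch of that decode-and-catch approach, assuming a fixed
list of candidate codecs. The function name decode_first() and the
candidate order are assumptions for this example only.

```python
def decode_first(data, candidates=('ascii', 'utf-8', 'latin1')):
    """Return (text, encoding) for the first candidate that decodes data.

    Every attempt scans the bytes, so a separate "decodable" test
    would save nothing over simply trying the decode.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding can decode the data')
```

Since latin1 maps every byte value to a code point, it accepts any
input, so it only makes sense as the last candidate.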
> So, I don't care much either way for a LBYL test, but if there is a good
> use case for such a test,

My claim is that there is only a good use case if it is O(1), which
would only be a few cases on CPython.

> better for it to be a single method taking the
> encoding name rather than a multitude of tests, or exposing an
> implementation-specific value that the coder then has to interpret
> themselves.
>
> -1 on isascii, islatin1, isbmp

I do not see much of any use for isbmp. Maybe I missed something in the
original post.

> -1 on exposing max_code_point

Jython and IronPython are stuck with the underlying platform
implementations, which I believe are like the current semi-utf-16 narrow
builds. So it would have to be a CPython-only attribute for now. (PyPy
might consider adopting the new Unicode implementation someday too.)

> +0.5 on encodable

encodable would indirectly expose max_code_point, since it would only be
really useful, and likely used, when max_code_point was available and
applicable. In other words, s.encodable('latin1') is equivalent to
s.max_code_point <= 255.

If isbmp *is* useful, I don't think it can be duplicated with
.encodable. Python seems not to have a ucs-2 codec.

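To illustrate the relationship, here is an O(n) stand-in for the proposed
attribute, written as a plain function. The names max_code_point(),
is_ascii(), is_latin1() and is_bmp() are hypothetical and only mirror the
names used in this thread. The sketch assumes a build where ord() over a
string yields full code points (not a narrow build) and a Python version
whose max() accepts the default argument.

```python
ASCII_MAX = 0x7F     # highest code point the ascii codec can encode
LATIN1_MAX = 0xFF    # highest code point the latin1 codec can encode
BMP_MAX = 0xFFFF     # last code point of the Basic Multilingual Plane


def max_code_point(s):
    """Scan the string for its largest code point (0 for an empty string)."""
    return max(map(ord, s), default=0)


def is_ascii(s):
    return max_code_point(s) <= ASCII_MAX


def is_latin1(s):
    return max_code_point(s) <= LATIN1_MAX


def is_bmp(s):
    # No codec name expresses this test, so an encodable(encoding)
    # method could not answer it.
    return max_code_point(s) <= BMP_MAX
```

Each test is a threshold comparison on one number, which is why it could
be O(1) only where the implementation already stores that number.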
--
Terry Jan Reedy