On 6/30/2012 11:21 PM, Steven D'Aprano wrote:
Terry Reedy wrote:
On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise False.
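For concreteness, the proposed behaviour can be approximated today with a plain function. This is only a sketch under assumptions: `encodable` is the hypothetical name from the proposal, written here as a free function because str has no such method, and it simply attempts the encoding and throws the result away, so it is no faster than calling encode directly.

```python
def encodable(text, encoding='ascii'):
    """Return True if text can be encoded with the named encoding.

    Hypothetical helper: str has no such method. This version just
    attempts the encoding and discards the result.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(encodable('spam'))            # True
print(encodable('café'))            # False
print(encodable('café', 'latin1'))  # True
print(encodable('€', 'latin1'))     # False
```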
But then one might as well try the encoding and check for exception. The point of the proposal is to avoid things like
    try:
        body = text.encode('ascii')
        header = 'ascii'  # abbreviating here
    except UnicodeEncodeError:
        try:
            body = text.encode('latin1')
            header = 'latin1'
        except UnicodeEncodeError:
            body = text.encode('utf-8')
            header = 'utf-8'
Right. And re-written with the hypothetical encodable method, you have the usual advantage of LBYL that it is slightly more concise:
    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        if text.encodable(encoding):
            body = text.encode(encoding)
            header = encoding
            break  # stop at the first encoding that works
But you are doing about half the work twice.
instead of:
    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        try:
            body = text.encode(encoding)
            header = encoding
            break  # stop at the first encoding that works
        except UnicodeEncodeError:
            pass
As far as expressibility goes, it is not much of an advantage. But:
- if there are optimizations that apply to some encodings but not others, the encodable method can take advantage of them without it being a promise of the language;
It would be an optimization limited to a couple of encodings with CPython. Using it for cross-version code would be something like the trap of depending on the CPython optimization of repeated string concatenation.
- it only adds a single string method (and presumably a single bytes method, decodable) rather than a plethora of methods;
Decodable would always require a scan of the bytes. Might as well just decode and look for UnicodeDecodeError.
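To illustrate the point, a bytes-side check written today ends up being exactly that: decode and catch the error. A minimal sketch, assuming a free function named `decodable` (hypothetical, mirroring the proposed method; bytes has no such method):

```python
def decodable(data, encoding='ascii'):
    """Return True if data can be decoded with the named encoding.

    Hypothetical helper: it does the full decode and discards the
    result, so it saves no work over calling decode directly.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(decodable(b'spam'))               # True
print(decodable(b'caf\xe9', 'utf-8'))   # False
print(decodable(b'caf\xe9', 'latin1'))  # True
```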
So, I don't care much either way for a LBYL test, but if there is a good use case for such a test,
My claim is that there is only a good use case if it is O(1), which would only be a few cases on CPython.
better for it to be a single method taking the encoding name rather than a multitude of tests, or exposing an implementation-specific value that the coder then has to interpret themselves.
-1 on isascii, islatin1, isbmp
I do not see much of any use for isbmp. Maybe I missed something in the original post.
-1 on exposing max_code_point
Jython and IronPython are stuck with the underlying platform implementations, which I believe are like the current semi-utf-16 narrow builds. So it would have to be a CPython-only attribute for now. (PyPy might consider adopting the new Unicode implementation someday too.)
+0.5 on encodable
encodable would indirectly expose max_code_point since it would only be really useful and likely used when max_code_point was available and applicable. In other words, s.encodable('latin1') is equivalent to s.max_code_point <= 255.

If isbmp *is* useful, I don't think it can be duplicated with .encodable. Python seems not to have a ucs-2 codec.

--
Terry Jan Reedy
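The relationship between a maximum code point and these tests can be sketched in pure Python. The names `max_code_point` and `is_bmp` below are hypothetical stand-ins for the attribute and method discussed in the thread, and the scan is O(n), so this shows only the logic, not the O(1) lookup an implementation might offer.

```python
def max_code_point(s):
    """Largest code point in s, or 0 for the empty string (O(n) scan)."""
    return max(map(ord, s), default=0)


def is_bmp(s):
    """True if every character of s lies in the Basic Multilingual Plane."""
    return max_code_point(s) <= 0xFFFF


def latin1_encodable(s):
    """True if s.encode('latin1') would succeed (tested by trying it)."""
    try:
        s.encode('latin1')
    except UnicodeEncodeError:
        return False
    return True


for s in ('', 'spam', 'café', '€uro', 'snake \U0001F40D'):
    # The code-point test and the encoding attempt agree for latin1.
    assert (max_code_point(s) <= 0xFF) == latin1_encodable(s)
    print(repr(s), hex(max_code_point(s)), is_bmp(s))
```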