On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
Why just ASCII, Latin1 and BMP (whatever that is, googling has not come up with anything relevant)?
BMP = Unicode Basic Multilingual Plane, the first 2**16 code points: http://unicode.org/roadmaps/bmp/
I presume the proposed isbmp would exclude surrogates in 16-bit implementations, but that was not clearly defined.
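For illustration, a BMP test can be sketched as a plain O(n) scan in pure Python. The function name `is_bmp` is hypothetical, and excluding the surrogate range U+D800..U+DFFF is only one reading of the proposal, which did not define that point. The sketch assumes Python 3.3 or later, where a str is indexed by code point.

```python
def is_bmp(text):
    """Return True if every code point of text lies in the BMP.

    Assumption: surrogate code points (U+D800..U+DFFF) are rejected,
    which the original proposal did not clearly specify.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            return False  # outside the first 2**16 code points
        if 0xD800 <= cp <= 0xDFFF:
            return False  # surrogate, excluded under the assumption above
    return True
```

An empty string passes trivially, since it contains no code point outside the range.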
It seems to me that adding these three tests
The temptation for these three tests is that the info is already available (at least for 2 of them) as an internal implementation-specific C-level attribute in 3.3(+). No O(n) scan needed. Someone could make a CPython 3.3-specific module available on PyPI.
will open the doors to a steady stream of requests for new methods is<insert encoding name here>.
I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise False.
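A minimal sketch of the suggested API, written as a free function because str has no such method. The name `encodable` and its default come from the suggestion above; the body is an assumption about how it might behave, and it simply attempts the encoding and discards the result.

```python
def encodable(text, encoding='ascii'):
    """Return True if text can be encoded with the named encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

Written this way, the check costs a full encode whose result is thrown away, so a caller who needs the bytes afterwards encodes twice.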
But then one might as well try the encoding and check for exception. The point of the proposal is to avoid things like

    try:
        body = text.encode('ascii')
        header = 'ascii'  # abbreviating here
    except UnicodeEncodeError:
        try:
            body = text.encode('latin1')
            header = 'latin1'
        except UnicodeEncodeError:
            body = text.encode('utf-8')
            header = 'utf-8'
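Without the proposed tests, the nesting can at least be flattened into a loop over candidate encodings. This is only a sketch: the helper name `encode_narrowest` and its default candidate list are hypothetical, and it still pays for failed encode attempts, which is the cost the proposal aims to avoid.

```python
def encode_narrowest(text, encodings=('ascii', 'latin-1', 'utf-8')):
    """Return (body, header) for the first candidate encoding that works.

    Candidates are tried in order, so list the narrowest encoding first.
    """
    for name in encodings:
        try:
            return text.encode(name), name
        except UnicodeEncodeError:
            continue  # fall through to the next, wider encoding
    raise ValueError('text is not encodable with any of %r' % (encodings,))
```

A caller would unpack the pair in one step, for example `body, header = encode_narrowest(text)`.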
One last pedantic issue: strings aren't ASCII or Latin1, etc., but Unicode. There is enough confusion between Unicode text strings and bytes without adding methods whose names blur the distinction slightly.
Yes!

-- 
Terry Jan Reedy