As shown in issue #15016 [1], there are use cases where it is useful to determine whether a string can be encoded in ASCII or Latin-1. When working with Tk or Windows console applications, it can be useful to determine whether a string can be encoded in UCS-2. The C API provides an interface for this, but it is not available at the Python level.

I propose to add new methods to the str class: isascii(), islatin1() and isbmp() (in addition to such methods as isalpha() or isdigit()). The implementation will be trivial.

Pro: The current trick of trying to encode has O(n) complexity and the overhead of exception raising/catching.

Contra: In most cases, after determining the character range we still need to encode the string with the appropriate encoding. New methods will further complicate the already overloaded str class.

Objections?

[1] http://bugs.python.org/issue15016
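As a rough illustration of what these checks amount to in pure Python today (standalone hypothetical helpers, not the proposed str methods, and they necessarily scan the whole string):

    # Pure-Python approximations of the proposed str methods (hypothetical
    # helpers, not an existing API).  These scan the whole string, so they
    # are O(n); the proposal's point is that CPython 3.3+ could answer in
    # O(1) from its internal representation.

    def isascii(s):
        """True if every code point fits in 7 bits (ASCII)."""
        return all(ord(c) <= 0x7F for c in s)

    def islatin1(s):
        """True if every code point fits in 8 bits (Latin-1)."""
        return all(ord(c) <= 0xFF for c in s)

    def isbmp(s):
        """True if every code point lies in the Basic Multilingual Plane."""
        return all(ord(c) <= 0xFFFF for c in s)

    assert isascii("spam") and islatin1("café") and not islatin1("€")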
On Sun, Jul 1, 2012 at 2:03 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
As shown in issue #15016 [1], there are use cases where it is useful to determine whether a string can be encoded in ASCII or Latin-1. When working with Tk or Windows console applications, it can be useful to determine whether a string can be encoded in UCS-2. The C API provides an interface for this, but it is not available at the Python level.

I propose to add new methods to the str class: isascii(), islatin1() and isbmp() (in addition to such methods as isalpha() or isdigit()). The implementation will be trivial.
Why not just expose max_code_point directly instead of adding three new methods?

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Why not just expose max_code_point directly instead of adding three new methods?
+1

I accidentally sent my reply directly to Serhiy, but basically I said that I could really use this in my search library when I'm trying to write efficient compressed indexes; all I need is to know the maximum char code (or the number of bytes per char). I've been meaning to ask about this for a while.

Matt
On Sun, 1 Jul 2012 02:14:23 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Sun, Jul 1, 2012 at 2:03 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
As shown in issue #15016 [1], there are use cases where it is useful to determine whether a string can be encoded in ASCII or Latin-1. When working with Tk or Windows console applications, it can be useful to determine whether a string can be encoded in UCS-2. The C API provides an interface for this, but it is not available at the Python level.

I propose to add new methods to the str class: isascii(), islatin1() and isbmp() (in addition to such methods as isalpha() or isdigit()). The implementation will be trivial.
Why not just expose max_code_point directly instead of adding three new methods?
Because it's really an implementation detail. We don't want to carry around such a legacy. Besides, we don't know the max code point for sure, only an upper bound of it (and, implicitly, also a lower bound).

So while I'm -0 on the methods (calling encode() is just as simple), I'm -1 on max_code_point.

Regards

Antoine.
Well, there would be constants. What about adding both the methods and max_code_point, and using it as an excuse to explain again that encodings exist, and to point to the encodings docs?

--
Be prepared to have your predictions come true
On 30.06.12 19:43, Antoine Pitrou wrote:
Because it's really an implementation detail. We don't want to carry around such a legacy. Besides, we don't know the max code point for sure, only an upper bound of it (and, implicitly, also a lower bound).
So while I'm -0 on the methods (calling encode() is just as simple), I'm -1 on max_code_point.
Thanks, Antoine. This objection also just occurred to me. We cannot guarantee that isascii() will always be O(1). Several enchantments have already been rejected for this reason. If an extension author wants to take advantage of CPython, he should use CPython's C API.
Nick Coghlan <ncoghlan@...> writes:
Why not just expose max_code_point directly instead of adding three new methods?
All of these proposals rely on the *current* implementation of CPython unicode (at least for their efficiency). Let's not pollute the language with features that will be bad on other implementations, or even on ours in the future.

Regards,
Benjamin
Serhiy Storchaka wrote:
As shown in issue #15016 [1], there are use cases where it is useful to determine whether a string can be encoded in ASCII or Latin-1. When working with Tk or Windows console applications, it can be useful to determine whether a string can be encoded in UCS-2. The C API provides an interface for this, but it is not available at the Python level.

I propose to add new methods to the str class: isascii(), islatin1() and isbmp() (in addition to such methods as isalpha() or isdigit()). The implementation will be trivial.

Pro: The current trick of trying to encode has O(n) complexity and the overhead of exception raising/catching.
Are you suggesting that isascii and friends would be *better* than O(n)? How can that work -- wouldn't it have to scan the string and look at each character?

Why just ASCII, Latin-1 and BMP (whatever that is, googling has not come up with anything relevant)? It seems to me that adding these three tests will open the doors to a steady stream of requests for new methods is<insert encoding name here>.

I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:

    string.encodable(encoding='ascii') -> True|False
        Return True if string can be encoded using the named encoding,
        otherwise False.

One last pedantic issue: strings aren't ASCII or Latin-1, etc., but Unicode. There is enough confusion between Unicode text strings and bytes without adding methods whose names blur the distinction slightly.

--
Steven
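For illustration, a minimal sketch of Steven's proposed encodable() as a standalone function, assuming the obvious try/except implementation (the name and default follow his suggestion; nothing like it exists on str today):

    # Hypothetical helper mirroring the proposed str.encodable() API.
    # This is just the obvious try/except implementation, so it is O(n)
    # and gains nothing over calling encode() directly; the proposal is
    # about leaving implementations room to answer faster.

    def encodable(string, encoding='ascii'):
        """Return True if string can be encoded using the named encoding."""
        try:
            string.encode(encoding)
        except UnicodeEncodeError:
            return False
        return True

    assert encodable("spam")                # pure ASCII
    assert encodable("café", "latin-1")     # fits in Latin-1
    assert not encodable("héllo")           # not ASCII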
On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
Why just ASCII, Latin1 and BMP (whatever that is, googling has not come up with anything relevant)?
BMP = Unicode Basic Multilingual Plane, the first 2**16 code points: http://unicode.org/roadmaps/bmp/

I presume the proposed isbmp would exclude surrogates in 16-bit implementations, but that was not clearly defined.
It seems to me that adding these three tests
The temptation for these three tests is that the info is already available (at least for 2 of them) as an internal implementation-specific C-level attribute in 3.3(+). No O(n) scan needed. Someone could make a CPython-3.3-specific module available on PyPI.
will open the doors to a steady stream of requests for new methods is<insert encoding name here>.
I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise False.
But then one might as well try the encoding and check for exception. The point of the proposal is to avoid things like

    try:
        body = text.encode('ascii')
        header = 'ascii'  # abbreviating here
    except UnicodeEncodeError:
        try:
            body = text.encode('latin1')
            header = 'latin1'
        except UnicodeEncodeError:
            body = text.encode('utf-8')
            header = 'utf-8'
One last pedantic issue: strings aren't ASCII or Latin1, etc., but Unicode. There is enough confusion between Unicode text strings and bytes without adding methods whose names blur the distinction slightly.
yes! -- Terry Jan Reedy
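For reference, a rough pure-Python sketch of an isbmp-style check that excludes surrogates, as Terry suggests a 16-bit implementation might (the function is hypothetical; the surrogate range U+D800..U+DFFF is standard Unicode):

    # Hypothetical BMP check: code points <= U+FFFF, excluding the
    # surrogate range U+D800..U+DFFF, which cannot appear as real
    # characters in well-formed text.

    def isbmp_strict(s):
        return all(
            cp <= 0xFFFF and not (0xD800 <= cp <= 0xDFFF)
            for cp in map(ord, s)
        )

    assert isbmp_strict("abc\u20ac")        # U+20AC (euro sign) is in the BMP
    assert not isbmp_strict("\U0001F600")   # emoji lie outside the BMP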
Terry Reedy wrote:
On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise False.
But then one might as well try the encoding and check for exception. The point of the proposal is to avoid things like
    try:
        body = text.encode('ascii')
        header = 'ascii'  # abbreviating here
    except UnicodeEncodeError:
        try:
            body = text.encode('latin1')
            header = 'latin1'
        except UnicodeEncodeError:
            body = text.encode('utf-8')
            header = 'utf-8'
Right. And re-written with the hypothetical encodable method, you have the usual advantage of LBYL that it is slightly more concise:

    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        if text.encodable(encoding):
            body = text.encode(encoding)
            header = encoding
            break

instead of:

    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        try:
            body = text.encode(encoding)
            header = encoding
            break
        except UnicodeEncodeError:
            pass

As far as expressibility goes, it is not much of an advantage. But:

- if there are optimizations that apply to some encodings but not others, the encodable method can take advantage of them without it being a promise of the language;

- it only adds a single string method (and presumably a single bytes method, decodable) rather than a plethora of methods.

So, I don't care much either way for a LBYL test, but if there is a good use case for such a test, better for it to be a single method taking the encoding name rather than a multitude of tests, or exposing an implementation-specific value that the coder then has to interpret themselves.

-1 on isascii, islatin1, isbmp
-1 on exposing max_code_point
+0.5 on encodable

--
Steven
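For completeness, the EAFP fallback pattern packaged as a small reusable helper (encode_with_fallback is an illustrative name, not anything from the stdlib):

    # Illustrative helper: return (payload, charset) using the first
    # candidate encoding that can represent the text.

    def encode_with_fallback(text, encodings=('ascii', 'latin1', 'utf-8')):
        for encoding in encodings:
            try:
                return text.encode(encoding), encoding
            except UnicodeEncodeError:
                continue
        raise ValueError("no candidate encoding can represent the text")

    body, header = encode_with_fallback("naïve text")
    assert header == 'latin1' and isinstance(body, bytes)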
On 01.07.12 02:22, Greg Ewing wrote:
Serhiy Storchaka wrote:
Several enchantments have already been rejected for this reason.
        ^^^^^^^^^^^^
Yeah, programming does seem to be a black art sometimes...
Sorry, I meant enhancements. But yes, there is no good programming without magic.
On 6/30/2012 11:21 PM, Steven D'Aprano wrote:
Terry Reedy wrote:
On 6/30/2012 8:59 PM, Steven D'Aprano wrote:
I suggest that a better API would be a method that takes the name of an encoding (perhaps defaulting to 'ascii') and returns True|False:
string.encodable(encoding='ascii') -> True|False
Return True if string can be encoded using the named encoding, otherwise False.
But then one might as well try the encoding and check for exception. The point of the proposal is to avoid things like
    try:
        body = text.encode('ascii')
        header = 'ascii'  # abbreviating here
    except UnicodeEncodeError:
        try:
            body = text.encode('latin1')
            header = 'latin1'
        except UnicodeEncodeError:
            body = text.encode('utf-8')
            header = 'utf-8'
Right. And re-written with the hypothetical encodable method, you have the usual advantage of LBYL that it is slightly more concise:
    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        if text.encodable(encoding):
            body = text.encode(encoding)
            header = encoding
            break
But you are doing about half the work twice.
instead of:
    body = header = None
    for encoding in ('ascii', 'latin1', 'utf-8'):
        try:
            body = text.encode(encoding)
            header = encoding
            break
        except UnicodeEncodeError:
            pass
As far as expressibility goes, it is not much of an advantage. But:
- if there are optimizations that apply to some encodings but not others, the encodable method can take advantage of them without it being a promise of the language;
It would be an optimization limited to a couple of encodings with CPython. Using it for cross-version code would be something like the trap of depending on the CPython optimization of repeated string concatenation.
- it only adds a single string method (and presumably a single bytes method, decodable) rather than a plethora of methods;
Decodable would always require a scan of the bytes. Might as well just decode and look for UnicodeDecodeError.
So, I don't care much either way for a LBYL test, but if there is a good use case for such a test,
My claim is that there is only a good use case if it is O(1), which would only be a few cases on CPython.
better for it to be a single method taking the encoding name rather than a multitude of tests, or exposing an implementation-specific value that the coder then has to interpret themselves.
-1 on isascii, islatin1, isbmp
I do not see much of any use for isbmp. Maybe I missed something in the original post.
-1 on exposing max_code_point
Jython and IronPython are stuck with the underlying platform implementations, which I believe are like the current semi-utf-16 narrow builds. So it would have to be a CPython-only attribute for now. (PyPy might consider adopting the new Unicode implementation someday too.)
+0.5 on encodable
encodable would indirectly expose max_code_point since it would only be really useful, and likely used, when max_code_point was available and applicable. In other words, s.encodable('latin1') is equivalent to s.max_code_point <= 255.

If isbmp *is* useful, I don't think it can be duplicated with .encodable. Python seems not to have a ucs-2 codec.

--
Terry Jan Reedy
On Sun, Jul 1, 2012 at 3:48 PM, Terry Reedy <tjreedy@udel.edu> wrote:
encodable would indirectly expose max_code_point since it would only be really useful, and likely used, when max_code_point was available and applicable. In other words, s.encodable('latin1') is equivalent to s.max_code_point <= 255.

If isbmp *is* useful, I don't think it can be duplicated with .encodable. Python seems not to have a ucs-2 codec.
Rewinding back to the reasons the question is being asked, the reason this information is useful at the Python level is the same reason it is useful at the C level: it matters for finding the most efficient means of representing the text as bytes (which can then have further implications for the kind of quoting used, etc).

The interesting breakpoints can actually be expressed in terms of the number of bits in the highest code point:

    7  - encode as ASCII (or latin-1 or utf-8)
    8  - encode as latin-1
    8+ - encode as utf-8

Specifically, it's a payload microoptimisation for the latin-1 case - the latin-1 string will be shorter than the corresponding utf-8 string (how much shorter depends on the number of non-ASCII characters). I believe it also makes an additional difference in the email case by changing the kind of quoting that is used to something with lower overhead that can't handle utf-8.

The "try it and see" approach suffers a potentially high speed penalty if the non-latin-1 characters appear late in the string:

    try:
        # Likely no need to try ASCII, since there's no efficiency
        # gain over latin-1
        payload = message.encode("latin-1")
    except UnicodeEncodeError:
        payload = message.encode("utf-8")

Using max() and ord() to check in advance doesn't help, since that *locks in* the O(n) penalty.

The reason I think a max_code_point() method is a good potential solution is that it can be advertised as O(n) worst case, but potentially O(1) if the implementation caches the answer internally.

Another alternative would be a __max__ and __min__ protocol that allowed efficient answers for the max() and min() builtins. The latter would have the advantage of allowing other containers (like range objects) to provide efficient implementations.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
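A small sketch of the selection logic Nick describes, written against a max_code_point() helper that is emulated here with max() and ord() -- exactly the O(n) fallback he notes an implementation could improve on:

    # Sketch: pick the smallest suitable encoding from the highest code
    # point in the text.  max(map(ord, ...)) is the O(n) fallback; the
    # idea in the thread is that an implementation could answer the same
    # question in O(1) from its internal representation.

    def max_code_point(text):
        return max(map(ord, text)) if text else 0

    def choose_encoding(text):
        cp = max_code_point(text)
        if cp <= 0x7F:
            return "ascii"      # 7 bits
        if cp <= 0xFF:
            return "latin-1"    # 8 bits
        return "utf-8"          # anything wider

    message = "Grüße"           # highest code point is U+00FC
    encoding = choose_encoding(message)
    payload = message.encode(encoding)
    assert encoding == "latin-1" and len(payload) == 5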
On Sun, Jul 01, 2012 at 01:48:14AM -0400, Terry Reedy wrote:
As far as expressibility goes, it is not much of an advantage. But:
- if there are optimizations that apply to some encodings but not others, the encodable method can take advantage of them without it being a promise of the language;
It would be an optimization limited to a couple of encodings with CPython. Using it for cross-version code would be something like the trap of depending on the CPython optimization of repeated string concatenation.
I'd hardly call it a trap. It's not like string concatenation, which is expected to be O(N) on CPython but occasionally falls back to O(N**2). It would be expected to be O(N) on all platforms, but occasionally do better. Perhaps an anti-trap -- sometimes it does better than expected, rather than worse.
- it only adds a single string method (and presumably a single bytes method, decodable) rather than a plethora of methods;
Decodable would always require a scan of the bytes. Might as well just decode and look for UnicodeDecodeError.
*shrug* Perhaps so. bytes.decodable() would only be a LBYL convenience method.
So, I don't care much either way for a LBYL test, but if there is a good use case for such a test,
My claim is that there is only a good use case if it is O(1), which would only be a few cases on CPython.
*shrug* Again, I'm not exactly championing this proposal. I can see that an encodable method would be useful, but not that much more useful than trying to encode and catching the exception. A naive O(N) version of encodable() is trivial to implement. -- Steven
On Sun, Jul 01, 2012 at 04:27:25PM +1000, Nick Coghlan wrote:
Rewinding back to the reasons the question is being asked, the reason this information is useful at the Python level is the same reason it is useful at the C level: it matters for finding the most efficient means of representing the text as bytes (which can then have further implications for the kind of quoting used, etc). The interesting breakpoints can actually be expressed in terms of the number of bits in the highest code point:

    7  - encode as ASCII (or latin-1 or utf-8)
    8  - encode as latin-1
    8+ - encode as utf-8
I'm of two minds here. On the one hand, I question the wisdom of encouraging the use of anything but UTF-8. It's unfortunate enough that there are still cases where people have to use older encodings, without encouraging people to use Latin1 or ASCII in order to save a handful of bytes in a 20K email. On the other hand, there are use-cases for non-UTF-8 encodings, and people will want to check whether or not a string is encodable in various encodings. Why make that harder/slower/less convenient than it need be?
Specifically, it's a payload microoptimisation for the latin-1 case - the latin-1 string will be shorter than the corresponding utf-8 string
Just to be clear here, you're referring to byte strings, yes?
(how much shorter depends on the number of non-ASCII characters). I believe it also makes an additional difference in the email case by changing the kind of quoting that is used to something with lower overhead that can't handle utf-8.
The "try it and see" approach suffers a potentially high speed penalty if the non-latin-1 characters appear late in the string:
    try:
        # Likely no need to try ASCII, since there's no efficiency
        # gain over latin-1
        payload = message.encode("latin-1")
    except UnicodeEncodeError:
        payload = message.encode("utf-8")
Using max() and ord() to check in advance doesn't help, since that *locks in* the O(n) penalty.
The reason I think a max_code_point() method is a good potential solution is that it can be advertised as O(n) worst case, but potentially O(1) if the implementation caches the answer internally.
The downside is that the caller is then responsible for interpreting that value (i.e. mapping a max code point to an encoding). The other downside is that it doesn't do anything to help those who are stuck with legacy encodings. Although maybe that doesn't matter, since they will just do the "try it and see" approach.
Another alternative would be a __max__ and __min__ protocol that allowed efficient answers for the max() and min() builtins. The latter would have the advantage of allowing other containers (like range objects) to provide efficient implementations.
+1 on that, although I think that should be a separate issue. -- Steven
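No __max__/__min__ protocol exists in Python, and the builtin max() consults no such hook; purely to illustrate the idea, a user-level max() lookalike might look like this:

    # Hypothetical __max__ protocol sketch -- NOT a real Python feature.
    # The builtin max() does not consult any such hook; this only
    # illustrates how a container could answer in O(1) without a scan.

    def pmax(obj):
        """max() lookalike that prefers an object's __max__ hook if present."""
        hook = getattr(type(obj), "__max__", None)
        if hook is not None:
            return hook(obj)
        return max(obj)

    class CodePoints:
        """Toy container that remembers its maximum code point."""
        def __init__(self, text=""):
            self._items = [ord(c) for c in text]
            self._max = max(self._items) if self._items else None

        def __iter__(self):
            return iter(self._items)

        def __max__(self):
            if self._max is None:
                raise ValueError("max() of empty container")
            return self._max    # O(1), no scan needed

    assert pmax(CodePoints("Grüße")) == 0xFC
    assert pmax([3, 1, 2]) == 3     # plain lists fall back to the builtin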
On Sat, Jun 30, 2012 at 12:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
As shown in issue #15016 [1], there are use cases where it is useful to determine whether a string can be encoded in ASCII or Latin-1. When working with Tk or Windows console applications, it can be useful to determine whether a string can be encoded in UCS-2. The C API provides an interface for this, but it is not available at the Python level.

I propose to add new methods to the str class: isascii(), islatin1() and isbmp() (in addition to such methods as isalpha() or isdigit()). The implementation will be trivial.

Pro: The current trick of trying to encode has O(n) complexity and the overhead of exception raising/catching.

Contra: In most cases, after determining the character range we still need to encode the string with the appropriate encoding. New methods will further complicate the already overloaded str class.
Objections?
-1

It doesn't make sense to special-case them instead of adding a simpler canencode() method. It could save memory, but I don't see it saving time.
[1] http://bugs.python.org/issue15016
--
Read my blog! I depend on your acceptance of my opinion! I am interesting!
http://techblog.ironfroggy.com/
Follow me if you're into that sort of thing: http://www.twitter.com/ironfroggy
On 02.07.12 11:52, Steven D'Aprano wrote:
Another alternative would be a __max__ and __min__ protocol that allowed efficient answers for the max() and min() builtins. The latter would have the advantage of allowing other containers (like range objects) to provide efficient implementations.
+1 on that, although I think that should be a separate issue.
This is issue #15226.
On 7/2/2012 7:32 AM, Serhiy Storchaka wrote:
On 02.07.12 11:52, Steven D'Aprano wrote:
Another alternative would be a __max__ and __min__ protocol that allowed efficient answers for the max() and min() builtins. The latter would have the advantage of allowing other containers (like range objects) to provide efficient implementations.
+1 on that, although I think that should be a separate issue.
This is issue #15226.
http://bugs.python.org/issue15226 was about exposing the max codepoint, which the OP thought was readily available in C, but is not. (It is at creation, but it is then replaced by 1 of 4 values.) -- Terry Jan Reedy
participants (10): Antoine Pitrou, Benjamin Peterson, Calvin Spealman, Christopher Reay, Greg Ewing, Matt Chaput, Nick Coghlan, Serhiy Storchaka, Steven D'Aprano, Terry Reedy