Mini-Pep: An Empty String ABC
Target: Py2.6 and Py3.0
Author: Raymond Hettinger

Proposal
--------

Add a new collections ABC specified as:

    class String(Sequence):
        pass

Motivation
----------

Having an ABC for strings allows string look-alike classes to declare themselves as sequences that contain text. Client code (such as a flatten operation or tree searching tool) may use that ABC to usefully differentiate strings from other sequences (i.e. containers vs containees). And in code that only relies on sequence behavior, isinstance(x,str) may be usefully replaced by isinstance(x,String) so that look-alikes can be substituted in calling code.

A natural temptation is to add other methods to the String ABC, but strings are a tough case. Beyond simple sequence manipulation, the string methods get very complex. An ABC that included those methods would make it tough to write a compliant class that could be registered as a String. The split(), rsplit(), partition(), and rpartition() methods are examples of methods that would be difficult to emulate correctly. Also, starting with Py3.0, strings are essentially abstract sequences of code points, meaning that an encode() method is essential to being able to usefully transform them back into concrete data. Unfortunately, the encode method is so complex that it cannot be readily emulated by an aspiring string look-alike.

Besides complexity, another problem with the concrete str API is the extensive number of methods. If string look-alikes were required to emulate the likes of zfill(), ljust(), title(), translate(), join(), etc., it would significantly add to the burden of writing a class complying with the String ABC.

The fundamental problem is that of balancing a client function's desire to rely on a broad number of behaviors against the difficulty of writing a compliant look-alike class. For other ABCs, the balance is more easily struck because the behaviors are fewer in number, because they are easier to implement correctly, and because some methods can be provided as mixins. For a String ABC, the balance should lean toward minimalism due to the large number of methods and how difficult it is to implement some of them correctly.

A last reason to avoid expanding the String API is that almost none of the candidate methods characterize the notion of "stringiness". With something calling itself an integer, an __add__() method would be expected as it is fundamental to the notion of "integeriness". In contrast, methods like startswith() and title() are non-essential extras -- we would not discount something as being not stringlike if those methods were not present.
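As an illustration of the motivation above, here is a minimal sketch of how the empty ABC might be used by a flatten operation. It assumes a modern Python 3 (collections.abc and yield from did not exist in the targeted Py2.6/Py3.0). Registering str is an assumption, since the proposal does not say which types would be registered, and the Name class and this flatten function are hypothetical names invented for the example.

```python
from collections.abc import Sequence


class String(Sequence):
    """The empty ABC from the proposal: a marker for text-like sequences."""


# Assumption: the proposal does not list which types get registered.
String.register(str)


class Name(String):
    """Hypothetical string look-alike backed by a list of characters."""

    def __init__(self, chars):
        self._chars = list(chars)

    def __getitem__(self, index):
        return self._chars[index]

    def __len__(self):
        return len(self._chars)


def flatten(obj):
    """Yield the leaves of a nested sequence, treating Strings as leaves."""
    if isinstance(obj, String) or not isinstance(obj, Sequence):
        # Containee: a String look-alike or a non-sequence value.
        yield obj
    else:
        # Container: descend into each item.
        for item in obj:
            yield from flatten(item)
```

With this sketch, flatten([1, "ab", [Name("cd"), (2, "e")]]) yields the integer, the two strings, and the Name instance as leaves without iterating over their characters.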
Raymond Hettinger wrote:
Also, starting with Py3.0, strings are essentially abstract sequences of code points, meaning that an encode() method is essential to being able to usefully transform them back into concrete data.
Well, that depends:
- is a String the specification of a generic range of types which one might want to special-case in some algorithms, e.g. flatten()
- or is a String the specification of something which is meant to be used as a replacement of str (or, perhaps, bytes)?

If you answer the former, the String API should be very minimal and there is no reason for it to support "encoding" or "decoding". Such a String doesn't have to be a string of characters, it can contain arbitrary objects, e.g. DNA elements.

If you answer the latter, what use is a String subclass which isn't a drop-in replacement for either str or bytes? Saying "hello, I'm a String" is not very useful if you can't be used anywhere in existing code. I think most Python coders wouldn't go out of their way to allow arbitrary String instances as parameters for their functions, rather than objects conforming to the full str (or, perhaps, bytes) API.

I'd like to know the use cases of a String ABC representing replacements of the str class, though. I must admit I've never used UserString and the like, and don't know how useful they can be. However, the docs have the following to say:

« This UserString class from this module is available for backward compatibility only. If you are writing code that does not need to work with versions of Python earlier than Python 2.2, please consider subclassing directly from the built-in str type instead of using UserString ».

So, apart from compatibility purposes, what is the point currently of *not* directly subclassing str?

Regards

Antoine.
This PEP is incomplete without specifying exactly which built-in and
stdlib types should be registered as String instances.
I'm also confused -- the motivation seems mostly "so that you can skip
iterating over it when flattening a nested sequence" but earlier you
rejected my "Atomic" proposal, saying "Earlier in the thread it was
made clear that that atomicity is not an intrinsic property of a type;
instead it varies across applications [...]". Isn't this String
proposal just that by another name?
Finally, I fully expect lots of code writing isinstance(x, String) and
making many more assumptions than promised by the String ABC. For
example that s[0] has the same type as s (not true for bytes). Or that
it is hashable (the Sequence class doesn't define __hash__). Or that
s1+s2 will work (not in the Sequence class either). And many more.
All this makes me lean towards a rejection of this proposal -- it
seems worse than no proposal at all. It could perhaps be rescued by
adding some small set of defined operations.
--Guido
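The three assumptions named in the message above can be checked directly. The following is a small sketch, assuming a modern Python 3 with collections.abc. Chars and can_concatenate are hypothetical names introduced only for this illustration.

```python
from collections.abc import Hashable, Sequence


class Chars(Sequence):
    """Hypothetical look-alike that implements only what Sequence requires."""

    def __init__(self, chars):
        self._chars = list(chars)

    def __getitem__(self, index):
        return self._chars[index]

    def __len__(self):
        return len(self._chars)


def can_concatenate(a, b):
    """Return True if a + b works, False if it raises TypeError."""
    try:
        a + b
    except TypeError:
        return False
    return True


# s[0] has the same type as s for str, but not for bytes.
str_item_type = type("abc"[0])      # str
bytes_item_type = type(b"abc"[0])   # int

# Hashability varies among Sequence types: tuple is hashable, list is not.
tuple_is_hashable = issubclass(tuple, Hashable)
list_is_hashable = issubclass(list, Hashable)

# Concatenation is not part of Sequence, so a minimal look-alike lacks it.
chars_concatenate = can_concatenate(Chars("ab"), Chars("cd"))
```

Code that tests only isinstance(x, String) and then relies on any of these behaviors would work for str but could fail for other registered types.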
On Sat, May 31, 2008 at 11:59 PM, Raymond Hettinger wrote:
[...]
Guido van Rossum wrote:
This PEP is incomplete without specifying exactly which built-in and stdlib types should be registered as String instances.
I'm also confused -- the motivation seems mostly "so that you can skip iterating over it when flattening a nested sequence" but earlier you rejected my "Atomic" proposal, saying "Earlier in the thread it was made clear that that atomicity is not an intrinsic property of a type; instead it varies across applications [...]". Isn't this String proposal just that by another name?
Finally, I fully expect lots of code writing isinstance(x, String) and making many more assumptions than promised by the String ABC. For example that s[0] has the same type as s (not true for bytes). Or that it is hashable (the Sequence class doesn't define __hash__). Or that s1+s2 will work (not in the Sequence class either). And many more.
I think the PEP also needs to explain why having multiple small one-off string ABCs is a bad thing. The whole point of providing a standard ABC mechanism is to enable exactly that: allowing a library to say "Here is my concept of what a string class needs to provide - register with this ABC to tell me that I can use your class without blowing up unexpectedly". The library can then preregister a bunch of other classes it knows about that do the right thing (such as the builtin str type).

That is, to write a flatten operation with ABCs you might do something along the lines of:

    from abc import ABCMeta

    class Atomic(metaclass=ABCMeta):
        """ABC for iterables that the flatten function will not expand"""

    Atomic.register(str)  # Consider builtin strings to be atomic

    def flatten(obj, atomic=Atomic):
        itr = None
        if not isinstance(obj, atomic):
            try:
                itr = iter(obj)
            except (TypeError, AttributeError):
                pass
        if itr is not None:
            for item in itr:
                # Pass atomic along so a caller-supplied ABC also
                # applies to nested items
                for x in flatten(item, atomic):
                    yield x
        else:
            yield obj

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org
From: "Guido van Rossum"
All this makes me lean towards a rejection of this proposal -- it seems worse than no proposal at all. It could perhaps be rescued by adding some small set of defined operations.
By subclassing Sequence, we get index() and count() mixins for free.

We can also add other mixin freebies like __hash__(), __eq__(), __ne__(), endswith(), startswith(), find(), rfind(), and rindex().

It's tempting to add center, ljust, rjust, and zfill, but those require some sort of constructor that accepts an iterable argument.

As important as what is included are the methods intentionally left out. I'm trying to avoid insisting on abstractmethods like encode(), split(), join(), and other methods that place an undue burden on a class being registered as a String.

Raymond
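A rough sketch of what such mixin freebies could look like follows. It is not an implementation from the thread: the method bodies are assumptions that rely only on __getitem__ and __len__, it covers just three of the listed methods, and CharList is a hypothetical look-alike. It assumes a modern Python 3 with collections.abc.

```python
from collections.abc import Sequence


class String(Sequence):
    """Sketch of a String ABC with a few mixin methods.

    Only __getitem__ and __len__ stay abstract (inherited from Sequence).
    """

    def startswith(self, prefix):
        n = len(prefix)
        return n <= len(self) and all(self[i] == prefix[i] for i in range(n))

    def endswith(self, suffix):
        n = len(suffix)
        offset = len(self) - n
        return offset >= 0 and all(
            self[offset + i] == suffix[i] for i in range(n)
        )

    def find(self, sub):
        """Return the lowest index where sub occurs as a run, or -1."""
        n = len(sub)
        for i in range(len(self) - n + 1):
            if all(self[i + j] == sub[j] for j in range(n)):
                return i
        return -1


class CharList(String):
    """Hypothetical look-alike backed by a list of characters."""

    def __init__(self, chars):
        self._chars = list(chars)

    def __getitem__(self, index):
        return self._chars[index]

    def __len__(self):
        return len(self._chars)
```

A class like CharList gets startswith(), endswith(), and find() from the sketch, plus index() and count() from Sequence, by writing only two methods. None of these needs to construct a new instance, which is the obstacle mentioned above for center, ljust, rjust, and zfill.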
Please try to find the largest set of methods that you're comfortable
with. __add__ comes to mind.
Note that if you add __hash__, this rules out bytearray -- is that
your intention? __hash__ is intentionally not part of the "read-only"
ABCs because read-only doesn't mean immutable.
Also, (again) please list which built-in types you want to register.
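The point about bytearray can be verified with a short check. This is a sketch assuming a modern Python 3 with collections.abc, and is_hashable is a hypothetical helper written for the illustration.

```python
from collections.abc import Hashable, Sequence


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Map each type name to whether an instance of it can be hashed.
results = {
    type(obj).__name__: is_hashable(obj)
    for obj in ("abc", b"abc", bytearray(b"abc"))
}

# bytearray is a Sequence (it is mutable) but does not qualify as Hashable.
bytearray_is_sequence = issubclass(bytearray, Sequence)
bytearray_is_hashable = issubclass(bytearray, Hashable)
```

A String ABC that required __hash__ would therefore admit str and bytes but exclude bytearray.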
On Mon, Jun 2, 2008 at 1:54 PM, Raymond Hettinger wrote:
[...]
Raymond Hettinger wrote:
By subclassing Sequence, we get index() and count() mixins for free.
It seems to me that Sequence.index()/count() and String.index()/count() shouldn't have the same semantics. In the former case they search for items in the Sequence, in the latter case they search for substrings of the String.
From: "Antoine Pitrou"
It seems to me that Sequence.index()/count() and String.index()/count() shouldn't have the same semantics. In the former case they search for items in the Sequence, in the latter case they search for substrings of the String.
And the same applies to __contains__().

Raymond
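The difference in semantics can be seen by running the Sequence mixins against the same characters stored as a list. This is a sketch assuming a modern Python 3 with collections.abc, and item_index is a hypothetical helper that turns the ValueError into -1 for easier comparison.

```python
from collections.abc import Sequence

text = "banana"
items = list(text)  # the same characters as a plain sequence of items


def item_index(seq, value):
    """Sequence.index() on seq, returning -1 instead of raising ValueError."""
    try:
        return Sequence.index(seq, value)
    except ValueError:
        return -1


# str methods search for substrings.
substring_index = text.index("na")
substring_count = text.count("an")
substring_found = "nan" in text

# The Sequence mixins compare the argument against individual items,
# so a multi-character value never matches a single-character item.
item_level_index = item_index(items, "na")
item_level_count = Sequence.count(items, "an")
item_level_found = "nan" in items
```

The two agree only when the search value is a single item, for example Sequence.index(items, "n") and text.index("n") both give 2.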
Given the lack of responses to my questions, let's reject this.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
participants (4)
- Antoine Pitrou
- Guido van Rossum
- Nick Coghlan
- Raymond Hettinger