Access (ordered) dict by index; insert slice
Date: Fri, 26 Jun 2020 18:47:44 +0200
From: Hans Ginzel <hans@matfyz.cz>
To: Hans Ginzel <hans@artax.karlin.mff.cuni.cz>
Subject: Access (ordered) dict by index; insert slice

Hello,

thank you for making dict ordered. Is it planned to allow access to key/value pair(s) by index? See https://stackoverflow.com/a/44687752/2556118 for example. Both for reading and (re)writing? Is it planned to allow inserting pair(s) at an exact index? Or, more generally, to slice? See splice() in Perl, https://perldoc.perl.org/functions/splice.html.

Use case: representing database table metadata (columns). It is useful both to access columns by name or by index, and to insert a column at a specific position, https://dev.mysql.com/doc/refman/8.0/en/alter-table.html, “ALTER TABLE ADD COLUMN [FIRST | AFTER col]” (consider default column order, or optimising table storage size by alignment).

Thank you in advance,
Hans

PS1: Named tuples cannot be used; they are immutable.
PS2: See https://metacpan.org/pod/perlref#Pseudo-hashes:-Using-an-array-as-a-hash
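For illustration, a rough sketch of what these operations look like today with a plain dict (the column dict and helper below are invented): positional reads go through an intermediate list, and a positional insert means rebuilding the dict.

    # Sketch of today's workarounds; `columns` is an invented example.
    columns = {"id": "INT", "name": "VARCHAR(40)", "created": "DATETIME"}

    # Read the nth (key, value) pair: needs an O(n) intermediate list.
    second = list(columns.items())[1]      # ('name', 'VARCHAR(40)')

    # Insert a pair at position i (like Perl's splice): rebuild the dict, O(n).
    def insert_at(d, i, key, value):
        items = list(d.items())
        items.insert(i, (key, value))
        return dict(items)

    columns = insert_at(columns, 1, "uuid", "CHAR(36)")
    # {'id': 'INT', 'uuid': 'CHAR(36)', 'name': 'VARCHAR(40)', 'created': 'DATETIME'}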
Why can't you do `tuple(dict.items())` to get your indexable pairs? Otherwise there are no plans: you would have to introduce a new method, since you can't assume that e.g. `0` isn't already being used as a dictionary key.
I think Hans would like to do `my_dict.items()[0]`, for example, which shouldn't conflict with anything.
On Fri, Jun 26, 2020 at 10:45:07AM -0700, Brett Cannon wrote:
Why can't you do `tuple(dict.items())` to get your indexable pairs?
I don't think that an immutable copy is going to help Hans with his use-case, since he already mentions that tuples don't solve his problem. Swapping to a list gives you a mutable sequence that makes inserting columns easy, but now lookups by column name are O(N) instead of O(1).

Hans, I think that a dict is probably not the best solution here, but you can use a dict in an augmented data structure. I would consider keeping your columns in a list, indexed by position, and keeping a mapping of column names to indexes in a dict. Whenever you insert or remove a column, you can update that mapping.

If this sounds like a lot of work, yes, it probably is, and making dicts perform that work for *every single dict* would slow down the language a lot. Dicts are critical for speed and performance because they are used so extensively. Globals, builtins, almost every module, class and instance use dicts internally, so they need to be as fast as possible.

If your experience with Perl tells you differently, please explain to us how Perl hashes work such that they support both fast hash lookup and positional indexing at the same time.

-- Steven
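A rough sketch of that augmented structure (class and method names invented for illustration): the columns live in a list for positional access, and a dict maps each name to its current index, updated on insert/remove.

    class ColumnTable:
        def __init__(self):
            self._columns = []          # list of (name, meta) pairs, in order
            self._index = {}            # name -> position

        def _reindex(self, start=0):
            # refresh positions for everything from `start` onwards
            for i in range(start, len(self._columns)):
                self._index[self._columns[i][0]] = i

        def insert(self, pos, name, meta):
            self._columns.insert(pos, (name, meta))
            self._reindex(pos)          # O(N) worst case, like list.insert

        def remove(self, name):
            pos = self._index.pop(name)
            del self._columns[pos]
            self._reindex(pos)

        def by_index(self, i):
            return self._columns[i]                     # O(1)

        def by_name(self, name):
            return self._columns[self._index[name]]     # O(1) via the dict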
Perhaps a NamedList would be helpful here? One could write one fairly easily.

-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Thank you,

On Fri, Jun 26, 2020 at 10:45:07AM -0700, Brett Cannon wrote:
Why can't you do `tuple(dict.items())` to get your indexable pairs?
Of course I can. But how expensive/efficient is it? What are the reasons why the object returned by dict.items() is not subscriptable – dict.items()[0]?
Otherwise there are no plans as you would have to introduce a new method as you can't assume e.g. `0` is being used as a dictionary key.
I see. What about using the items() method – dict.items(1) for the second pair, or items(1:3), or …?

H.
…
On Mon, Jun 29, 2020 at 9:12 PM Hans Ginzel <hans@matfyz.cz> wrote:
Thank you,
On Fri, Jun 26, 2020 at 10:45:07AM -0700, Brett Cannon wrote:
Why can't you do `tuple(dict.items())` to get your indexable pairs?
Of course I can. But how expensive/efficient is it? What are the reasons why the object returned by dict.items() is not subscriptable – dict.items()[0]?
Because dict is optimized for random access by key and for iteration, but not for random access by index. For example:

sample 1:

    items = [*d.items()]
    for i in range(len(items)):
        do(items[i])

sample 2:

    for i in range(len(d)):
        do(d.items()[i])  # if dict_items supports index access

sample 1 is O(n) where n = len(d), but sample 2 is O(n^2). By not supporting index access, dict_items prevents people from writing such inefficient code.

-- 
Inada Naoki <songofacandy@gmail.com>
On Tue, Jun 30, 2020 at 11:10:20AM +0900, Inada Naoki wrote:
On Mon, Jun 29, 2020 at 9:12 PM Hans Ginzel <hans@matfyz.cz> wrote:
What are the reasons why the object returned by dict.items() is not subscriptable – dict.items()[0]?
Because dict is optimized for random access by key and iteration, but not for random access by index.
But we're not talking about *dict*, we're talking about dict.items, which returns a set-like object:

    py> from collections.abc import Set
    py> isinstance({}.items(), Set)
    True

So dict.items isn't subscriptable because it's an unordered set, not a sequence.
for i in range(len(d)):
    do(d.items()[i])  # if dict_items supports index access

sample 1 is O(n) where n = len(d), but sample 2 is O(n^2).

By not supporting index access, dict_items prevents people from writing such inefficient code.
I don't think that this is a compelling or strong argument. We can already write inefficient code because dicts do *not* support index access:

    for i in range(len(d)):
        for j, item in enumerate(d.items()):
            if j == i:
                do(item)

I guess that adding index read access to dicts could be done, perhaps with a method:

    d.getitem(3)  # returns (key, value)

and I don't think that would be expensive. My recollection is that internally dicts have, or had, an internal array of keys, so it could be a constant-time operation to look up the key by index. We could allow setting by index too, say:

    d.setitem(3, value)

At worst, this would be O(N), which isn't too bad. There are many other O(N) operations in Python, such as list.index, str.upper, etc., and we trust people to use them appropriately, or at least for small N. I recently wrote O(N**3) code and don't care: I'm never going to need it for N greater than about 10, and usually more like 2, 3 or 4, so it will be fast enough for my needs.

This is easy enough to solve with an augmented dict. A sketch of a possible solution:

    class IndexableDict(dict):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._items = {i: key for i, key in enumerate(self.keys())}

        def getitem(self, i):
            key = self._items[i]
            return (key, self[key])

Overload the other methods as needed.

But even if we could do these efficiently, who would use them? If you need this for a table, you probably need to support insertion and deletion as well, maybe slices, and you probably need both columns and rows, so I really don't think that dict is the right place for this. I think people who need this functionality should look at other data structures optimized for what they need, not try to force dict to be a second-class table. Not every data structure should be forced into a dict.

But if a simple indexable dict is all you need, try writing a subclass. If you can find good uses for it, propose it to be added to the collections module. I would support that.

-- Steven
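For illustration, a quick look at how that sketch behaves, bearing in mind that _items is only built at construction time, so the mutating methods would need overloading to keep it current:

    # Using the IndexableDict sketch above:
    d = IndexableDict(a=1, b=2, c=3)
    d.getitem(0)   # ('a', 1)
    d.getitem(2)   # ('c', 3)

    # Mutation still works as on a plain dict, but _items goes stale:
    d["z"] = 26
    d.getitem(3)   # KeyError: 3 -- __setitem__ etc. would need overloading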
But we're not talking about *dict*, we're talking about dict.items which returns a set-like object:
py> from collections.abc import Set
py> isinstance({}.items(), Set)
True
So dict.items isn't subscriptable because it's an unordered set, not a sequence.
Or is it a set because it can’t be indexed? If I have the history right, dict.items was first implemented with the “old” dict implementation, which did not preserve order but did provide O(1) access, so making the dict views set-like was easy, and making them Sequences was impossible. But now dicts do preserve order, so making the dict views sequence-like IS possible, and can be done efficiently—so why not?

But if a simple indexable dict is all you need, try writing a subclass.

I don’t think it’s possible to make that efficient without access to the dict internals.

-CHB

-- 
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Mon, Jun 29, 2020 at 11:02:54PM -0700, Christopher Barker wrote:
So dict.items isn't subscriptable because it's an unordered set, not a sequence.
Or is it a set because it can’t be indexed? If I have the history right, dict.items was first implemented with the “old” dict implementation, which did not preserve order, but did provide O(1) access, so making the dict views set-like was easy, and making them Sequences was impossible.
No, making them set-like was intentional. The history goes something like this:

In Python 1, and early Python 2, dict.keys(), .values() and .items() each returned lists. Some time by 2.4 (possibly earlier) dicts grew additional methods returning iterators:

    dict.iterkeys() .itervalues() .iteritems()

As part of the migration from Python 2 to 3, it was planned to change them to set-like views:

https://www.python.org/dev/peps/pep-3106/

so as an aid to migration, these set-like views were added to 2.7:

    dict.viewkeys() .viewvalues() .viewitems()

and finally in 3.x, the list and iterator versions were all dropped and the views were renamed.

So there was already a sequence version of items etc, and the set-like views were explicitly intended to be set-like.

-- Steven
Thanks for the details, but I just re-read the PEP, and I'm not sure my hand wavy version is wrong:
there was already a sequence version of items etc
Well, yes, but they explicitly created list copies -- there was no sequence-like version that was a view.

The first improvement on that was the iter* methods, and then (inspired by Java) it was determined that we could have view objects that could provide efficient iterators, and also set-like behavior. But I don't see any comments in the PEP about why indexing was rejected; it's implicit in the fact that dicts were explicitly unordered at the time that PEP was written.

Now that dicts do preserve order, and there is an implementation that can provide somewhat efficient indexing (not O(1), but O(N) with a small constant), there is the opportunity to add a new feature.

It does bring up a question: if the view objects support indexing, it would be a bit odd for two dict view objects to be equal if they have the same elements but not the same order, yet that would be required to remain consistent with the way dicts currently are (and will remain). One way to think about it is that adding indexing to the views would NOT make them Sequences, it would simply provide that one feature.

So how useful is that feature? I can recall a couple of use cases (actually, essentially one), which is selecting an arbitrary or random item from a dict, i.e. being able to pass a dict view to random.choice(). As it happens, I've had that use case, but if that's the only one folks can think of, then maybe it's not particularly worthwhile.

Final note about performance: yes, we don't want to encourage folks to use not-great-performance approaches, but the alternative to the O(N)-with-small-constant performance being proposed here is to make a Sequence out of the views and then index into them, which is O(N) with a larger constant and more memory use.

-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
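As a small illustration of that random-selection use case (a sketch; the dict is invented):

    import random

    d = {"red": 1, "green": 2, "blue": 3}

    # Today: random.choice() needs indexable input, so the keys are copied first.
    key = random.choice(list(d))          # O(n) copy on every call

    # random.choice(d.keys()) currently raises TypeError; with indexable views
    # (the idea being discussed) it could work without the intermediate list.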
On 30/06/2020 18:31, Christopher Barker wrote:
The first improvement on that was the iter* methods, and then (inspired by JAVA) it was determined that we could have view objects that could provide efficient iterators, and also set-like behavior. But I don't see any comments in the PEP about why indexing was rejected, but it's implicit in the fact that dicts were explicitly unordered at the time that PEP was written.
I'm sorry, I don't think that logic holds water. Dicts at the time were explicitly unordered (and I still mostly find it more useful to think of them as unordered), but iterators are explicitly ordered. You probably couldn't rely on that order being consistent if you fiddled with the dict between iterations, but that's not behaviour we encourage with list-handling either.

Don't get me wrong, if it's not going to cause performance issues I think being able to index views would be great, but I don't think this is the right way to justify it.

If anyone who knows could comment on the feasibility, that would be great too :-)

-- 
Rhodri James *-* Kynesim Ltd
On Tue, Jun 30, 2020 at 11:55 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 30/06/2020 18:31, Christopher Barker wrote:
The first improvement on that was the iter* methods, and then (inspired by JAVA) it was determined that we could have view objects that could provide efficient iterators, and also set-like behavior. But I don't see any comments in the PEP about why indexing was rejected, but it's implicit in the fact that dicts were explicitly unordered at the time that PEP was written.
I'm sorry, I don't think that logic holds water. Dicts at the time were explicitly unordered (and I still mostly find it more useful to think of them as unordered), but iterators are explicitly ordered. You probably couldn't rely on that order being consistent if you fiddled with the dict between iterations, but that's not behaviour we encourage with list-handling either.
Don't get me wrong, if it's not going to cause performance issues I think being able to index views would be great, but I don't think this is the right way to justify it.
I wasn't justifying it -- I was simply saying that no one ever specifically decided that we DIDN'T want to be able to index dict views -- when they were created, it simply wasn't an option. Now it is, so we need to rethink it.
if it's not going to cause performance issues
I can't see how it could possibly *cause* performance issues. The performance isn't as great as we would like, but I think it's still as good as, and probably much better than, any other way to accomplish the same thing.

Even if the performance were bad, I don't think we'd be handing out a foot gun here: who's going to write

    my_dict.items[n]

when they know what the key is for the nth item?

Other than the usual "more code to write and maintain", I still don't see a downside.

-CHB
If anyone who knows could comment on the feasibility, that would be great too :-)
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Wed, 1 Jul 2020 at 05:36, Christopher Barker <pythonchb@gmail.com> wrote:
On Tue, Jun 30, 2020 at 11:55 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
Don't get me wrong, if it's not going to cause performance issues I think being able to index views would be great, but I don't think this is the right way to justify it.
I wasn't justifying it -- I was simply saying that no one ever specifically decided that we DIDN'T want to be able to index dict views -- when they were created, it simply wasn't an option. Now it is, so we need to rethink it.
if it's not going to cause performance issues
I can't see how it could possibly *cause* performance issues. It isn't as great performance as we would like, but I think it's still as good, and probably much better, than any other way to accomplish the same thing.
If indexing is O(N) then repeated indexing can be a lot less efficient, e.g. if order is the same size as d then this is O(N):

    items = list(d.items())
    for n in order:
        do_stuff(items[n])

If the items view itself is indexable with O(N) cost then you can avoid constructing the list, but the result is quadratic:

    items = d.items()
    for n in order:
        do_stuff(items[n])

Standard containers right now are all O(1) for indexing, so I think it would be surprising that quadratic behaviour could result from indexing an object rather than making a copy of it.

-- Oscar
On Tue, Jun 30, 2020 at 09:33:25PM -0700, Christopher Barker wrote:
I wasn't justifying it -- I was simply saying that no one ever specifically decided that we DIDN'T want to be able to index dict views -- when they were created, it simply wasn't an option. Now it is, so we need to rethink it.
Of course it was an option. Dicts always had *some* internal order, it is just that this internal order was an artifact of how the hash table was built, the history of insertions and deletions, etc. Keys appeared in some arbitrary order, not "no order at all". So there has always been the option to return the nth item according to that arbitrary order. It would be (presumably) the same order as items would appear if you iterated over the dict, or printed it.

The only difference now is that instead of having keys appear in some arbitrary order, we can guarantee they appear in a non-arbitrary order.

Under what circumstances would you find it useful to access the fifth key or item added to the dict?

    my_dict.items[4]

Having a consistent, predictable, deterministic order is useful. Having that deterministic order be insertion order is useful. Associating an explicit index with the items, I'm not so sure.
Even if the performance were bad, I don't think we'd be handing out a foot gun here: who's going to write:
my_dict.items[n]
when they know what the key is for the nth item?
Who is going to write it at all? This still strikes me as a feature in search of a use-case. -- Steven
On 01/07/2020 11:24, Steven D'Aprano wrote:
On Tue, Jun 30, 2020 at 07:40:23PM +0100, Rhodri James wrote:
Don't get me wrong, if it's not going to cause performance issues I think being able to index views would be great
What are your use-cases for indexing set-like views of a dict?
Personally none, but that's because I still don't treat dicts as being ordered and arrange my data structures accordingly. I seem to remember the OP having a use case, and I can imagine people who have ordering ingrained in their style rather harder than me could find uses. -- Rhodri James *-* Kynesim Ltd
On Wed, Jul 01, 2020 at 12:10:12PM +0100, Rhodri James wrote:
On 01/07/2020 11:24, Steven D'Aprano wrote:
On Tue, Jun 30, 2020 at 07:40:23PM +0100, Rhodri James wrote:
Don't get me wrong, if it's not going to cause performance issues I think being able to index views would be great
What are your use-cases for indexing set-like views of a dict?
Personally none, but that's because I still don't treat dicts as being ordered and arrange my data structures accordingly. I seem to remember the OP having a use case, and I can imagine people who have ordering ingrained in their style rather harder than me could find uses.
The OP's use-case, as far as I can tell, would not actually be helped by adding indexing to dicts. He has a table of columns, and needs to insert and delete columns at any position, including slices of multiple columns.

As far as I can see, the OP should be using a list of columns, not a dict, perhaps augmented by a mapping of column names to column position (assuming the names are unique). Or maybe some fancier data structure designed for this purpose.

-- Steven
One way to think about is that adding indexing to the views would NOT make them Sequences, it would simply provide that one feature
As you can see in an e-mail from me in another thread yesterday, `__eq__`, oddly enough, is not currently a Sequence method, if we take "Sequence" to mean the collections.abc.Sequence interface. (In that thread I am actually asking whether there would be any drawbacks in adding it, though.)
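A quick illustrative check of that claim about collections.abc.Sequence:

    import collections.abc as abc

    # Sequence only requires __getitem__ and __len__ ...
    print(abc.Sequence.__abstractmethods__)
    # frozenset({'__getitem__', '__len__'})  (order may vary)

    # ... and it does not define an __eq__ of its own; it inherits object's.
    print(abc.Sequence.__eq__ is object.__eq__)   # True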
Hi,

A quick collection of responses:

But we're not talking about *dict*, we're talking about dict.items which returns a set-like object:

    py> from collections.abc import Set
    py> isinstance({}.items(), Set)
    True
This property is nice, as .keys() (for example) can operate like a set and give lots of convenience features. However, this doesn't really mean that these objects are trying to *be* sets in any meaningful way: dictionaries are now ordered, and that is a property of dictionaries, whereas sets are explicitly unordered. Plus, the following example shows how quickly the set emulation falls down:

    py> set({'a': [1]}.items())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
for i in range(len(d)):
    do(d.items()[i])  # if dict_items supports index access
sample 1 is O(n) where n = len(d), but sample 2 is O(n^2).
By not supporting index access, dict_items prevents to write such inefficient code.
I don't think that this is a compelling or strong argument. We can write inefficient code because dicts do *not* support index access:
for i in range(len(d)):
    for j, item in enumerate(d.items()):
        if j == i:
            do(item)
I guess that adding index read access to dicts could be done, perhaps with a method:
d.getitem(3) # Returns (key, value)
and I don't think that would be expensive. My recollection is that internally dicts have, or had, an internal array of keys. So it could be a constant time operation to look up the key by index.
I don't think inventing new methods for emulating __getitem__ is that useful here; there is a well established standard for getting items from a container by index, and I think we should use it :).

Here's my understanding of the performance implications of allowing __getitem__ on dict view objects:

There are currently two variant layouts of PyDictObject: 'traditional' dicts and split dicts. Split dicts are stored as (two) compact arrays with an added hashtable, so O(1) lookup is possible. However, split dicts are *not* really relevant here, because typically the dicts in use aren't split (for example, mutating a split dict returns a 'traditional' dict).

Traditional dicts are stored as an array based on insertion order (like split dicts), *but* deletions result in the entry being NULLed out in place. Therefore index access is O(n), because you have to walk the items in the DK_ENTRIES list counting non-NULL keys. This should be really fast because it's just walking a list of pointers, no dereferences or anything, but it isn't O(1), which does make me slightly hesitant about the proposal.
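To illustrate the point, a toy model of that combined-table layout (plain Python, not CPython's actual C code): deleted slots stay behind as holes, so finding the i-th live entry means walking and counting.

    # Toy model of a combined-table dict's entries array (NOT CPython code).
    # Deleted entries leave a hole (represented here as None), so getting the
    # i-th live entry means scanning and counting non-holes: O(n).
    entries = [("a", 1), None, ("c", 3), ("d", 4), None, ("e", 5)]

    def entry_by_index(entries, index):
        live = 0
        for entry in entries:
            if entry is None:          # skip holes left by deletions
                continue
            if live == index:
                return entry
            live += 1
        raise IndexError(index)

    print(entry_by_index(entries, 2))  # ('d', 4)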
This is easy enough to solve with an augmented dict. A sketch of a possible solution: [...]
Implementing this functionality is trivial, but adding __getitem__ avoids these paper-cut-like surprises, and fixes code that should "just work", for example `random.choice`.
But even if we could do these efficiently, who would use them? If you need this for a table, you probably need to support insertion and deletion as well, maybe slices, and you probably need both columns and rows, so I really don't think that dict is the right place for this.
I don't think so; you can see some examples I dug out earlier in the thread (once my posts clear moderation) that support this use-case without needing mutation ability or any table structures.

Steve
On Tue, Jun 30, 2020 at 10:18:34AM +0100, Stestagg wrote:
This property is nice, as .keys() (for example) can operate like a set and give lots of convenience features. However this doesn't really mean that these objects are trying to *be* sets in any meaning way,
Please read the PEP: https://www.python.org/dev/peps/pep-3106/ It was an explicit design choice to make them be set-like, not an accident. The whole point was to offer a set interface.
Plus, the following example shows how quickly the set emulation falls-down:
py> set({'a': [1]}.items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
They aren't small-s sets, the concrete set class. But they are large-s Sets, the ABC. They provide the same set interface without needing elements to be hashable.

An analogy for you to think about: both floats and complex numbers are numbers. But:

    py> from numbers import Number
    py> isinstance(1.5, Number) and isinstance(1.5j, Number)
    True
    py> float(1.5j)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't convert complex to float

I trust you aren't going to argue that one or the other of floats or complex numbers aren't numbers.
I don't think inventing new methods for emulating __getitem__ is that useful here, there is a well established standard for getting items from a container by index, I think we should use it :).
We can't use dict[index] because the index can also be a key:

    d = {'a': None, 'b': None, 0: 999}
    d[0]  # return 'a' or 999?

So one way or another we have to invent a new standard for getting items by index position. Dict views are not containers, they are *views*, so it makes just as much sense (maybe more) to add a method on the dict itself to return the nth key as to pretend that set-like views are sequences.
Here's my understanding of the performance implications of allowing __getitem__ on dictview objects: [...] However, split dicts are *not* really relevant here, because typically dicts in use aren't split() (for example mutating a split dict) returns a traditional' dict.
Doesn't matter whether they are "typically" used or not, the same API must work regardless of whether the dict is a split dict or traditional dict. -- Steven
On Wed, Jul 1, 2020 at 12:18 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Jun 30, 2020 at 10:18:34AM +0100, Stestagg wrote:
This property is nice, as .keys() (for example) can operate like a set and give lots of convenience features. However this doesn't really mean that these objects are trying to *be* sets in any meaning way,
Please read the PEP:
https://www.python.org/dev/peps/pep-3106/
It was an explicit design choice to make them be set-like, not an accident. The whole point was to offer a set interface.
I'm pretty familiar with that PEP; I'm not sure how knowledge of its contents affects this idea/discussion in any material way.
Plus, the following example shows how quickly the set emulation falls-down:
py> set({'a': [1]}.items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
They aren't small-s sets, the concrete set class. But they are large-s Sets, the ABC. They provide the same set interface without needing elements to be hashable.
Right, as demonstrated in my example. The point being made was that the utility of trying to shoehorn the dict_* types into a Set pigeon-hole is quite limited, as practicalities get in the way.
I don't think inventing new methods for emulating __getitem__ is that useful here, there is a well established standard for getting items from a container by index, I think we should use it :).
We can't use dict[index] because the index can also be a key:
d = {'a': None, 'b': None, 0: 999}
d[0]  # return 'a' or 999?
The currently discussed option is for adding `__getitem__` to the dict_keys/dict_items/dict_values types, *NOT* the base `dict` type.
So one way or another we have to invent a new standard for getting items by index position. Dict views are not containers, they are *views*, so it makes just as much sense (maybe more) to add a method on the dict itself to return the nth key as to pretend that set-like views are sequences.
Here's my understanding of the performance implications of allowing __getitem__ on dictview objects: [...] However, split dicts are *not* really relevant here, because typically dicts in use aren't split() (for example mutating a split dict) returns a traditional' dict.
Doesn't matter whether they are "typically" used or not, the same API must work regardless of whether the dict is a split dict or traditional dict.
If you read the OP again you'll see that indexing the views of split dicts is trivially doable in O(1) time. My point about the "typical" usage is that having a variant that's O(1) is not material to the performance concerns, because in most cases an O(n) version will have to be used.

The point wasn't "It's atypical so we can ignore it", but rather "It's faster/better, but seldom available to be used; here's how we do it for the more common case...".
On Wed, Jul 01, 2020 at 01:36:34PM +0100, Stestagg wrote:
On Wed, Jul 1, 2020 at 12:18 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Jun 30, 2020 at 10:18:34AM +0100, Stestagg wrote:
This property is nice, as .keys() (for example) can operate like a set and give lots of convenience features. However this doesn't really mean that these objects are trying to *be* sets in any meaning way,
Please read the PEP:
https://www.python.org/dev/peps/pep-3106/
It was an explicit design choice to make them be set-like, not an accident. The whole point was to offer a set interface.
I'm pretty familiar with that pep, I'm not sure how knowledge of its contents affects this idea/discussion in any material way.
You: I don't think that dict views are trying to be sets.

Me: Dict views were designed right from the start to be like sets, as described in the PEP.

You: I don't think the PEP is relevant.

Okay, whatever you say.
The point being made was that the utility of trying to shoehorn the dict_* types into a Set pigeon-hole is quite limited, as practicalities get in the way.
*shrug*

They seem to work pretty well for many uses. Perhaps they could be better, but they are what they are, and there's no point pretending that they aren't large-s Sets when they clearly are.

    py> from collections.abc import Set
    py> isinstance({}.items(), Set)
    True

This is not an accident, it is not a mistake, it wasn't forced on the language because dicts used to be unordered. It was an intentional design choice.

If you think that *dict views are set-like* being an intentional choice is not important to the discussion, what do you consider important?

As far as I am concerned, making dict views into sequences by making them indexable should be ruled out. They were intended to be set-like, not sequence-like, and that is the end of the story. Adding indexing onto them is bolting on an API that was never part of the design.

I still don't think there are any compelling use-cases for indexing dicts, but if there are, we should offer a getter method on the dict object itself, not try to turn dict views into a hybrid set+sequence. Change my mind.
I don't think inventing new methods for emulating __getitem__ is that useful here, there is a well established standard for getting items from a container by index, I think we should use it :).
We can't use dict[index] because the index can also be a key:
d = {'a': None, 'b': None, 0: 999}
d[0]  # return 'a' or 999?
The currently discussed option is for adding`__getitem__` to the dict_keys/dict_items/dict_values types, *NOT* the base `dict` type.
You: We shouldn't invent a new way of indexing containers.

Also you: Let's invent a new way of indexing containers (subscripting on dict views).

So which is it?

Subscripting on the dict object is ruled out because it is ambiguous. Here are three possible choices (there may be others):

- Do nothing, we don't need this functionality, there are no compelling use-cases for it. Passing dict keys to random.choice is not a compelling use-case.

- Add a new interface by making dict views a hybrid set+sequence.

- Add a new getter method on the dict itself.

Aside from the first do-nothing option, one way or the other we're adding a new way of indexing the object we actually want to index, namely the dict. The choice is between a getter interface and a subscript interface on a proxy (a dict view). Dicts already offer getter interfaces (dict.get, dict.setdefault), so it is not like calling dict methods is unheard of.

[...]
The point wasn't that: "It's atypical so we can ignore it", but rather: "It's faster/better, but seldom available to be used, here's how we do it for the more common case..."
Okay, thanks for the correction. -- Steven
I'll try to unwind the rabbit holes a bit and suggest that the differences in opinion here boil down to: is it most useful to consider dict_keys/items/values as (this doesn't mean what isinstance() returns, or what a previous PEP has stated, just how one thinks about them):

1. ordered collections that have some set-like operators added for convenience (conceptually, if not exactly, class ...(Set, Sequence):), but missing a single method of the full Sequence interface (__getitem__), or

2. Sets that happen to have a stable/deterministic element order?

My opinion is that, as of Python 3.7, they are effectively 1, even if the isinstance hooks haven't been updated. I can see why people may think of them as 2. I don't have any desire to change minds on this :)

Semantic quibbling aside, my opinions/reasoning on the different options are the following:

* Do nothing: -0.5. I can live with the current situation; some things I do infrequently will be slightly less convenient, but I can work around them easily.

* Add numeric `__getitem__` to the `dict_*` view classes: +1. This restores code that used to work in Python 2 (see the short illustration after this message) and makes some things a bit easier. The O(n) behaviour is not ideal, but in my opinion it is an acceptable compromise.

* Add a new method to the dict class: -1. This doesn't feel like a good solution to me. Rather than continue to argue about the justification for why, let's just say it's a personal judgement.

I'm interested in any other opinions on these options.

Steve
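For illustration, a sketch of the Python 2 behaviour referred to above and the usual Python 3 workarounds (the dict below is invented):

    from itertools import islice

    d = {"a": 1, "b": 2, "c": 3}

    # Python 2: d.items() returned a plain list, so d.items()[0] just worked.
    # Python 3: items() returns a view, so you need a copy or an iterator:
    first = next(iter(d.items()))              # ('a', 1), no full copy
    third = list(d.items())[2]                 # ('c', 3), builds an O(n) copy
    second = next(islice(d.items(), 1, None))  # ('b', 2), skips without copying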
For what it's worth, I've attached a patch that implements this as a prototype.

It's definitely not mergeable right now. I'm not aware of any issues, but the code needs more error checking, and the coding style doesn't match CPython's. Having said that, it's a 70-line change, so fixing that up should be trivial once the change has some agreement behind it.

Steve
This new patch works for .keys(), .items(), and .values() (as intended by the original change).
For what it's worth, I've attached a patch that implements this as a prototype.
It's definitely not mergeable right now. I'm not aware of any issues, but the code needs more error checking, and the coding style doesn't match cpython's.
Having said that, it's a 70-line change, so fixing that up should be trivial once the change has some agreement behind it.
Steve
On Wed, Jul 1, 2020 at 4:31 PM Stestagg <stestagg@gmail.com> wrote:
I'll try to unwind the rabbit holes a bit and suggest that the differences in opinion here boil down to:
Is it most useful to consider dict_keys/items/values as: (This doesn't mean what isinstance() returns, or what a previous pep has stated, just how one thinks about them): 1. ordered collections that have some set-like operators added for convenience, (conceptually, if not exactly, class ...(Set, Sequence):) but are missing a single method full Sequence interface (__getitem__) or 2. Sets that happen to have a stable/deterministic element order
My opinion is that, as of Python 3.7, they are effectively 1, even if the isinstance hooks haven't been updated. I can see why people may think of them as 2. I don't have any desire to change minds on this :)
Semantic quibbling aside, My opinions/reasoning on the different options are the following:
* Do nothing: -0.5: I can live with the current situation, some things I do infrequently will be slightly less convenient, but can work around easily.
* Add numeric `__getitem__` to `dict_*` view classes: +1 - This restores code that used to work in Python 2, and makes some things a bit easier. The O(n) behaviour is not ideal, but in my opinion is an acceptable compromise.
* Add new method to dict class: -1 - This doesn't feel like a good solution to me. Rather than continue to argue about the justification for why, let's just say it's a personal judgement.
I'm interested in any other opinions on these options
Steve
On Wed, Jul 1, 2020 at 3:24 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Jul 01, 2020 at 01:36:34PM +0100, Stestagg wrote:
On Wed, Jul 1, 2020 at 12:18 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Jun 30, 2020 at 10:18:34AM +0100, Stestagg wrote:
This property is nice, as .keys() (for example) can operate like a set and give lots of convenience features. However this doesn't really mean that these objects are trying to *be* sets in any meaningful way,
Please read the PEP:
https://www.python.org/dev/peps/pep-3106/
It was an explicit design choice to make them be set-like, not an accident. The whole point was to offer a set interface.
I'm pretty familiar with that pep, I'm not sure how knowledge of its contents affects this idea/discussion in any material way.
You: I don't think that dict views are trying to be sets.
Me: Dict views were designed right from the start to be like sets, as described in the PEP.
You: I don't think the PEP is relevant.
Okay, whatever you say.
The point being made was that the utility of trying to shoehorn the dict_* types into a Set pigeonhole is quite limited, as practicalities get in the way.
*shrug*
They seem to work pretty well for many uses. Perhaps they could be better, but they are what they are, and there's no point pretending that they aren't capital-S Sets when they clearly are.
py> from collections.abc import Set
py> isinstance({}.items(), Set)
True
This is not an accident, it is not a mistake, it wasn't forced on the language because dicts used to be unordered. It was an intentional design choice.
If you think that *dict views are set-like* being an intentional choice is not important to the discussion, what do you consider important?
As far as I am concerned, making dict views into sequences by making them indexable should be ruled out. They were intended to be set-like, not sequence-like, and that is the end of the story. Adding indexing onto them is bolting on an API that was never part of the design.
I still don't think there are any compelling use-cases for indexing dicts, but if there are, we should offer a getter method on the dict object itself, not try to turn dict views into a hybrid set+sequence. Change my mind.
I don't think inventing new methods for emulating __getitem__ is that useful here; there is a well-established standard for getting items from a container by index, and I think we should use it :).
We can't use dict[index] because the index can also be a key:
d = {'a': None, 'b': None, 0: 999}
d[0]  # return 'a' or 999?
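For concreteness, here is a minimal sketch of how that ambiguity plays out on a stock interpreter today, and what the unambiguous (copying) workaround looks like:

    d = {'a': None, 'b': None, 0: 999}

    print(d[0])                # 999 -- subscripting a dict is always key lookup
    print(list(d)[0])          # 'a' -- positional access needs an explicit copy
    print(list(d.items())[0])  # ('a', None)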
The currently discussed option is for adding `__getitem__` to the dict_keys/dict_items/dict_values types, *NOT* the base `dict` type.
You: We shouldn't invent a new way of indexing containers.
Also you: Let's invent a new way of indexing containers (subscripting on dict views).
So which is it?
Subscripting on the dict object is ruled out because it is ambiguous. Here are three possible choices (there may be others):
- Do nothing, we don't need this functionality, there are no compelling use-cases for it. Passing dict keys to random.choice is not a compelling use-case.
- Add a new interface by making dict views a hybrid set+sequence.
- Add a new getter method on the dict itself.
Aside from the first do-nothing option, one way or the other we're adding a new way of indexing the object we actually want to index, namely the dict.
The choice is between a getter interface, and a subscript interface on a proxy (a dict view). Dicts already offer getter interfaces (dict.get, dict.setdefault) so it is not like calling dict methods is unheard of.
[...]
The point wasn't that: "It's atypical so we can ignore it", but rather: "It's faster/better, but seldom available to be used, here's how we do it for the more common case..."
Okay, thanks for the correction.
-- Steven
Stestagg writes:
Having said that, it's a 70-line change, so fixing that up should be trivial once the change has some agreement behind it.
I still don't see a plausible use case that isn't well served by list(view). The OP's "addressing database columns" is superficially plausible, but the index mapping is *intended* to be unstable (inserting columns at positions). So the only operation where you can actually use an index would be something like

    i = db.columns.index("Name")
    db.columns.insert(i + 1, Address, address_data)

because now all indices > i have been invalidated. But I can't see how having a dict to map names to indices and simply rebuilding the dict on every change to the list of names is going to be a problem for this application. And of course you'd need to add index and insert methods to support that: things are getting complicated.

You say you don't want to copy data out of giant structures. That's a good idea, and of course very much a motivation for the view concept itself. But since views are immutable unless you change the underlying dict, many applications (sorting and creating heaps come to mind) are going to require copying the data. Many applications that don't will require a way to pick out a particular index, which you haven't provided. Yet others process the whole data, which can be iterated (and in particular functools.reduce works as expected). If you don't like the iteration order (eg, because you want an accurate sum() for floats), you need to copy the data (see "sorting and creating heaps").

So what are these use cases? Granted, my imagination may be insufficient, but so far, so are those of the proponents, who all write in terms of "convenience" and other real but very abstract values.

On the downside, dicts may be the single most carefully optimized data type in Python, and that effort has clearly been repaid. But it also required a large change in the layout of the data structure. It seems unlikely that that would happen again, but genius is pretty unpredictable. Do we want to further constrain the implementation by requiring support for what seem to be marginally-useful operations?

Regards,

Steve
On Wed, Jul 1, 2020 at 8:37 AM Stestagg <stestagg@gmail.com> wrote:
1. ordered collections that have some set-like operators added for convenience (conceptually, if not exactly, class ...(Set, Sequence):), but are missing a single method of the full Sequence interface (__getitem__), or

2. Sets that happen to have a stable/deterministic element order
My opinion is that, as of Python 3.7, they are effectively 1, even if the isinstance hooks haven't been updated. I can see why people may think of them as 2. I don't have any desire to change minds on this :)
Semantic quibbling aside, My opinions/reasoning on the different options are the following:
As long as we are semantic quibbling, (1) is not quite right -- they don't just "have some set-like operators", they have the full set of them, and they ARE Sets:

In [44]: d = {"this": 4, "that": 23}
In [45]: isinstance(d.keys(), Set)
Out[45]: True

But see another recent thread on this list - they don't have all the same methods as the builtin set() type. But the ABC doesn't specify those, so they don't need to. However, that doesn't mean that it's somehow critical for them not to grow some extra functionality -- they would still be Sets.

* Add numeric `__getitem__` to `dict_*` view classes:
+1 - This restores code that used to work in Python 2, and makes some things a bit easier. The O(n) behaviour is not ideal, but in my opinion is an acceptable compromise.
Well, not really -- it only lets some things work like they did in Py2: most code I've ported from py2 to py3 where I needed to wrap list() around the views needed more than indexing -- I usually needed the list itself to be a persistent, mutable Sequence, not a view.

I'm +1 on this, 'cause why not? But honestly, the only good use case I've seen is random.choice() -- and while I've needed that, I can't say I've needed it often.
* Add new method to dict class: -1 - This doesn't feel like a good solution to me. Rather than continue to argue about the justification for why, let's just say it's a personal judgement.
Me neither, but it's not just a personal taste -- a getter wouldn't work with random.choice, the only known use case :-) -- unless the new method returned a Sequence-like view that could be indexed. Granted, it would allow an easier way to get a random item, but not as slick as:

    random.choice(a_dict.items())

I think it was Oscar who pointed out that repeated indexing of a view would be worse performance-wise than making a proper list -- so that *might* be considered an attractive nuisance.

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 30/06/2020 04:08, Steven D'Aprano wrote:
We can write inefficient code because dicts do *not* support index access:
for i in range(len(d)):
    for j, item in enumerate(d.items()):
        if j == i:
            do(item)
We certainly can. The above is no different (apart from binding `i` and `j`) from

    for item in d.items():
        do(item)
On 30.06.20 05:08, Steven D'Aprano wrote:
On Tue, Jun 30, 2020 at 11:10:20AM +0900, Inada Naoki wrote:
On Mon, Jun 29, 2020 at 9:12 PM Hans Ginzel <hans@matfyz.cz> wrote:
What are the reasons why the object dict.items() is not subscriptable -- dict.items()[0]?

Because dict is optimized for random access by key and iteration, but not for random access by index.

But we're not talking about *dict*, we're talking about dict.items, which returns a set-like object:
py> from collections.abc import Set
py> isinstance({}.items(), Set)
True
So dict.items isn't subscriptable because it's an unordered set, not a sequence.
What is the reason for `dict.items` to return a set-like object? The values can be non-hashable and in this case the behavior can be surprising:

>>> {'a': []}.items() & {'b'}
TypeError: unhashable type: 'list'

Furthermore [the documentation](https://docs.python.org/3/library/stdtypes.html#dict-views) states the following:
If all values are hashable, so that (key, value) pairs are unique and hashable, then the items view is also set-like.
This sounds like the return type of `dict.items` depended on the actual values contained (which again would be surprising) but actually it doesn't seem to be the case:

>>> from collections.abc import Set
>>> isinstance({'a': []}.items(), Set)
True

`dict.items` could provide all of its "standard" behavior (like membership testing, reversing, etc) without being set-like.

The fact that `==` with `dict.items` raises TypeError for non-hashable values is also a little surprising since it supports membership testing and hence could check `len(self) == len(other) and all(x in self for x in other)` (though that drops the type comparison, but if you cannot have a set, why would you compare them anyway):

>>> self = {'a': []}.items()
>>> other = {('a', 1)}
>>> self == other
TypeError: unhashable type: 'list'
>>> len(self) == len(other) and all(x in self for x in other)
False
>>> other = [('a', [])]
>>> len(self) == len(other) and all(x in self for x in other)
True
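A small sketch of the membership-based fallback described above (the helper name items_equal is made up); dict_items membership testing copes with non-hashable values, so this never raises:

    def items_equal(items_view, pairs):
        """Compare an items view with an iterable of (key, value) pairs,
        tolerating non-hashable values (sketch)."""
        pairs = list(pairs)
        return len(items_view) == len(pairs) and all(p in items_view for p in pairs)

    print(items_equal({'a': []}.items(), [('a', [])]))  # True
    print(items_equal({'a': []}.items(), {('a', 1)}))   # False, no TypeError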
On Wed, Jul 01, 2020 at 01:07:34PM +0200, Dominik Vilsmeier wrote:
What is the reason for `dict.items` to return a set-like object?
This is the third time I've linked to the PEP: https://www.python.org/dev/peps/pep-3106/ More than that, you would have to ask Guido, or the designers of the Java framework that inspired it.
>>> from collections.abc import Set
>>> isinstance({'a': []}.items(), Set)
True
`dict.items` could provide all of its "standard" behavior (like membership testing, reversing, etc) without being set-like.
Its standard behaviour includes set-like unions and intersections, and has done for over a decade, back to 3.0 and 2.7. -- Steven
On 01.07.20 13:32, Steven D'Aprano wrote:
On Wed, Jul 01, 2020 at 01:07:34PM +0200, Dominik Vilsmeier wrote:
What is the reason for `dict.items` to return a set-like object? This is the third time I've linked to the PEP:
Thanks for linking to the PEP (to be fair, the second link arrived after my post). However reading that PEP only increases the number of question marks. The PEP speaks mostly about `dict.keys` and occasionally mentions that similar things can be done for `dict.items`. For example:
Because of the set behavior, it will be possible to check whether two dicts have the same keys by simply testing:

    if a.keys() == b.keys(): ...
which makes perfectly sense. But then, immediately after that, it mentions:
and similarly for .items().
So this means we can do `a.items() == b.items()`. But wait, if the dicts' items are equal, aren't just the dicts themselves equal? So why not just compare `a == b`? Why would I want to compare `a.items() == b.items()`? Then, regarding the equality tests (`==`), the Specification section mentions the following for `dict_keys`:
To specify the semantics, we can specify x == y as:

    set(x) == set(y)   if both x and y are d_keys instances
    set(x) == y        if x is a d_keys instance
    x == set(y)        if y is a d_keys instance
This makes sense again. And then for `dict_items`:

    # As well as the set operations mentioned for d_keys above.
    # However the specifications suggested there will not work if
    # the values aren't hashable. Fortunately, the operations can
    # still be implemented efficiently. For example, this is how
    # intersection can be specified:
    # [...]
    # And here is equality:
    def __eq__(self, other):
        if isinstance(other, (set, frozenset, d_keys)):
            if len(self) != len(other):
                return False
            for item in other:
                if item not in self:
                    return False
            return True
        # [...] handling other cases

This is what I expected of how `dict_items` would handle equality tests. However it doesn't seem to do that:

>>> self = {'a': []}.items()
>>> other = {'b'}
>>> self == other
TypeError: unhashable type: 'list'
>>> def __eq__(self, other):
...     # paste the function here
>>> d_keys = type({}.keys())
>>> __eq__(self, other)
False
>>> self == self
True

I don't know if it was always implemented like that but at least the PEP mentions this caveat and I find it surprising as well.
On Wed, Jul 01, 2020 at 06:07:40PM +0200, Dominik Vilsmeier wrote:
Because of the set behavior, it will be possible to check whether two dicts have the same keys by simply testing: if a.keys() == b.keys(): ...
which makes perfectly sense. But then, immediately after that, it mentions:
and similarly for .items().
So this means we can do `a.items() == b.items()`. But wait, if the dicts' items are equal, aren't just the dicts themselves equal? So why not just compare `a == b`? Why would I want to compare `a.items() == b.items()`?
Because you might not have access to `a` and `b`, but only the views. Views are first-class objects. You can bind them to variables, pass them to functions, put them in lists etc. So if you ever get passed two dict views for some reason, equality works. Regarding your observation that dict views behave poorly if they have unhashable values, I agree, it is both odd and makes them less useful. Possibly at some point between the PEP and the release of the feature something changed, or perhaps it's just an oversight. -- Steven
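For example (a trivial sketch): a function can receive just the views and compare them without ever seeing the dicts they came from:

    def same_keys(keys_a, keys_b):
        # Key views compare like sets, so order does not matter.
        return keys_a == keys_b

    a = {'x': 1, 'y': 2}
    b = {'y': 20, 'x': 10}
    print(same_keys(a.keys(), b.keys()))  # True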
Steven D'Aprano writes:
Regarding your observation that dict views behave poorly if they have unhashable values, I agree, it is both odd and makes them less useful. Possibly at some point between the PEP and the release of the feature something changed, or perhaps it's just an oversight.
I'm not sure what you expect from views, though:

Python 3.8.3 (default, May 15 2020, 14:39:37)
set([[1]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
{'a' : [1]}.keys() <= {'a' : [1], 'b' : 2}.keys()
True
{'a' : [1]}.values() <= {'a' : [1], 'b' : 2}.values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<=' not supported between instances of 'dict_values' and 'dict_values'
{'a' : [1]}.items() <= {'a' : [1], 'b' : 2}.items()
True
So all of the above are consistent with the behavior of sets, except that items views do some part of comparisons themselves to deal with non-hashables which is an extension to set behavior. And values views don't pretend to be sets. The ValuesView ABC is not derived from Set, presumably because dict.values returns something like a multiset. Most set operations on key and item views seem to convert to set and where appropriate return set (which makes sense, since returning a view would require synthesizing a dict to be the view of!) This means you can't do set operations (except comparisons) on items views if any values aren't hashable. I'm not sure what I think about this:
{'a' : 1, 'b' : 2}.values() == {'b' : 2, 'a' : 1}.values()
False
That does seem less than useful. But I guess a multiset comparison requires an auxiliary data structure that can be sorted or a complicated, possibly O(n^2), comparison in place.
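For hashable values, that auxiliary structure can simply be a collections.Counter, which gives the order-insensitive multiset comparison in roughly linear time (a sketch; it does not help when values aren't hashable):

    from collections import Counter

    a = {'a': 1, 'b': 2}
    b = {'b': 2, 'a': 1}

    # "Same values, ignoring keys and order" as a multiset comparison:
    print(Counter(a.values()) == Counter(b.values()))  # True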
On 05.07.20 16:56, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
Regarding your observation that dict views behave poorly if they have unhashable values, I agree, it is both odd and makes them less useful. Possibly at some point between the PEP and the release of the feature something changed, or perhaps it's just an oversight.
I'm not sure what you expect from views, though:
Python 3.8.3 (default, May 15 2020, 14:39:37)
set([[1]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
{'a' : [1]}.keys() <= {'a' : [1], 'b' : 2}.keys()
True
{'a' : [1]}.values() <= {'a' : [1], 'b' : 2}.values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<=' not supported between instances of 'dict_values' and 'dict_values'
{'a' : [1]}.items() <= {'a' : [1], 'b' : 2}.items()
True
So all of the above are consistent with the behavior of sets, except that items views do some part of comparisons themselves to deal with non-hashables which is an extension to set behavior. And values views don't pretend to be sets. The ValuesView ABC is not derived from Set, presumably because dict.values returns something like a multiset.
Most set operations on key and item views seem to convert to set and where appropriate return set (which makes sense, since returning a view would require synthesizing a dict to be the view of!) This means you can't do set operations (except comparisons) on items views if any values aren't hashable.
Well, the point is that this "except comparisons" is not quite true:

>>> i = {'a': []}.items()
>>> s = {('a', 1)}
>>> i == s
TypeError: unhashable type: 'list'

If passed a set as the `other` operand, dict_items seems to decide to convert itself to a set, for no obvious reason since, as you mentioned, it does know how to compare itself to another view containing non-hashable values:

>>> i == {'a': {}}.items()
False

So if you're dealing with items views and want to compare them to a set representing dict items, then you need an extra `try/except` in order to handle non-hashable values in the items view. Not only does this require an extra precautionary step, it also seems strange given that in Python you can compare all sorts of objects without exceptions being raised. I can't think of any other built-in type that would raise an exception on equality (`==`) comparison. dict_items seems to make an exception to that rule.
I'm not sure what I think about this:
{'a' : 1, 'b' : 2}.values() == {'b' : 2, 'a' : 1}.values()
False
That does seem less than useful. But I guess a multiset comparison requires an auxiliary data structure that can be sorted or a complicated, possibly O(n^2), comparison in place.
dict_values seems to rely on object.__eq__ since they always compare unequal (except when it is the same object); this behavior is mentioned by the docs:
An equality comparison between one `dict.values()` view and another will always return `False`. This also applies when comparing `dict.values()` to itself.
Surely that must be a relic from pre-3.7 days where dicts were unordered and hence order-based comparison wouldn't be possible (though PEP 3106 describes an O(n*m) algorithm). However the current behavior is unfortunate because it might trick users into believing that this is a meaningful comparison between distinct objects (given that it works with `dict.keys` and `dict.items`) when it isn't. So why not make dict_values a Sequence, providing __getitem__ and additionally order-based __eq__ comparison?
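To make the request concrete, here is a rough sketch of that kind of sequence behaviour as a thin wrapper class (the name and details are made up; indexing is O(n), matching what the dict layout allows):

    import itertools
    from collections.abc import Sequence

    class ValuesSeq(Sequence):
        """Sequence-style, order-aware view over a dict's values (sketch)."""

        def __init__(self, d):
            self._d = d

        def __len__(self):
            return len(self._d)

        def __getitem__(self, i):
            if i < 0:
                i += len(self._d)
            if not 0 <= i < len(self._d):
                raise IndexError(i)
            # O(n): walk the values in insertion order.
            return next(itertools.islice(self._d.values(), i, None))

        def __eq__(self, other):
            # Order-based comparison, unlike dict_values today.
            return list(self) == list(other)

    v = ValuesSeq({'a': 1, 'b': 2})
    print(v[0], v[-1])                       # 1 2
    print(v == ValuesSeq({'a': 1, 'b': 2}))  # True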
On Tue, Jul 7, 2020 at 10:52 PM Dominik Vilsmeier <dominik.vilsmeier@gmx.de> wrote:
Surely that must be a relic from pre-3.7 days where dicts were unordered and hence order-based comparison wouldn't be possible (though PEP 3106 describes an O(n*m) algorithm). However the current behavior is unfortunate because it might trick users into believing that this is a meaningful comparison between distinct objects (given that it works with `dict.keys` and `dict.items`) when it isn't.
So why not make dict_values a Sequence, providing __getitem__ and additionally order-based __eq__ comparison?
It was rejected in this thread. https://mail.python.org/archives/list/python-dev@python.org/thread/R2MPDTTMJ... -- Inada Naoki <songofacandy@gmail.com>
On Tue, Jul 7, 2020 at 10:52 PM Dominik Vilsmeier <dominik.vilsmeier@gmx.de> wrote:
Surely that must be a relic from pre-3.7 days where dicts were unordered and hence order-based comparison wouldn't be possible (though PEP 3106 describes an O(n*m) algorithm). However the current behavior is unfortunate because it might trick users into believing that this is a meaningful comparison between distinct objects (given that it works with `dict.keys` and `dict.items`) when it isn't.
On 07.07.20 17:37, Inada Naoki wrote:

So why not make dict_values a Sequence, providing __getitem__ and additionally order-based __eq__ comparison?

It was rejected in this thread. https://mail.python.org/archives/list/python-dev@python.org/thread/R2MPDTTMJ...

All right, I see that having __eq__ for dict_values is not up for debate. But what about the other idea, making dict_values a Sequence? It does provide some useful features, like getting the first value of a dict or counting values with `.count`.
On Tue, Jul 7, 2020 at 6:56 AM Dominik Vilsmeier <dominik.vilsmeier@gmx.de> wrote:
Well, the point is that this "except comparisons" is not quite true:
>>> i = {'a': []}.items()
>>> s = {('a', 1)}
>>> i == s
TypeError: unhashable type: 'list'
If passed a set as `other` operand, dict_items seems to decide to convert itself to a set, for no obvious reasons since, as you mentioned, it does know how to compare itself to another view containing non-hashable values:
>>> i == {'a': {}}.items()
False
So if you're dealing with items views and want to compare them to a set representing dict items, then you need an extra `try/except` in order to handle non-hashable values in the items view. Not only does this require an extra precautionary step, it also seems strange given that in Python you can compare all sorts of objects without exceptions being raised. I can't think of any other built-in type that would raise an exception on equality (`==`) comparison. dict_items seems to make an exception to that rule.
I think this really is a bug (well, missing feature). It surely *could* be implemented to work, but maybe not efficiently or easily, so may well not be worth it -- is the use case of comparing a dict_items with another dict_items really helpful?

In fact, my first thought was that the way to do the comparison is to convert to a dict, rather than a set, and then do the compare. And then I realized that dict_items don't exist without a dict anyway, so you really should be comparing the "host" dicts anyway. Which leaves exactly no use cases for this operation.

I suppose we could add that capability by referencing the host dict if two dict_items are compared -- but really, why?

Though maybe a nicer message:

"TypeError: These views cannot be compared: compare the dicts themselves"
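Indeed, when both dicts are at hand, the comparison that never raises is just the dict comparison (a minimal sketch of the difference):

    a = {'a': []}
    b = {'a': []}
    print(a == b)    # True, even though the values are not hashable

    try:
        a.items() == {('a', 1)}
    except TypeError as exc:
        print(exc)   # unhashable type: 'list'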
An equality comparison between one `dict.values()` view and another
will always return `False`. This also applies when comparing `dict.values()` to itself.
Surely that must be a relic from pre-3.7 days where dicts were unordered and hence order-based comparison wouldn't be possible
Well, maybe -- or it's not a relic, and rather a result of the fact that while dicts are now order-preserving, they are still order-not-important. Given that:

In [90]: {'a' : 1, 'b' : 2} == {'b' : 2, 'a' : 1}
Out[90]: True

then:

{'a' : 1, 'b' : 2}.values() == {'b' : 2, 'a' : 1}.values()

should also be True. But that's not possible to do correctly, since it cannot be known that all the same values are there; there is no way to know that they are attached to the same keys. And if it did an order-preserving compare, then values would compare false when the dict is not the same.

So: we *could* either do an order-preserving (by converting to a sequence) or not (by converting to a sequence and then sorting) comparison of the values -- but would that be at all useful? Do we ever need to ask the question: do these two dicts have all the same values without regard to the keys?

Now that I say that -- I suppose the answer is yes -- someone may want to ask that question -- but I think it would be better for them to ask it with their own code, so it's clear what it actually means.

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 07.07.20 19:09, Christopher Barker wrote:
On Tue, Jul 7, 2020 at 6:56 AM Dominik Vilsmeier <dominik.vilsmeier@gmx.de <mailto:dominik.vilsmeier@gmx.de>> wrote:
Well, the point is that this "except comparisons" is not quite true:
>>> i = {'a': []}.items()
>>> s = {('a', 1)}
>>> i == s
TypeError: unhashable type: 'list'
If passed a set as `other` operand, dict_items seems to decide to convert itself to a set, for no obvious reasons since, as you mentioned, it does know how to compare itself to another view containing non-hashable values:
>>> i == {'a': {}}.items()
False
So if you're dealing with items views and want to compare them to a set representing dict items, then you need an extra `try/except` in order to handle non-hashable values in the items view. Not only does this require an extra precautionary step, it also seems strange given that in Python you can compare all sorts of objects without exceptions being raised. I can't think of any other built-in type that would raise an exception on equality (`==`) comparison. dict_items seems to make an exception to that rule.
I think this really is a bug (well, missing feature). It surely, *could* be implemented to work, but maybe not efficiently or easily, so may well not be worth it -- is the use case of comparing a dict_items with another dict_items really helpful?
I think it can be done both efficiently and easily, since the current implementation has chosen that converting the view to a set fulfills these requirements. So all you'd have to do is catch this TypeError inside __eq__ and return False, since a `set` can never contain non-hashable elements. Note that it's the comparison with a `set` that causes this TypeError, not the one between two dict views.
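In pure-Python terms, the suggested behaviour is roughly this (a sketch; the real comparison lives in C, and the helper name is made up):

    def items_view_eq(items_view, other):
        """What dict_items equality could do when compared with a set (sketch)."""
        if isinstance(other, (set, frozenset)):
            try:
                return set(items_view) == other
            except TypeError:
                # A non-hashable value means the view cannot equal any set.
                return False
        return NotImplemented

    print(items_view_eq({'a': 1}.items(), {('a', 1)}))   # True
    print(items_view_eq({'a': []}.items(), {('a', 1)}))  # False instead of raising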
In fact, my first thought was that the way to do the comparison is to convert to a dict, rather than a set, and then do the compare. And then I realized that dict_items don't exist without a dict anyway, so you really should be comparing the "host" dicts anyway. Which leaves exactly no use cases for this operation.
I tested converting the `set` to an items view via `dict(a_set).items()` and then using this for the __eq__ comparison, but it is quite a bit slower than comparing a view to the set directly (if it doesn't contain any non-hashable values); after all it creates a new object, including memory allocation.
-1 on this new functionality

Referring again to PEP 3106, and its support for views - this was inspired by the functionality seen with java.util.Map. Now that dict is sort of like java.util.LinkedHashMap - albeit safe for concurrent operations* - it's worth exploring if there's anything in the expanded API of LinkedHashMap vs Map that could inspire us for additions to Python. I honestly don't see anything there.

It is possible for LinkedHashMap to support cache behavior, through the use of removeEldestEntry (= potentially to pop on Python dict views, which then update the parent dict) and using access-based ordering (so more functionality than in Python currently), but in Java, this has been superseded in real usage by actually using specialized caches such as seen in Guava or Caffeine https://www.javadoc.io/doc/com.github.ben-manes.caffeine/caffeine/latest/ind.... (In turn, it's possible to use a Caffeine-based cache to support a lock-free version of LinkedHashMap - such caches are more general.)

We see the same thing in lru_cache in Python. Although it uses a dict as part of its implementation, this is further wrapped with a linked list (rather cleverly) to track access, whether in the pure Python implementation https://github.com/python/cpython/blob/master/Lib/functools.py#L479 or in the C equivalent.

In general, such composite structures are exactly what the original poster would need. In that case, following the use case of working with database schema, it is just a matter of maintaining two indexes, one by name, the other by position. We should keep the most heavily accessed object type in Python as lightweight as possible, and then build interesting structures around it, just like we always do.

*Due to the interaction of C code, Python code, and the GIL, dict in the CPython reference implementation can be treated as lock free, and this is how it is implemented in Jython with a ConcurrentHashMap.

- Jim

On Tue, Jul 7, 2020 at 11:13 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Tue, Jul 7, 2020 at 6:56 AM Dominik Vilsmeier <dominik.vilsmeier@gmx.de> wrote:
Well, the point is that this "except comparisons" is not quite true:
>>> i = {'a': []}.items()
>>> s = {('a', 1)}
>>> i == s
TypeError: unhashable type: 'list'
If passed a set as `other` operand, dict_items seems to decide to convert itself to a set, for no obvious reasons since, as you mentioned, it does know how to compare itself to another view containing non-hashable values:
>>> i == {'a': {}}.items()
False
So if you're dealing with items views and want to compare them to a set representing dict items, then you need an extra `try/except` in order to handle non-hashable values in the items view. Not only does this require an extra precautionary step, it also seems strange given that in Python you can compare all sorts of objects without exceptions being raised. I can't think of any other built-in type that would raise an exception on equality (`==`) comparison. dict_items seems to make an exception to that rule.
I think this really is a bug (well, missing feature). It surely *could* be implemented to work, but maybe not efficiently or easily, so may well not be worth it -- is the use case of comparing a dict_items with another dict_items really helpful?
In fact, my first thought was that the way to do the comparison is to convert to a dict, rather than a set, and then do the compare. And then I realized that dict_items don't exist without a dict anyway, so you really should be comparing the "host" dicts anyway. Which leaves exactly no use cases for this operation.
I suppose we could add that capability by referencing the host dict if two dict_items are compared -- but really, why?
Though maybe a nicer message:
"TypeError: These views cannot be compared: compare the dicts themselves"
An equality comparison between one `dict.values()` view and another
will always return `False`. This also applies when comparing `dict.values()` to itself.
Surely that must be a relic from pre-3.7 days where dicts were unordered and hence order-based comparison wouldn't be possible
well, maybe -- or it's not a relic, and rather a result of the fact that while dicts are now order-preserving, they are still order-not-important.
Given that:

In [90]: {'a' : 1, 'b' : 2} == {'b' : 2, 'a' : 1}
Out[90]: True
then:

{'a' : 1, 'b' : 2}.values() == {'b' : 2, 'a' : 1}.values()
should also be True. But that's not possible to do correctly, since it cannot be known that all the same values are there; there is no way to know that they are attached to the same keys. And if it did an order-preserving compare, then values would compare false when the dict is not the same.
So: we *could* either do an order preserving (by converting to a sequence) or not (by converting to a sequence and then sorting) comparison of the values -- but would that be at all useful? Do we ever need to ask the question:
Do these two dicts have all the same values without regard to the keys?
Now that I say that -- I suppose the answer is yes -- someone may want to ask that question -- but I think it would be better for them to ask it with their own code, so it's clear what it actually means.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Jim Baker writes:
We should keep the most heavily accessed object type in Python as lightweight as possible, and then build interesting structures around it, just like we always do.
+1 But we're not talking about the dict itself in this subthread. We're talking about views, and the implementation of comparing items containing non-hashable values for equality will be done in a view method which is already worst-case O(n), not a dict method. Note that the current implementation means that an equality comparison can raise, and that's a very bad thing. If you look at the thread Inada-san cited a couple posts back, you'll see some very strong statements in support of the proposition that == should *never* raise. I find that persuasive. I'm still against the implementation of "values view as sequence", since all the alleged use cases are of the form "maybe you'd want this someday for a really big dict so you don't want to listify so you can sort/index/etc." Show me real code, preferably production (ie, used at work -- might even be a throwaway script) and I'll concede there's an argument. But nothing abstract is gonna move me even one Planck unit from -1. (That's just my opinion, as usual YMMV etc)
To add some numbers to this discussion. Using the patch I provided earlier, I tried running some performance comparisons against different call patterns & dictionary sizes.

These tests have two versions:

    direct_index: uses the __getitem__ method in the patch (O(n)) on the view
    list_copy:    converts the view into list(), then does an index lookup

Each score is the fastest result of running the test 10 times. Accessing the last item [-1] using the __getitem__ approach is the *worst-case scenario* for the __getitem__ method.

d.keys()[-1] vs list(d.keys())[-1]

    Items in dict: 1           direct_index: took 1us       list_copy: took 2us          Speedup: 1.85x
    Items in dict: 26          direct_index: took 1us       list_copy: took 2us          Speedup: 2.18x
    Items in dict: 650         direct_index: took 1us       list_copy: took 6us          Speedup: 7.68x
    Items in dict: 15,600      direct_index: took 6us       list_copy: took 119us        Speedup: 19.38x
    Items in dict: 358,800     direct_index: took 129us     list_copy: took 6065us       Speedup: 47.12x
    Items in dict: 7,893,600   direct_index: took 2827us    list_copy: took 210419us     Speedup: 74.43x

random.choice(d.items()) vs random.choice(list(d.items()))

    Items in dict: 1           direct_index: took 3us       list_copy: took 3us          Speedup: 1.23x
    Items in dict: 26          direct_index: took 3us       list_copy: took 5us          Speedup: 1.75x
    Items in dict: 650         direct_index: took 5us       list_copy: took 19us         Speedup: 4.23x
    Items in dict: 15,600      direct_index: took 7us       list_copy: took 2771us       Speedup: 377.55x
    Items in dict: 358,800     direct_index: took 112us     list_copy: took 76626us      Speedup: 682.33x
    Items in dict: 7,893,600   direct_index: took 1667us    list_copy: took 1602864us    Speedup: 961.37x

d.items()[0] vs list(d.items())[0]

    Items in dict: 1           direct_index: took 1us       list_copy: took 1us          Speedup: 1.75x
    Items in dict: 26          direct_index: took 1us       list_copy: took 2us          Speedup: 1.84x
    Items in dict: 650         direct_index: took 1us       list_copy: took 19us         Speedup: 30.87x
    Items in dict: 15,600      direct_index: took 1us       list_copy: took 2867us       Speedup: 4777.73x
    Items in dict: 358,800     direct_index: took 1us       list_copy: took 73268us      Speedup: 112719.85x
    Items in dict: 7,893,600   direct_index: took 1us       list_copy: took 1604230us    Speedup: 2765900.12x

So, use-case discussions aside, you get some pretty convincing performance wins even on small-ish dictionaries if direct indexing is used.

Steve

On Wed, Jul 8, 2020 at 3:11 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Jim Baker writes:
We should keep the most heavily accessed object type in Python as lightweight as possible, and then build interesting structures around it, just like we always do.
+1
But we're not talking about the dict itself in this subthread. We're talking about views, and the implementation of comparing items containing non-hashable values for equality will be done in a view method which is already worst-case O(n), not a dict method.
Note that the current implementation means that an equality comparison can raise, and that's a very bad thing. If you look at the thread Inada-san cited a couple posts back, you'll see some very strong statements in support of the proposition that == should *never* raise. I find that persuasive.
I'm still against the implementation of "values view as sequence", since all the alleged use cases are of the form "maybe you'd want this someday for a really big dict so you don't want to listify so you can sort/index/etc." Show me real code, preferably production (ie, used at work -- might even be a throwaway script) and I'll concede there's an argument. But nothing abstract is gonna move me even one Planck unit from -1. (That's just my opinion, as usual YMMV etc)
I think this comparison is unfair.
d.items()[0] vs list(d.items())[0]
Should be compared with `next(iter(d.items()))`
d.keys()[-1] vs list(d.keys())[-1]
Should be compared with `next(reversed(d.keys()))`, or `next(reversed(d))`.
random.choice(d.items()) vs random.choice(list(d.items()))
Should be compared with `random.choice(items_list)` with `items_list = list(d.items())` setup too. -- Inada Naoki <songofacandy@gmail.com>
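For reference, the kind of "fair" timing being asked for here can be written with timeit, doing the list copy once in the setup step (a sketch with a made-up dict size):

    import timeit

    setup = "d = dict.fromkeys(range(10_000)); items_list = list(d.items())"

    via_iter = timeit.timeit("next(iter(d.items()))", setup=setup, number=100_000)
    via_list = timeit.timeit("items_list[0]", setup=setup, number=100_000)

    print(f"next(iter(d.items())): {via_iter:.3f}s for 100k calls")
    print(f"items_list[0]:         {via_list:.3f}s for 100k calls")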
On Wed, Jul 8, 2020 at 7:13 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think this comparison is unfair.
well, benchmarks always lie ....
d.items()[0] vs list(d.items())[0]
Should be compared with `next(iter(d.items())`
Why? The entire point of this idea is to have indexing syntax -- we can already use the iteration protocol to do this. Not that it's a bad idea to time that too, but since under the hood it's doing the same or less work, I'm not sure what the point is.
d.keys()[-1] vs list(d.keys())[-1]
Should be compared with `next(reversed(d.keys()))`, or `next(reversed(d))`.
Same point - the idea is to have indexing syntax. Though yes, it would be good to see how it compares. I know predicting performance is usually wrong, but this is going to require a full traversal of the underlying keys in either case.
random.choice(d.items()) vs random.choice(list(d.items()))
Should be compared with `random.choice(items_list)` with `items_list = list(d.items())` setup too.
I don't follow this one -- could you explain? What is items_list?

But what this didn't check is how bad the performance could be for what I expect would be a bad performance case -- indexing the keys repeatedly:

    for i in lots_of_indexes:
        a_dict.keys()[i]

vs:

    keys_list = list(a_dict.keys())
    for i in lots_of_indexes:
        keys_list[i]

I suspect it wouldn't take all that many indexes for making a list a better option.

But again, we are back to use cases. As Stephen pointed out, no one has produced an actual production code use case. I myself have wanted this (the random.choice option) -- but it wasn't production code, it was a learning exercise, specifically: http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

I can imagine there would be that use case in real world code, but I haven't had it.

-CHB
-- Inada Naoki <songofacandy@gmail.com>
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Thu, Jul 9, 2020 at 12:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Jul 8, 2020 at 7:13 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think this comparison is unfair.
well, benchmarks always lie ....
d.items()[0] vs list(d.items())[0]
Should be compared with `next(iter(d.items())`
why? the entire point of this idea is to have indexing syntax -- we can already use the iteration protocol top do this. Not that it's a bad idea to time that too, but since under the hood it's doing the same or less work, I'm not sure what the point is.
Because this code solves "take the first item in the dict". If you need to benchmark index access, you should compare your dict.items()[0] and list indexing. You shouldn't create a list from d.items() every loop.
d.keys()[-1] vs list(d.keys())[-1]
Should be compared with `next(reversed(d.keys()))`, or `next(reversed(d))`.
Same point - the idea is to have indexing syntax. Though yes, it would be good to see how it compares. But I know predicting performance is usually wrong, but this is going to require a full traversal of the underlying keys in either case.
Same here. And note that dict and dict views now support reversed().
random.choice(d.items()) vs random.choice(list(d.items()))
Should be compared with `random.choice(items_list)` with `items_list = list(d.items())` setup too.
I don't follow this one -- could you explain? what is items_list ?
I explained `item_list = list(d.items())`. Do it in setup (e.g. before loop.) ("setup" is term used by timeit module.)
But what this didn't check is how bad the performance could be for what I expect would be a bad performance case -- indexing the keys repeatedly:
for i in lots_of_indexes:
    a_dict.keys()[i]
vs:
keys_list = list(a_dict.keys())
for i in lots_of_indexes:
    keys_list[i]
You should do this.
I suspect it wouldn't take all that many indexes for making a list a better option.
If you need to index access many times, creating a list is the recommended way. You shouldn't ignore it. That's why I said it's an unfair comparison. You should compare the "current recommended way" vs the "proposed way".
But again, we are back to use cases. As Stephen pointed out, no one has produced an actual production code use case.
I agree. Regards, -- Inada Naoki <songofacandy@gmail.com>
Re the benchmarks:

- yes, they're micro benchmarks; the intent was to show that the performance can be non-impacting
- no, that doesn't invalidate them (just scopes their usefulness; my sales pitch at the end was slightly over-egging things, but reasonable, imo)
- yes, I ignored direct quadratic behaviour in indexing, as I would never propose that as a goal
- adding in iter()-based comparisons would be interesting, however it doesn't invalidate the list() option, as this is very often used as a solution to the problem

It's true, benchmarks that don't match your incentives and opinions always lie.

As for use-cases, I'll admit that I see this as a fairly minor quality-of-life issue. Finding use-cases is a bit tricky, as the fact that dictionaries have defined order is a recent feature, and I know I (and I'm sure many other people) are still adapting to take advantage of this new functionality. There's also the fact that in python < 3, the results of dict.keys(), values() and items() were Sequences, so the impact of this change may still be being felt (yes, even decades later, the majority of the python I've written to deal with 'messy' data involving lots of dictionaries has been in python 2).

However I've put together a set of cases that I personally would like to work as they appear to (several of these are paraphrases of production code I've worked with):

--
import random
random.choice({'a': 1, 'b': 2}.keys())
'a'

--

import numpy as np
mapping_table = np.array(BIG_LOOKUP_DICT.items())
[[1, 99],
 [2, 23],
 ... ]

--

import sqlite3
conn = sqlite3.connect(":memory:")
params = {'a': 1, 'b': 2}
placeholders = ', '.join(f':{p}' for p in params)
statement = f"select {placeholders}"
print(f"Running: {statement}")
Running: select :a, :b
cur = conn.execute(statement, params.values())
cur.fetchall()
[(1, 2)]

--

# This currently works, but is deprecated in 3.9
import random
dict(random.sample({'a': 1, 'b': 2}.items(), 2))
{'b': 2, 'a': 1}

--

def min_max_keys(d):
    min_key, min_val = d.items()[0]
    max_key, max_val = min_key, min_val
    for key, value in d.items():
        if value < min_val:
            min_key = key
            min_val = value
        if value > max_val:
            max_key = key
            max_val = value
    return min_key, max_key
min_max_keys({'a': 1, 'b': 2, 'c': -9999})
min_max_keys({'a': 'x', 'b': 'y', 'c': 'z'})

--

import os
users = {'cups': 209, 'service': 991}
os.setgroups(users.values())
Obviously, Python is a general-purpose, Turing-complete language, so each of these options can be written in other ways. But it would be nice if the simple, readable versions also worked :D

The idea that there are future, unspecified changes to dicts() that may or may not be hampered by allowing indexing sounds like FUD to me, unless there are concrete references?

Steve

On Thu, Jul 9, 2020 at 3:04 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Thu, Jul 9, 2020 at 12:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Jul 8, 2020 at 7:13 PM Inada Naoki <songofacandy@gmail.com>
wrote:
I think this comparison is unfair.
well, benchmarks always lie ....
d.items()[0] vs list(d.items())[0]
Should be compared with `next(iter(d.items())`
why? the entire point of this idea is to have indexing syntax -- we can already use the iteration protocol top do this. Not that it's a bad idea to time that too, but since under the hood it's doing the same or less work, I'm not sure what the point is.
Because this code solves "take the first item in the dict".
If you need to benchmark index access, you should compare your dict.items()[0] and list indexing. You shouldn't create a list from d.items() every loop.
d.keys()[-1] vs list(d.keys())[-1]
Should be compared with `next(reversed(d.keys()))`, or `next(reversed(d))`.
Same point - the idea is to have indexing syntax. Though yes, it would be good to see how it compares. But I know predicting performance is usually wrong, but this is going to require a full traversal of the underlying keys in either case.
Same here. And note that dict and dict views now supports reversed().
random.choice(d.items()) vs random.choice(list(d.items()))
Should be compared with `random.choice(items_list)` with `items_list = list(d.items())` setup too.
I don't follow this one -- could you explain? what is items_list ?
I explained `item_list = list(d.items())`. Do it in setup (e.g. before loop.) ("setup" is term used by timeit module.)
But what this didn't check is how bad the performance could be for what
I expect would be a bad performance case -- indexing teh keys repeatedly:
for i in lots_of_indexes:
    a_dict.keys()[i]
vs:
keys_list = list(a_dict.keys())
for i in lots_of_indexes:
    keys_list[i]
You should do this.
I suspect it wouldn't take all that many indexes for making a list a better option.
If you need to index access many times, creating a list is the recommended way. You shouldn't ignore it. That's why I said it's an unfair comparison. You should compare the "current recommended way" vs the "proposed way".
But again, we are back to use cases. As Stephen pointed out, no one has produced an actual production code use case.
I agree.
Regards,
-- Inada Naoki <songofacandy@gmail.com>
On Thu, Jul 9, 2020 at 10:29 AM Stestagg <stestagg@gmail.com> wrote:
The idea that there are future, unspecified changes to dicts() that may or may not be hampered by allowing indexing sounds like FUD to me, unless there are concrete references?
IIRC, first, dicts started preserving order as an implementation detail (implementation borrowed from PyPy?). Then there was a discussion about whether it should be declared that it was a feature that could be counted on -- and it was decided that yes, it was.

So yes -- we can count on it, and so don't be worried about a future re-implementation.

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 10 Jul 2020, at 05:33, Christopher Barker <pythonchb@gmail.com> wrote:
On Thu, Jul 9, 2020 at 10:29 AM Stestagg <stestagg@gmail.com <mailto:stestagg@gmail.com>> wrote: The idea that there are future, unspecified changes to dicts() that may or may not be hampered by allowing indexing sounds like FUD to me, unless there are concrete references?
IIRC, first, dicts started preserving order as an implementation detail (implementation borrowed from PyPy?). Then there was a discussion about whether it should be declared that it was a feature that could be counted on -- and it was decided that yes, it was.
So yes -- we can count on it, and so don't be worried about a future re-implementation.
Adding indexing to views adds another requirement to the dict implementation: indexing for sequences at least suggests that access is O(1). That makes it impossible to use, as an example, a linked list to preserve insertion order.

Ronald

— Twitter / micro.blog: @ronaldoussoren

Blog: https://blog.ronaldoussoren.net/
Adding indexing to views adds another requirement to the dict implementation:
Yes, that's the proposed change
indexing for sequences at least suggests that access is O(1).
That makes it impossible to use, as an example, a linked list to preserve
insertion order.
The docs for the Sequence abc explicitly talk about situations where index is not O(1), and say that while not ideal, this is valid (mentioning linked-lists by name). The current existing `dict` implementation does not support O(1) index lookup from a view, (as discussed previously), so a change to linked-lists would not change this runtime O characteristic. Steve
On 10 Jul 2020, at 14:08, Stestagg <stestagg@gmail.com> wrote:
Adding indexing to views adds another requirement to the dict implementation:
Yes, that's the proposed change
indexing for sequences at least suggests that access is O(1). That makes it impossible to use, as an example, a linked list to preserve insertion order.
The docs for the Sequence abc explicitly talk about situations where index is not O(1), and say that while not ideal, this is valid (mentioning linked-lists by name).
It's a quality-of-implementation issue. A sequence where item access is not O(1) is not invalid, but surprising and likely to cause performance problems, because users tend to expect that item access is O(1).
The current existing `dict` implementation does not support O(1) index lookup from a view, (as discussed previously), so a change to linked-lists would not change this runtime O characteristic.
The current views don’t support indexing at all, that’s the whole point of this thread. Ronald — Twitter / micro.blog: @ronaldoussoren Blog: https://blog.ronaldoussoren.net/
On Fri, Jul 10, 2020 at 1:28 PM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
On 10 Jul 2020, at 14:08, Stestagg <stestagg@gmail.com> wrote:
Adding indexing to views adds another requirement to the dict implementation:
Yes, that's the proposed change
indexing for sequences at least suggests that access is O(1).
That makes it impossible to use, as an example, a linked list to preserve
insertion order.
The docs for the Sequence abc explicitly talk about situations where index is not O(1), and say that while not ideal, this is valid (mentioning linked-lists by name).
It's a quality-of-implementation issue. A sequence where item access is not O(1) is not invalid, but surprising and likely to cause performance problems, because users tend to expect that item access is O(1).
Yes, as the docs say. But this has been raised a few times in the thread, and it orthogonal to the 'we may want to make arbitrary changes to how dicts work in the future' debate.
The current existing `dict` implementation does not support O(1) index lookup from a view, (as discussed previously), so a change to linked-lists would not change this runtime O characteristic.
The current views don’t support indexing at all, that’s the whole point of this thread.
The intent of my statement was: The current implementation of dict does not allow any reasonable implementation of dict_keys/dict_values/dict_items with O(1) index behaviour. I.e. the performance scalability of any future changes to the dict_* types that add direct indexing is *not* likely to be adversely affected by changes to the implementation of dict(), unless somehow iteration in O(n) time becomes impossible.
On Fri, 10 Jul 2020 at 13:47, Stestagg <stestagg@gmail.com> wrote:
The intent of my statement was: The current implementation of dict does not allow any reasonable implementation of dict_keys/dict_values/dict_items with O(1) index behaviour. I.e. the performance scalability of any future changes to the dict_* types that add direct indexing is *not* likely to be adversely affected by changes to the implementation of dict(), unless somehow iteration in O(n) time becomes impossible.
So you're saying that you'd like to be able to index dict.keys()/dict.values()/dict.items(), but are fine with O(n) performance.

OK,

    list(d.items())[n]

does *precisely* what you are asking for. So you're asking for a less verbose spelling of that. That would be something we could discuss, except for the fact that

    the_items = list(d.items())
    the_items[n]

has *better* amortised performance than that, if I can index multiple times using the same snapshot of the list.

It's pretty much certain that you're not going to succeed if you are requesting a new dictionary feature that's guaranteed to have worse performance than the current idiomatic Python way of achieving the same result...

So either you are failing very badly at explaining what you want and why it's useful, or you should probably give up.

Paul
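The amortisation point is easy to see in miniature (a sketch): one O(n) copy, then any number of O(1) lookups against the same snapshot:

    d = dict.fromkeys(range(1000))

    the_items = list(d.items())   # one O(n) copy
    for n in (0, 10, 500, -1):    # each lookup is O(1) on the snapshot
        print(the_items[n])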
All valid points; I'd recommend catching up on the entire thread, it'll be a lot quicker than re-iterating them here.

The highly condensed tl;dr version is that list(d.items())[n] is sometimes the fastest option, but sometimes not; it's not always faster (sometimes many thousands of times slower) nor the most readable version, but is more readable than the next(iter(dict.keys())) alternative that's been mooted.

On Fri, Jul 10, 2020 at 2:20 PM Paul Moore <p.f.moore@gmail.com> wrote:
On Fri, 10 Jul 2020 at 13:47, Stestagg <stestagg@gmail.com> wrote:
The intent of my statement was: The current implementation of dict does not allow any reasonable implementation of dict_keys/dict_values/dict_items with O(1) index behaviour. I.e. the performance scalability of any future changes to the dict_* types that add direct indexing is *not* likely to be adversely affected by changes to the implementation of dict(), unless somehow iteration in O(n) time becomes impossible.
So you're saying that you'd like to be able to index dict.keys()/dict.values()/dict.items(), but are fine with O(n) performance.
OK,
list(d.items())[n]
does *precisely* what you are asking for. So you're asking for a less verbose spelling of that. That would be something we could discuss, except for the fact that
the_items = list(d.items())
the_items[n]
has *better* amortised performance than that, if I can index multiple times using the same snapshot of the list.
It's pretty much certain that you're not going to succeed if you are requesting a new dictionary feature that's guaranteed to have worse performance than the current idiomatic Python way of achieving the same result...
So either you are failing very badly at explaining what you want and why it's useful, or you should probably give up. Paul
On Fri, 10 Jul 2020 at 14:26, Stestagg <stestagg@gmail.com> wrote:
All valid points, I'd recommend catching up on the entire thread, it'll be a lot quicker than re-iterating them here.
I have been reading the thread (admittedly skimming). I've seen nothing yet to suggest that there's a useful change to Python in here.
The highly condensed tl;dr version is that list(d.items())[n] is sometimes the fastest option, but sometimes not; it's not always faster (sometimes many thousands of times slower) nor the most readable version, but it is more readable than the next(iter(dict.keys())) alternative that's been mooted.
I don't dispute any of that (except maybe the readability comment, which is subjective). But that's still just saying there are different ways *in current Python* to do this, with different performance characteristics. I guess if that's all you were saying, then I can just respond "yes". Paul
On 7/10/2020 9:20 AM, Paul Moore wrote:
On Fri, 10 Jul 2020 at 13:47, Stestagg <stestagg@gmail.com> wrote:
The intent of my statement was: The current implementation of dict does not allow any reasonable implementation of dict_keys/dict_values/dict_items with O(1) index behaviour. I.e. the performance scalability of any future changes to the dict_* types that add direct indexing is *not* likely to be adversely affected by changes to the implementation of dict(), unless somehow iteration in O(n) time becomes impossible. So you're saying that you'd like to be able to index dict.keys()/dict.values()/dict.items(), but are fine with O(n) performance.
OK,
list(d.items())[n]
Or
next(itertools.islice(d.items(), n, None))
There are multiple O(n) ways of doing this. If we're not going to require O(1), then I don't see the point of adding another way of achieving the same thing. And I don't think we should require O(1): I wouldn't want to constrain any implementation for a niche use case.
Eric
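(Both O(n) spellings side by side, on a toy dict, just for concreteness:)

from itertools import islice

d = {"a": 1, "b": 2, "c": 3}
n = 1
list(d.items())[n]                # ('b', 2) -- builds the whole list first
next(islice(d.items(), n, None))  # ('b', 2) -- walks n items, no copy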
On 10 Jul 2020, at 15:38, Eric V. Smith <eric@trueblade.com> wrote:
On 7/10/2020 9:20 AM, Paul Moore wrote:
On Fri, 10 Jul 2020 at 13:47, Stestagg <stestagg@gmail.com> wrote:
The intent of my statement was: The current implementation of dict does not allow any reasonable implementation of dict_keys/dict_values/dict_items with O(1) index behaviour. I.e. the performance scalability of any future changes to the dict_* types that add direct indexing is *not* likely to be adversely affected by changes to the implementation of dict(), unless somehow iteration in O(n) time becomes impossible. So you're saying that you'd like to be able to index dict.keys()/dict.values()/dict.items(), but are fine with O(n) performance.
OK,
list(d.items())[n]
Or next(itertools.islice(d.items(), n, None))
There are multiple O(n) ways of doing this. If we're not going to require O(1), then I don't see the point of adding another way of achieving the same thing. And I don't think we should require O(1): I wouldn't want to constrain any implementation for a niche use case.
I agree. Ronald — Twitter / micro.blog: @ronaldoussoren Blog: https://blog.ronaldoussoren.net/
On Thu, Jul 09, 2020 at 06:26:41PM +0100, Stestagg wrote:
As for use-cases, I'll admit that I see this as a fairly minor quality-of-life issue.
Thank you for that comment. I too have sometimes proposed what I think of as "minor quality-of-life" enhancements, and had them shot down. It stings a bit, and can be frustrating, but remember it's not personal.
The difficulty is that our QOL enhancement is someone else's bloat. Every new feature is something that has to be not just written once, but maintained, documented, tested and learned. Every new feature steepens the learning curve for the language; every new feature increases the size of the language, increases the time it takes to build, increases the time it takes for the tests to run.
This one might only be one new method on three classes, but it all adds up, and we can't add *everything*.
(I recently started writing what was intended to be a fairly small class, and before I knew it I was up to six helper classes, nearly 200 methods, and approaching 1500 LOC, for what was conceptually intended to be a *lightweight* object. I've put this aside to think about it for a while, to decide whether to start again from scratch with a smaller API, or just remove the word "lightweight" from the description :-)
So each new feature has to carry its own weight. Even if the weight in effort to write, effort to learn, code, tests and documentation is small, the benefit gained must be greater or it will likely be rejected.
"Nice to have" is unlikely to be enough, unless you happen to be one of the most senior core devs scratching your own itch, and sometimes not even then.
import numpy as np
mapping_table = np.array(BIG_LOOKUP_DICT.items())
[[1, 99], [2, 23], ... ]
That worked in Python 2 by making a copy of the dict items into a list. It will equally work in Python 3 by making a copy of the items into a list. And I expect that even if dict.items() was indexable, numpy would still have to copy the items. I don't know how numpy works in detail, but I doubt that it will be able to use a view of a hash table internals as a fast array without copying. Bottom line here is that adding indexing to dict views won't save you either time or memory or avoid making a copy in this example. All it will save you is writing an explicit call to `list`. And we know what the Zen says about being explicit.
import sqlite3
conn = sqlite3.connect(":memory:")
params = {'a': 1, 'b': 2}
placeholders = ', '.join(f':{p}' for p in params)
statement = f"select {placeholders}"
print(f"Running: {statement}")
Running: select :a, :b
cur = conn.execute(statement, params.values())
cur.fetchall()
[(1, 2)]
Why are you passing a values view when you could pass the dict itself? Is there some reason you don't do this?
# statement = "select :a, :b"
py> cur = conn.execute(statement, params)
py> cur.fetchall()
[(1, 2)]
I'm not an expert on sqlite, so I might be missing something here, but I would have expected that this is the preferred solution. It matches the example in the docs, which uses a dict.
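(For the record, a self-contained version of that: sqlite3 accepts the dict directly for named parameters.)

import sqlite3

conn = sqlite3.connect(":memory:")
params = {"a": 1, "b": 2}
cur = conn.execute("select :a, :b", params)   # the dict itself, no .values()
print(cur.fetchall())                         # [(1, 2)]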
# This currently works, but is deprecated in 3.9
dict(random.sample({'a': 1, 'b': 2}.items(), 2))
{'b': 2, 'a': 1}
I suspect that even if dict items were indexable, Raymond Hettinger would not be happy with random.sample on dict views.
def min_max_keys(d):
    min_key, min_val = d.items()[0]
    max_key, max_val = min_key, min_val
    for key, value in d.items():
Since there's no random access to the items required, there's not really any need for indexing. You only need the first item, then iteration. So the natural way to write that is with iter() and next().
I suspect that the difference in perspective here is that (perhaps?) you still think of concrete sequences and indexing as fundamental, while Python 3 has moved in the direction of making the iterator protocol and iterators fundamental.
You have a hammer (indexing), so you want views to be nails so you can hammer them. But views are screws, and need a screwdriver (iter and next). I'm not disputing that sometimes a hammer is the right solution, I'm just dubious that having views be a mutant nail+screw hybrid is worth it, just so you can pass a view to random.sample.
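(A sketch of the iterator-based version being described. The quoted function above was cut off, so the body here is a guess at its intent: find the keys with the smallest and largest values, assuming a non-empty dict.)

def min_max_keys(d):
    items = iter(d.items())
    min_key, min_val = next(items)        # first item via next(), no indexing
    max_key, max_val = min_key, min_val
    for key, value in items:
        if value < min_val:
            min_key, min_val = key, value
        if value > max_val:
            max_key, max_val = key, value
    return (min_key, min_val), (max_key, max_val)

min_max_keys({'a': 3, 'b': 1, 'c': 7})    # (('b', 1), ('c', 7))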
The idea that there are future, unspecified changes to dicts() that may or may not be hampered by allowing indexing sounds like FUD to me, unless there are concrete references?
Nobody is suggesting that dicts will cease to be order preserving in the future. But order preserving does not necessarily require there to be easy indexing. It just requires that the order of iteration is the same as order of insertion, not that we can jump to the 350th key without stepping through the previous 349 keys.
Dicts have gone through a number of major redesigns and many careful tweaks over the years to get the best possible performance. The last major change was to add *order-preserving* behaviour, not indexing. The fact that they can be indexed in reasonable time is not part of the design, just an accident of implementation, and being an accident, it could change in the future.
This feature would require upgrading that accident of implementation to a guarantee. If the Python world were awash with dozens of compelling, strong use-cases for indexing dicts, then we would surely be willing to make that guarantee. But the most compelling use-case we've seen so far is awfully weak indeed: choose a random item from a dict.
So the cost-benefit calculation goes (in my opinion) something like this.
1. Risk of eliminating useful performance enhancements in the future: small.
2. Benefit gained: even smaller.
That's not FUD. It's just a simple cost-benefit calculation. You can counter it by finding good use-cases that are currently difficult and annoying to solve. Using an explicit call to list is neither difficult nor annoying :-)
-- Steven
I too have sometimes proposed what I think of as "minor quality-of-life" enhancements, and had them shot down. It stings a bit, and can be frustrating, but remember it's not personal.
I don't mind the shooting down, as long as the arguments make sense :D. It seems like we're both in agreement that the cost of implementing & maintaining the change is non-zero. I'm asserting that the end benefit of this change is also non-zero, and in my opinion higher than the cost. But I also acknowledge that the benefit may not be enough to overcome the inertia behind getting a change made. The reason I'm persevering is to try and weed out the immaterial or incorrect reasons for not making this change, so hopefully we're left with a good understanding of the pros-cons.
The difficulty is that our QOL enhancement is someone else's bloat. Every new feature is something that has to be not just written once, but maintained, documented, tested and learned. Every new feature steepens the learning curve for the language; every new feature increases the size of the language, increases the time it takes to build, increases the time it takes for the tests to run.
Yeah, I can see that more code = more code overhead, and that's got to be justified. I don't believe that this feature would steepen the language learning curve however, but actually help to shallow it slightly (Explained more below)
This one might only be one new method on three classes, but it all adds up, and we can't add *everything*.
(I recently started writing what was intended to be a fairly small class, and before I knew it I was up to six helper classes, nearly 200 methods, and approaching 1500 LOC, for what was conceptually intended to be a *lightweight* object. I've put this aside to think about it for a while, to decide whether to start again from scratch with a smaller API, or just remove the word "lightweight" from the description :-)
Absolute method count is seldom a standalone indicator of a dead end. Often classes with many methods (especially if they're accompanied by lots of helpers) are a side-effect of some abstraction failure. Usually when I'm consulting on fixing projects with these characteristics, it's a case of the developers not correctly choosing their abstractions, or letting them leak. It sounds like you let that one get away from you; chalk it up to a learning experience.
The "It walks like a zoo, squawks/lows/grunts/chitters like a zoo" problem is very real. This is more of a "It used to be a duck. Now it walks like a duck, but doesn't sound like a duck because it's a coot" problem.
So each new feature has to carry its own weight. Even if the weight in effort to write, effort to learn, code, tests and documentation is small, the benefit gained must be greater or it will likely be rejected.
"Nice to have" is unlikely to be enough, unless you happen to be one of the most senior core devs scratching your own itch, and sometimes not even then.
import numpy as np mapping_table = np.array(BIG_LOOKUP_DICT.items()) [[1, 99], [2, 23], ... ]
That worked in Python 2 by making a copy of the dict items into a list. It will equally work in Python 3 by making a copy of the items into a list.
And I expect that even if dict.items() was indexable, numpy would still have to copy the items. I don't know how numpy works in detail, but I doubt that it will be able to use a view of a hash table internals as a fast array without copying.
Bottom line here is that adding indexing to dict views won't save you either time or memory or avoid making a copy in this example. All it will save you is writing an explicit call to `list`. And we know what the Zen says about being explicit.
What making dict_* types a Sequence will do is make this code (as written) behave:
1. like it used to do
2. like most people seem to expect it to.
Currently numpy does something that I consider unexpected (I'm sure, given your previous responses, you'd disagree with this, but from canvassing Python devs, I feel like many people share my opinions here) with that code.
import sqlite3 conn = sqlite3.connect(":memory:") params = {'a': 1, 'b': 2} placeholders = ', '.join(f':{p}' for p in params) statement = f"select {placeholders}" print(f"Running: {statement}") Running: select :a, :b cur=conn.execute(statement, params.values()) cur.fetchall() [(1, 2)]
Why are you passing a view to a values when you could pass the dict itself? Is there some reason you don't do this?
# statement = "select :a, :b" py> cur=conn.execute(statement, params) py> cur.fetchall() [(1, 2)]
I'm not an expert on sqlite, so I might be missing something here, but I would have expected that this is the prefered solution. It matches the example in the docs, which uses a dict.
You're right, this was a version of code that I've written before for a different database driver (which didn't support named parameters), and sqlite3 does support that, so that's my mistake. As mentioned elsewhere, producing bullet-proof use-cases on demand can be tough.
# This currently works, but is deprecated in 3.9
dict(random.sample({'a': 1, 'b': 2}.items(), 2)) {'b': 2, 'a': 1}
I suspect that even if dict items were indexable, Raymond Hettinger would not be happy with random.sample on dict views.
I don't know why? I can understand deprecating sets here as they're unordered, so the results when seed() has been called are not consistent. I don't see why Raymond would object to allowing sampling an ordered container, one from which the results will be reproducible.
def min_max_keys(d): min_key, min_val = d.items()[0] max_key, max_val = min_key, min_val for key, value in d.items():
Since there's no random access to the items required, there's not really any need for indexing. You only need the first item, then iteration. So the natural way to write that is with iter() and next().
Yeah, it's possible to write this in any number of ways. I canvassed some opinions from Python developers on how to do this sort of thing, and 4 out of 5 of the different responses I got wouldn't currently work, because their suggested implementations relied on `.keys()` being indexable, or being Sequences.
I suspect that the difference in perspective here is that (perhaps?) you still thing of concrete sequences and indexing as fundamental, while Python 3 has moved in the direction of making the iterator protocol and iterators as fundamental.
This is the proposal: I want to make these things Sequences. These things (the results of dict.keys() for example) used to look like, and act like, nails. Then suddenly they looked like and acted like screws (for good reasons) -- but let's say screws with smooth heads, as many people think they are still nails. Now there's a simple way to make them act like nails again; using "but they're screws, so they can't be nails" as a counter doesn't hold water.
You have a hammer (indexing), so you want views to be nails so you can hammer them. But views are screws, and need a screwdriver (iter and next).
I have a proposal: make these things indexable, so people can hammer them in if they desire.
...not that we can jump to the 350th key without stepping through the previous 349 keys.
The existing dictionary memory layout doesn't support direct indexing (without stepping), so this functionality is not being added as a requirement.
Dicts have gone through a number of major redesigns and many careful tweaks over the years to get the best possible performance. The last major change was to add *order-preserving* behaviour, not indexing. The fact that they can be indexed in reasonable time is not part of the design, just an accident of implementation, and being an accident, it could change in the future.
To throw the request back, what's the use case you're considering here? Why would dictionary iteration be made slower in the future?
This feature would require upgrading that accident of implementation to a guarantee. If the Python world were awash with dozens of compelling, strong use-cases for indexing dicts, then we would surely be willing to make that guarantee. But the most compelling use-case we've seen so far is awfully weak indeed: choose a random item from a dict.
So the cost-benefit calculation goes (in my opinion) something like this.
1. Risk of eliminating useful performance enhancements in the future: small.
No use cases on how/why this would be a thing
2. Benefit gained: even smaller.
Some weak use cases on why this would help.
That's not FUD. It's just a simple cost-benefit calculation. You can counter it by finding good use-cases that are currently difficult and annoying to solve. Using an explicit call to list is neither difficult nor annoying :-)
Definitely -1 after reading all the discussion. The strongest argument I've seen is: `list(d.items())` adds six characters. So far, there's absolutely nothing described that cannot easily be accomplished with list(). Moreover, even apart from the work of maintaining the feature itself, the attractive nuisance of getting O(N) behavior rather than O(1) seems like a strong anti-feature.
I'm -1 too. Making ordered containers sequence-like only for random.choice is the wrong idea. If random.choice should support non-sequence ordered containers, just propose that for random.choice. And if the proposal is rejected, you can do it yourself with helper functions.
import random
from itertools import islice

def nth(iterable, n):
    return next(islice(iterable, n, None))

def choice_from_container(c):
    return nth(c, random.randrange(len(c)))

d = dict.fromkeys(range(10000))
choice_from_container(d)          # 61
choice_from_container(d)          # 9858
choice_from_container(d.keys())   # 2436
choice_from_container(d.items())  # (7685, None)
-- Inada Naoki <songofacandy@gmail.com>
I had a nice note almost written yesterday, but now there's been a bunch more discussion, so I'm going to try to hit a few points that have been recently made.
TL;DR: I personally think it would be a nice feature to add indexing to the dict views. But to be fair, the only real use case I've seen is random.choice(), so it's really not very compelling. And it seems the consensus is coming down on the side of no.
However, I find I disagree with many of the "no" arguments, other than "it's too much churn for little gain", so I'm going to make my points, thinking that maybe that trade-off will be rethought. So if you're firmly on the side of no, I guess there's no point in reading this, but I do hope it will be considered.
On Fri, Jul 10, 2020 at 12:45 PM David Mertz <mertz@gnosis.cx> wrote:
The strongest argument I've seen is: `list(d.items())` adds six characters.
That's a misrepresentation: the reason to prefer not to use the list call is twofold:
1) Matching our mental model / usability: if I want the nth item (or a random item) from a dict, I want to ask for that -- I don't want to make a list, just to index it and throw it away. The list(d.items()) idiom is the right one if I actually need a list -- it's a bit awkward to have to make a list, just to throw it away.
2) Performance: making an entire list just to get one item out is a potentially expensive operation. Again, for the limited use cases, probably not a big deal -- I'm having a really hard time imagining an application where that would be a bottleneck -- but it is *a* reason, if not a compelling one.
Moreover, even apart from the work of maintaining the feature itself, the attractive nuisance of getting O(N) behavior rather than O(1) seems like a strong anti-feature.
Yes, this is the only anti-feature I've seen described in this thread. But it's only an anti-feature for the use case of making multiple indexing operations from the same dict view, without changes to the dict. It's a feature if you need to make only one (or very few) indexing operations from the same non-mutated dict. After all, that's exactly why we have the dict views in the first place: you don't want to have to make an unnecessary copy if you don't need to. That clearly applies to iteration and membership: why not to the "getting one item out" case?
But of course, it is indeed an attractive nuisance in some cases, which is different than the other view use cases: they are the same or more efficient than the old "make a list" approach, whereas this would be more efficient in some cases, and less in others -- so users would need to evaluate the trade offs, and many wouldn't even know they should think about that.
Overall though, I think that folks would still need to make a list if they wanted to do any other MutableSequence operations (or be clearly working with a copy), so I don't think there's all that much danger in this feature being accidentally used.
On Fri, Jul 10, 2020, 1:07 PM Stestagg <stestagg@gmail.com> wrote:
I don't mind the shooting down, as long as the arguments make sense :D.
I agree here for sure: I've no problem with folks having a different opinion about the value of the trade offs, but I think the trade offs have been misrepresented -- hence this post ... (no, I don't think anyone's misrepresenting anything on purpose -- this is about the technical issues)
It seems like we're both in agreement that the cost of implementing &
maintaining the change is non-zero.
Another note there: one of the big costs is implementation and documentation. But this is Open Source: we can all decide that a feature is a good idea, but it'll never get done unless someone(s) actually decides it is worth it, to them, to write the code and docs. If no one does, then it's not going to happen. So that part of the cost is self limiting. Granted, once written, it needs to be maintained, but that is a lesser cost, at least in this case, where it's not a whole new object or anything.
I don't believe that this feature would steepen the language learning
curve however, but actually help to shallow it slightly (Explained more below)
I agree here. Granted, it's again only the one use case, but when my newbie students have to figure out how to get a random key from a dict, there is no question that:
random.choice(the_dict.keys())
is a little easier than:
random.choice(list(the_dict.keys()))
and a lot easier than (untested):
idx = random.randrange(len(the_dict))
it = iter(the_dict.keys())
for _ in range(idx):
    next(it)
choice = next(it)
getting an arbitrary one is a bit easier:
choice = next(iter(the_dict.keys()))
In practice, I use this as a teaching opportunity -- but the fact that it IS a teaching opportunity kind of makes my point. Granted, if this feature were there, there'd be the need to teach folks about why they want to avoid the attractive nuisance discussed above -- so I'll call it a net-zero.
import numpy as np
mapping_table = np.array(BIG_LOOKUP_DICT.items())
One note on numpy: the numpy array() function is very much designed for Sequences: partly due to history, but also for convenience and performance -- it needs to know the size and data type of the array it is going to create before it creates it. And honestly, I'm not sure that array() would work with the dict views anyway if we added indexing -- we'd have to look at the logic inside array(). And numpy has fromiter() for working with iterators. In short: "it would work with numpy" is NOT a reason to add this feature :-)
And I expect that even if dict.items() was indexable, numpy would
still have to copy the items. I don't know how numpy works in detail, but I doubt that it will be able to use a view of a hash table internals as a fast array without copying.
Of course not -- but it makes a copy of the items in a list too -- so the extra copy for the list is still there. (numpy works with homogeneous lower-level data types -- the actual bytes of the C datatype -- so it is always copying the values when it makes an array out of Python types, except for the numpy object dtype, but that's a special case.)
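(To make that concrete, a small sketch -- the dict here is just a toy stand-in for BIG_LOOKUP_DICT:)

import numpy as np

d = {1: 99, 2: 23}
# Python 3: give array() a real sequence to size and copy from.
table = np.array(list(d.items()))                # array([[ 1, 99], [ 2, 23]])
# fromiter() handles flat iterables of scalars, e.g. just the values.
vals = np.fromiter(d.values(), dtype=np.int64)   # array([99, 23])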
What making dict_* types a Sequence will do is make this code (as written) behave:
For my part, I'm not asking for the dict views to be full blown Sequences -- I think that *would* be an attractive nuisance. I'm thinking only adding indexing.
still think of concrete sequences and indexing as fundamental, while
Python 3 has moved in the direction of making the iterator protocol and iterators as fundamental.
That is indeed a change in Python over the years, but I don't think it was a practicality-driven change: in short, don't make copies you don't need to make. So I don't think we should use "Iterators are fundamental to Python" as a reason to NOT add Sequence-like behavior.
You have a hammer (indexing), so you want views to be nails so you can
hammer them. But views are screws, and need a screwdriver (iter and next).
But there are, in carpentry, many places where you can use either a screw or a nail, and some of us have even been known to hammer a screw in, even if we had a screwdriver handy, and knew what the heck we were doing. That is the argument here: when the screw can be well used, in a particular case, by hitting it with a hammer, then why not let me do that. To take the analogy way too far: don't take the hammer out of my toolbox just because there are some screwdrivers in there.
The existing dictionary memory layout doesn't support direct indexing (without stepping), so this functionality is not being added as a requirement.
But it does make it much more efficient if the stepping is done inside the dict object by code that knows its internal structure. Both because it can be in C, and can be done without any additional references or copying. yes, it's all O(n) but a very different constant.
The
fact that they can be indexed in reasonable time is not part of the design, just an accident of implementation, and being an accident, it could change in the future.
It *could*, but I can't imagine how you could have an efficient order-preserving data structure that could not be indexed reasonably -- in particular, more efficiently than making a full list copy first. And even so -- fine: performance characteristics are not guaranteed anyway.
If random.choice should support non-sequence ordered container, just propose it to random.choice.
That would indeed solve the usability issue, and so may be a good idea. The problem here is that there is no way for random.choice to efficiently work with generic Mappings. This whole discussion started because now that dicts preserve order, there is both a logical reason, and a practical implementation, for indexing. But if that is not exposed, then neither random.choice() nor any other function can take advantage of it. Which would lead to adding a random_choice protocol -- but THAT sure seems like overkill. (OK, you could have the builtin random.choice check for an actual dict, and then use custom code to make a random selection, but that would really be a micro-optimization!)
but they can't be Sequences, since they are already Sets. They would have to be a hybrid of the two, and that, I feel, comes with more baggage than just being one or the other.
I think this is where I fundamentally disagree, as far as language design and Python philosophy is concerned.
I've been using Python for 20+ years (terrifying!) and I have always really liked the duck typing concept. In fact, even one better: it doesn't have to look, walk, and quack like a duck to be a duck -- if I only need it to quack, I don't care how it looks and walks.
Since those pre-2.0 days, Python has grown a lot more "structure" to its typing, notably ABCs and now facilities for static type checking. So far, those *enable* more formal typing, but don't *require* it. But as more folks start to use them, I'm going to have to start writing more strictly typed code if I want to use other libraries -- I'm hoping it won't come to that, but we'll see.
To bring this back to the case at hand: I haven't looked at the code, but I'm pretty sure that random.choice() does not check for the Sequence ABC: it simply tries to get the length, and then index the object to get a random item. If that works, then it works -- this is proven by passing it a dict with integer keys in the right range:
In [28]: d
Out[28]: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
In [29]: random.choice(d)
Out[29]: 9
I LIKE this -- so the argument that dict views shouldn't support indexing because they are a Set and can't be a proper Sequence is exactly backwards from how I think Python should work: if a feature is useful, and doesn't conflict with another feature, then we can add it.
In the end though, while I think there is very little reason NOT to add indexing to dict views, unless someone comes up with a good use case beyond random.choice(), it may not be worth the churn.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Sat, Jul 11, 2020 at 3:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Fri, Jul 10, 2020 at 12:45 PM David Mertz <mertz@gnosis.cx> wrote:
The strongest argument I've seen is: `list(d.items())` adds six characters.
1) Matching our mental model / usability: if I want the nth item (or a random item) from a dict, I want to ask for that -- I don't want to make a list, just to index it and throw it away. the list(d.items()) idiom is the right one if I actually need a list -- it's a bit awkward to have to make a list, just to throw it away. 2) Performance: making an entire list just to get one item out is a potentially expensive operation. Again, for the limited use cases, probably not a big deal, I'm having a really hard time imagining a application where that would be a bottleneck, but it is *a* reason, if not a compelling one.
For the single case of wanting JUST ONE random key/val pair from a dictionary EVER, I agree random.choice(d.items()) looks a little better. I even agree that it's something that probably seems obvious to try the first time. However, once you want to get multiple items, the amortized cost starts to matter. I think that latter situation is likely to be VASTLY more common in real code... but I'm just saying that without evidence. Still, this is such a narrow situation, I just don't see it as justifying a language change (the "get an item, only once" case).
Code that I can easily imagine writing (although I'm not sure I actually have):
items = dict.items()
while True:
    k, v = random.choice(items)
    if satisfies_need(k, v):
        do_stuff(k, v)
        break
The hypothetical view-sequence would be an attractive nuisance that would hurt performance in all such similar code.
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
On Sat, Jul 11, 2020 at 3:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
random.choice(the_dict.keys())
is a little easier than:
random.choice(list(the_dict.keys()))
Ummm... don't you mean:
random.choice(list(the_dict))
If it's keys you care about, I've saved you one character over your proposed style, while also reading better to me. It's only for .items() where it doesn't work. And honestly, just looking up the value from the random key is not hard.
In any case, if "reservoir sampling" is the goal here, we should just add a function `random.reservoir_sample()` to accommodate using iterators rather than sequences (https://en.wikipedia.org/wiki/Reservoir_sampling)
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
On Sat, Jul 11, 2020 at 4:33 PM David Mertz <mertz@gnosis.cx> wrote:
In any case, if "reservoir sampling" is the goal here, we should just add a function `random.reservoir_sample()` to accommodate using iterators rather than sequences (https://en.wikipedia.org/wiki/Reservoir_sampling)
Doh! I mean *iterables* not *iterators* of course. This would actually be a reasonable and useful function to have, and sampling from dict.items() would fall out transparently from that. -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
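(A rough sketch of what such a function could look like -- the name `reservoir_sample` is just the suggestion above, not anything that exists in the random module today. It's the classic one-pass Algorithm R, so it works on any iterable, including dict views:)

import random

def reservoir_sample(iterable, k):
    # keep a "reservoir" of k items; each later item replaces a slot with
    # probability k/(i+1), giving a uniform sample without knowing the length
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

reservoir_sample({'a': 1, 'b': 2, 'c': 3}.items(), 2)   # e.g. [('a', 1), ('c', 3)]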
On Sat, Jul 11, 2020 at 1:33 PM David Mertz <mertz@gnosis.cx> wrote:
On Sat, Jul 11, 2020 at 3:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
random.choice(the_dict.keys())
is a little easier than:
random.choice(list(the_dict.keys()))
Ummm... don't you mean:
random.choice(list(the_dict))
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I certainly see newbies write:
if this in dict.keys():
pretty often.
maybe adding indexing to the dict views will give the dict_keys object more reason to exist :-)
-CHB
If it's keys you care about I've saved you one character over your proposed style, while also reading better to me. It's only for .items() where it doesn't work. And honestly, just looking up the value from the random key is not hard.
In any case, if "reservoir sampling" is the goal here, we should just add a function `random.reservoir_sample()` to accommodate using iterators rather than sequences (https://en.wikipedia.org/wiki/Reservoir_sampling)
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 13 Jul 2020, at 05:55, Christopher Barker <pythonchb@gmail.com> wrote:
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I use keys() all the time to sort the keys before printing.
for name in sorted(dict.keys()):
    print(name, dict[name])
Barry
On Mon, 13 Jul 2020 at 18:42, Barry Scott <barry@barrys-emacs.org> wrote:
On 13 Jul 2020, at 05:55, Christopher Barker <pythonchb@gmail.com> wrote:
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I use keys() all the time to sort the keys before printing.
for name in sorted(dict.keys()):
    print(name, dict[name])
Why not just use sorted(dict)? Paul
I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I use keys() all the time to sort the keys before printing.
for name in sorted(dict.keys()):
    print(name, dict[name])
Why not just use sorted(dict)?
Thanks for so nicely making my point :-) Though I do like the keys() version: it seems more obvious to me. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 13 Jul 2020, at 18:56, Paul Moore <p.f.moore@gmail.com> wrote:
On Mon, 13 Jul 2020 at 18:42, Barry Scott <barry@barrys-emacs.org> wrote:
On 13 Jul 2020, at 05:55, Christopher Barker <pythonchb@gmail.com> wrote:
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I use keys() all the time to sort the keys before printing.
for name in sorted(dict.keys()):
    print(name, dict[name])
Why not just use sorted(dict)?
Ooo shiny! I do like that version of sorted(d). I got curious and found that iter(d) is the same as iter(d.keys()). Barry
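(A tiny illustration of that equivalence, on a toy dict:)

d = {"b": 2, "a": 1}
sorted(d)                                # ['a', 'b']
sorted(d.keys())                         # ['a', 'b'] -- same result
list(iter(d)) == list(iter(d.keys()))    # True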
On 13/07/2020 18:42, Barry Scott wrote:
On 13 Jul 2020, at 05:55, Christopher Barker <pythonchb@gmail.com <mailto:pythonchb@gmail.com>> wrote:
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I use keys() all the time to sort the keys before printing.
for name in sorted(dict.keys()):
    print(name, dict[name])
Barry
But you could use items() instead:
for name, val in sorted(dict.items()):
    print(name, val)
Rob Cliffe
On Mon, Jul 13, 2020 at 11:15 AM Rob Cliffe <rob.cliffe@btinternet.com> wrote:
On 13 Jul 2020, at 05:55, Christopher Barker <pythonchb@gmail.com> wrote:
In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
But you could use items() instead:
for name, val in sorted(dict.items()):
    print(name, val)
Sure. But my point was that the dict_keys object is very often unnecessary, not that all the views aren't useful. And it really is mostly only there for symmetry with dict_items(), which IS needed. I'm on the fence though -- I think it adds clarity, even if it's unnecessary. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
re-awakening this thread, because I just happened on an actual real-world use case: I need the first 255 items in a dict, as a dict. Order is important, and I can count on an order-preserving dict. I ended up with:
from itertools import islice
smaller_dict = dict(islice(large_dict.items(), 0, 255))
which works, and isn't doing any unnecessary copying, but it's pretty darn ugly, as far as I'm concerned. I could also do it the old fashioned way:
smaller_dict = {}
for i, (k, v) in enumerate(large_dict.items()):
    if i >= 255:
        break
    smaller_dict[k] = v
which is worse. I'd much rather:
dict(large_dict.items()[:255])
I'd even more prefer:
large_dict[:255]
which would, in theory, be possible, as slices are not hashable, so no one is using a slice as an actual dict key in current code -- but yeah, that's too much magic to ask for.
To be fair, this isn't that much data, and it's in memory already, so:
dict(tuple(large_dict.items())[:255])
would work just fine. But as Python has moved toward a lot more iterables, and lazy evaluation, it seems like we shouldn't have to make a full copy just to pull a fraction of something out.
Note that this is related to an earlier thread I started about having a way to lazy-slice a Sequence -- essentially islice, but with slice syntax. This is also related a bit to another thread about keywords in indexing -- there were examples there where it would be nice to have slice syntax in more places, like function calls; then this could be:
So: anyone have a cleaner way to accomplish this without any changes to Python? This is still an oddball use case, so may not be a reason to change Python -- but it IS a real world, operational code, use case :-)
-CHB
On Sun, Jul 12, 2020 at 9:55 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Jul 11, 2020 at 1:33 PM David Mertz <mertz@gnosis.cx> wrote:
On Sat, Jul 11, 2020 at 3:45 PM Christopher Barker <pythonchb@gmail.com> wrote:
random.choice(the_dict.keys())
is a little easier than:
random.choice(list(the_dict.keys()))
Ummm... don't you mean:
random.choice(list(the_dict))
well, sure, though I have to say that I think that that's an unfortunate confusing thing about python dicts. In fact, I doubt there are many uses at all for dict.keys() -- most uses can just use the dict.
I certainly see newbies write:
if this in dict.keys():
pretty often.
maybe adding indexing to the dict views will give the dict_keys object more reason to exist :-)
-CHB
If it's keys you care about I've saved you one character over your proposed style, while also reading better to me. It's only for .items() where it doesn't work. And honestly, just looking up the value from the random key is not hard.
In any case, if "reservoir sampling" is the goal here, we should just add a function `random.reservoir_sample()` to accommodate using iterators rather than sequences (https://en.wikipedia.org/wiki/Reservoir_sampling)
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
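(For the "first 255 items" case above, one way to hide the islice call behind a name -- `head` here is just a made-up helper, not anything in the stdlib, and the dict is a toy stand-in for the real data:)

from itertools import islice

def head(d, n):
    """First n items of a mapping, as a dict (made-up helper)."""
    return dict(islice(d.items(), n))

large_dict = {i: i * i for i in range(1000)}
smaller_dict = head(large_dict, 255)
len(smaller_dict)   # 255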
Christopher Barker writes:
from itertools import islice
smaller_dict = dict(islice(large_dict.items(), 0, 255))
which works, and isn't doing any unnecessary copying, but it's pretty darn ugly, as far as I'm concerned.
In your application, I think that's just pretty, myself. The only thing that's missing is slice notation. But that's probably not hard to implement in terms of the islice function, just completely redundant.
.iloc[] is the Pandas function for accessing by integer-location: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....
"""
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
"""
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#select...
I don't remember why df.iloc[3] is preferred over directly getitem'ing df[3]?
- Because `3` could be a key or an integer-location; to avoid ambiguity
- Because changing between df[3] and df.loc[3] and df[3:5] and df.iloc[3:5] is unnecessary cognitive burden: it's easier to determine that the code is intentionally accessing by integer-location from '.iloc' than from ':' (indicating slice notation)
odict.iloc[3]
Would this be the only way to access only item 4 from an odict of length greater than 4 with slice notation for dict views generated from selection by integer-location?
odict[3:4]
What does this do? Confusing to a beginner:
odict[3,]
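(For reference, a quick sketch of how .iloc behaves in pandas, on a toy DataFrame; this assumes pandas is installed:)

import pandas as pd

df = pd.DataFrame({"col": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
df.loc["c"]      # label-based: the row whose index label is "c"
df.iloc[2]       # position-based: the third row, regardless of labels
df.iloc[1:3]     # position-based slice: rows at positions 1 and 2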
On Wed, Jul 29, 2020, 1:28 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Christopher Barker writes:
from itertools import islice
smaller_dict = dict(islice(large_dict.items(), 0, 255))
which works, and isn't doing any unnecessary copying, but it's pretty darn ugly, as far as I'm concerned.
In your application, I think that's just pretty, myself. The only thing that's missing is slice notation. But that's probably not hard to implement in terms of the islice function, just completely redundant.
On Wed, Jul 29, 2020 at 7:23 AM Wes Turner <wes.turner@gmail.com> wrote:
.iloc[] is the Pandas function for accessing by integer-location:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....
I'm not sure I quite get the analogy here, but ...
odict.iloc[3]
Would this be the only way to access only item 4 from an odict of length greater than 4 with slice notation for dict views generated from selection by integer-location?
Are you suggesting that we add another attribute to dicts to provide "index" access? I'm not sure of the advantage of that, as we already have the dict views, so I guess it's nice not to have to type the parentheses:
odict.items()[3] IS a bit klunkier than odict.iloc[3]
(Side note: why ARE the views provided by methods, rather than properties? But I digress ...)
odict[3:4]
that would be possible.
What does this do? Confusing to a beginner:
odict[3,]
no more confusing than it is for Sequences, is it?
In [4]: l[3,]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-ca8ec5ca5475> in <module>
----> 1 l[3,]
TypeError: list indices must be integers or slices, not tuple
But yes, if we allowed dicts to be indexable with slices, they still couldn't be indexed by single indexes (unless that value happened to be a key), so there would be no way to get a single item by index, as a length-1 slice would presumably return a dict with one item.
So yeah, if indexing or slicing were to happen, it would have to be in the views.
But this does bring up that there are a lot of places where slice notation could be used outside of [] -- and that would be handy for use cases like pandas, even if not for dicts.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Are you suggesting that dict views would become subscriptable?
In [3]: %doctest_mode
Exception reporting mode: Plain
Doctest mode is: ON
x = dict(a=2, b=3, c=4)
x.items()[3]
Traceback (most recent call last):
  File "<ipython-input-5-519a9a9c5ae6>", line 1, in <module>
    x.items()[3]
TypeError: 'dict_items' object is not subscriptable
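(For what it's worth, a rough sketch of the kind of behaviour being proposed, written as a wrapper in today's Python -- the class name is made up, and it only handles non-negative indices and slices, since islice doesn't accept negative values:)

import itertools

class IndexableItems:
    """Hypothetical wrapper giving d.items() the proposed [n] behaviour."""
    def __init__(self, d):
        self._d = d
    def __len__(self):
        return len(self._d)
    def __iter__(self):
        return iter(self._d.items())
    def __getitem__(self, index):
        if isinstance(index, slice):
            # non-negative start/stop/step only, as islice requires
            return list(itertools.islice(self._d.items(),
                                          index.start, index.stop, index.step))
        try:
            return next(itertools.islice(self._d.items(), index, None))
        except StopIteration:
            raise IndexError(index)

x = dict(a=2, b=3, c=4)
items = IndexableItems(x)
items[1]      # ('b', 3)
items[0:2]    # [('a', 2), ('b', 3)]
# random.choice(items) would also work now, since choice() only needs len() and [i]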
On Thu, Jul 30, 2020, 12:05 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Jul 29, 2020 at 7:23 AM Wes Turner <wes.turner@gmail.com> wrote:
.iloc[] is the Pandas function for accessing by integer-location:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....
I'm not sure I quite get the analogy here, but ...
odict.iloc[3]
Would this be the only way to access only item 4 from an odict of length greater than 4 with slice notation for dict views generated from selection by integer-location?
Are you suggesting that we add another attribute to dicts to provide "index" access? I'm not sure of the advantage of that, as we already have the dict views, so I guess it's nice not to have to type the parentheses:
odict.items()[3] IS a bit klunkier than odict.iloc[3]
(Side note: why ARE the views provided by methods, rather than properties? But I digress ...)
odict[3:4]
that would be possible.
What does this do? Confusing to a beginner:
odict[3,]
no more confusing than it is for Sequences, is it?
In [4]: l[3,]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-ca8ec5ca5475> in <module>
----> 1 l[3,]
TypeError: list indices must be integers or slices, not tuple
But yes, if we allowed dicts to be indexable with slices, they still couldn't be indexed by single indexes (unless that value happened to be a key) so there would be no way to get a single item by index, as a length-1 slice would presumably return a dict with one item.
So yeah, if indexing, slicing were to happen, it would have to be in the views.
But this does bring up that there are a lot of places where slice notation could be used outside of [] -- and that would be handy for use cases like pandas, even if not for dicts.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Thu, Jul 30, 2020 at 22:23 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
On 31/07/20 4:04 am, Christopher Barker wrote:
(Side note: why ARE the views provided by methods, rather than properties?)
Probably because they construct objects, and are therefore somewhat more expensive than is usually expected for an attribute access.
Also for compatibility (even if imperfect) with Python 2. It’s water over the dam now. —Guido -- --Guido (mobile)
On Tue, Jul 28, 2020 at 10:26 PM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Christopher Barker writes:
from itertools import islice
smaller_dict = dict(islice(large_dict.items(), 0, 255))
which works, and isn't doing any unnecessary copying, but it's pretty darn ugly, as far as I'm concerned.
In your application, I think that's just pretty, myself.
well, beauty is in the eye of the beholder, of course. But really, you think it's pretty to import itertools, then make a function call, for what's very much a slicing operation?

Python has been shifting away from a sequence-focused to an iteration-focused approach for a while -- and while I do see the benefits, I also see the drawbacks. And it seems to be coupled with what might be called a shift to a functional style.

This reminds me a bit of a thread on this list a couple years(?) back, about a built-in "group by" operation. It petered out, but one point of disagreement was whether folks wanted an API that was more "functional", e.g. map-like, or comprehension-based. Comprehensions are functional, and at least generator comprehensions (expressions) are all about iterables, but there is a different style as to whether you are mostly using expressions or passing functions around. I personally far prefer the comprehension style, but was surprised by how many folks preferred the "functional" style.

This is not exactly the same, but it "feels" similar to me -- I really prefer things like expressions and slice notation to a pile of nested function calls.

Think about it -- if Python had started out being built around iterables, and itertools had been there from the beginning, would Sequences simply be iterables that happened to have a length? And would people be saying that:

itertools.islice(a_list, 0, 100, 2)

was a pretty way to get a portion of a list -- why would we want "slicing" notation at all?

That's being dramatic, but to bring this back around, the original title of this thread is: "Access (ordered) dict by ..."

The point being that dicts are now ordered, so maybe it makes sense to add some operations that make it natural and easy to work with them in an ordered way -- you know, like slicing :-)

I don't expect this to be accepted; folks didn't seem to like the idea much, and it's not a very big deal, as we've seen this is only the second use case anyone has come up with (the other being selecting a random item out of a dict). But it did strike me when I had the use case. And I'm well aware of islice, and yet it still took me a LOT longer to write that code than it would have to do a slice :-)

that's missing is slice notation. But that's probably not hard to
implement in terms of the islice function, just completely redundant.
I don't quite follow -- the slice notation is very important here -- I'm not sure what you mean by "implement in terms of the islice function", but that does bring up another point: islice does not provide a way to do the equivalent of negative indexes -- because it's designed to work with iterables that don't have a length. But if you ARE using it on a Sequence -- that's a missing feature.

-CHB

--
Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
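One possible workaround for the negative-index gap, sketched here as a hypothetical helper (not an existing itertools function): when the iterable is sized, such as a dict or one of its views, negative slice bounds can be normalized with slice.indices() before handing off to islice:

```python
from itertools import islice

def islice_seq(iterable, start=None, stop=None, step=None):
    """islice, but negative start/stop are allowed when the iterable is sized (hypothetical helper)."""
    n = len(iterable)  # requires a sized iterable, e.g. a dict or a dict view
    start, stop, step = slice(start, stop, step).indices(n)
    return islice(iterable, start, stop, step)  # note: islice still rejects a negative step

d = {x: x.upper() for x in 'abcdef'}
assert list(islice_seq(d.items(), -2, None)) == [('e', 'E'), ('f', 'F')]
assert list(islice_seq(d, 1, -1, 2)) == ['b', 'd']
```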
On Thu, Jul 30, 2020 at 11:55 AM Christopher Barker <pythonchb@gmail.com> wrote:
smaller_dict = dict(islice(large_dict.items(), 0, 255))

well, beauty is in the eye of the beholder, of course. But really, you think it's pretty to import itertools, then make a function call, for what's very much a slicing operation?
In your post that introduced that, that is *exactly* what I thought of when you first described the issue, before I read down to your solution. It took a fraction of a second to think of for me. It *is* slightly verbose. What do you think about this in current Python:
>>> large = {i:i**2 for i in range(1000)}
>>> import sys
>>> from itertools import islice
>>> class IterSlicer:
...     def __getitem__(self, what):
...         it, sl = what
...         return islice(it, sl.start or 0, sl.stop or sys.maxsize, sl.step or 1)
...
>>> IS = IterSlicer()
>>> dict(IS[large.items(), 3:10:2])
{3: 9, 5: 25, 7: 49, 9: 81}
>>> from itertools import count
>>> set(IS[count(), 10:100:9])
{64, 37, 73, 10, 46, 82, 19, 55, 91, 28}
Potentially IterSlicer, or IS, could live in itertools (or more likely more-itertools) just to provide a short way to use slice notation with an arbitrary iterable... not limited to dict.items(). -- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
On Thu, Jul 30, 2020 at 08:52:34AM -0700, Christopher Barker wrote:
The point being that dicts are now ordered, so maybe it makes sense to add some operations that make it natural and easy to work with them in an ordered way -- you know, like slicing :-)
Dicts are not sequences and cannot support ordering operations like insert, sort, prepend, reverse or slice assignment. They merely preserve insertion order.

You can't insert a key in a specific position. If I have this dict:

mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}

I can't insert 'b':2 between keys 'a' and 'c', except by creating a new dict. Nor can I sort the keys, or reverse them, except by putting them into a list and sorting or reversing the list, and only then putting them into a dict.

We could read the key and/or value at index N; that is straightforward, if we can decide on a colour for the bike shed. But I don't believe that there is a straightforward way of replacing the key at index N, or of inserting keys, or swapping them, or directly changing their order. Dicts are, sort of, like a stack: you can "insert" (append) new keys at the end, but not in the middle.

So let's please stop over-generalising by saying dicts are "ordered" without the qualifier that they only preserve insertion order, just like collections.OrderedDict.

The minimal API we could support would be a getter that returns the key at index N. Once you have the key, you can get or set the value. So all we really need is that one single getter. Anything else is gravy.

dict.getkey(n)  # returns the key at index n by insertion order

I wouldn't object to:

dict.getitem(n)  # returns (key, value) at index n

since it's hardly more trouble to return both the key and the value at the same time. But more than that seems difficult to justify, and supporting slices seems like overkill. You can always get them with a comprehension:

[mydict.getkey(n) for n in range(3, 49, 7)]

--
Steven
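The proposed dict.getkey() / dict.getitem() methods don't exist; their intended behaviour can be sketched today as plain functions over islice (names taken from the proposal above, purely illustrative):

```python
from itertools import islice

def getkey(d, n):
    """Key at insertion position n (as the proposed dict.getkey(n) would behave)."""
    try:
        return next(islice(d, n, None))
    except StopIteration:
        raise IndexError(n) from None

def getitem(d, n):
    """(key, value) pair at insertion position n (as the proposed dict.getitem(n) would behave)."""
    key = getkey(d, n)
    return key, d[key]

mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}
assert getkey(mydict, 1) == 'c'
assert getitem(mydict, 1) == ('c', 3)
assert [getkey(mydict, n) for n in range(0, 4, 2)] == ['a', 'd']
```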
On Thu, 30 Jul 2020 at 19:24, Steven D'Aprano <steve@pearwood.info> wrote:
You can't insert a key in a specific position. If I have this dict:
mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}
I can't insert 'b':2 between keys 'a' and 'c', except by creating a new dict.
Not sure about this. In C code, dicts are a hashtable and an array of items. In theory, nothing prevents you from inserting a new key in a specific position of the key array instead of at the end.
On Thu, Jul 30, 2020, 3:37 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Thu, 30 Jul 2020 at 19:24, Steven D'Aprano <steve@pearwood.info> wrote:
You can't insert a key in a specific position. If I have this dict:
mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}
I can't insert 'b':2 between keys 'a' and 'c', except by creating a new dict.
Not sure about this. In C code, dicts are a hashtable and an array of items. In theory, nothing prevents you from inserting a new key in a specific position of the key array instead of at the end.
Nothing but the cost of shifting successive elements by 1 and sometimes copying the entire array to a new, larger array.

Would these be the *non-mutating* methods desired of insertion-ordered dicts?

.iloc[sli:ce]
.iloc[int]
.iloc[[list,]]
.iloc[callable]
.iloc[bitmask]

.index(key)
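For illustration only -- nothing like this exists on dict -- here is roughly how a read-only, pandas-style indexer could dispatch on the selector kinds listed above (int, slice, list of positions, callable, boolean mask), built on top of the items view:

```python
from itertools import compress

class DictIloc:
    """Hypothetical read-only .iloc-style indexer over a dict's items (sketch, not a real API)."""
    def __init__(self, d):
        self._d = d

    def __getitem__(self, sel):
        items = list(self._d.items())               # simple but O(n); a real version would avoid the copy
        if isinstance(sel, int):
            return items[sel]
        if isinstance(sel, slice):
            return dict(items[sel])
        if callable(sel):
            # keep the (key, value) pairs the predicate accepts (simplified vs. pandas callables)
            return dict(item for item in items if sel(item))
        if all(isinstance(x, bool) for x in sel):   # boolean mask
            return dict(compress(items, sel))
        return dict(items[i] for i in sel)          # list of integer positions

d = {x: x.upper() for x in 'abcd'}
iloc = DictIloc(d)
assert iloc[1] == ('b', 'B')
assert iloc[1:3] == {'b': 'B', 'c': 'C'}
assert iloc[[0, 3]] == {'a': 'A', 'd': 'D'}
assert iloc[[True, False, False, True]] == {'a': 'A', 'd': 'D'}
assert iloc[lambda kv: kv[1] in 'AB'] == {'a': 'A', 'b': 'B'}
```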
On Thu, Jul 30, 2020 at 05:03:58PM -0400, Wes Turner wrote:
Would these be the *non-mutating* methods desired of insertion-ordered dicts?
.iloc[sli:ce]
.iloc[int]
.iloc[[list,]]
.iloc[callable]
.iloc[bitmask]
.index(key)
No. "iloc" sounds like a new Apple product to keep strangers out of your home, or maybe something to do with iterators. Both slicing and "[list,]" arguments can be easily performed using a comprehension. I have no idea what calling it with a callable or bitmask is intended to do, or why they would be useful, but bitmasks are usually ints, so how would it distinguish between the item at index 3 and the item with bitmask 3? Is there a use-case for retrieving the insertion position of a key, or are we just adding it because we can? -- Steven
What new syntax do you propose? On Thu, Jul 30, 2020 at 5:55 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jul 30, 2020 at 05:03:58PM -0400, Wes Turner wrote:
Would these be the *non-mutating* methods desired of insertion-ordered dicts?
.iloc[sli:ce]
.iloc[int]
.iloc[[list,]]
.iloc[callable]
.iloc[bitmask]
.index(key)
No.
"iloc" sounds like a new Apple product to keep strangers out of your home, or maybe something to do with iterators.
Both slicing and "[list,]" arguments can be easily performed using a comprehension.
I have no idea what calling it with a callable or bitmask is intended to do, or why they would be useful, but bitmasks are usually ints, so how would it distinguish between the item at index 3 and the item with bitmask 3?
Is there a use-case for retrieving the insertion position of a key, or are we just adding it because we can?
-- Steven
On Thu, Jul 30, 2020 at 06:17:24PM -0400, Wes Turner wrote:
What new syntax do you propose?
I don't propose new syntax. If this functionality is needed, regular method syntax will do the job without misleading people to think that dicts are sequences that ought to support the full sequence API or the full set of list methods.

One or the other of these methods should be sufficient for the uses I've seen:

dict.getkey(index)   # return key
dict.getitem(index)  # return (key, value)

I'm not even convinced that there is a compelling use-case for this functionality at all, but one or both of the above seem more than sufficient. Once you have the key, you can do everything else.

If there is a need to get the insertion index of a key, that could also be a method, but let's hold off adding that unless there is a good reason.

--
Steven
On Thu, Jul 30, 2020 at 09:35:31PM +0200, Marco Sulla wrote:
On Thu, 30 Jul 2020 at 19:24, Steven D'Aprano <steve@pearwood.info> wrote:
You can't insert a key in a specific position. If I have this dict:
mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}
I can't insert 'b':2 between keys 'a' and 'c', except by creating a new dict.
Not sure about this. In C code, dicts are a hashtable and an array of items. In theory, nothing prevents you from inserting a new key in a specific position of the key array instead of at the end.
Of course anything is possible in theory, but I think that would require shifting the existing keys and values in the array, and updating the entries in the hash table to point to their key in the new position, and probably other complications I haven't thought of.

Of course it is possible to have a data structure with O(1) insertions and updates by key, and O(1) insertions and updates by index, and fully reorderable keys. But it probably won't be small and lean and fast. Just because something takes constant time doesn't mean it will be a *fast* constant time.

In any case, if people want to propose turning dicts into fully-fledged reorderable, slicable sequences as well as mappings, they will probably need a PEP :-)

--
Steven
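To make that cost concrete: with today's dict, the only way to "insert" a key at a given position is to rebuild the mapping, which is O(n). A rough sketch (the insert_at helper is purely illustrative):

```python
def insert_at(d, pos, key, value):
    """Return a NEW dict with (key, value) inserted at position pos (illustrative helper only)."""
    items = list(d.items())    # O(n) copy; this is exactly the cost being discussed
    items.insert(pos, (key, value))
    return dict(items)

mydict = {'a': 1, 'c': 3, 'd': 4, 'e': 5}
mydict = insert_at(mydict, 1, 'b', 2)   # like "ALTER TABLE ... ADD COLUMN b AFTER a"
assert list(mydict.items()) == [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
```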
On Thu, 30 Jul 2020 at 23:42, Steven D'Aprano <steve@pearwood.info> wrote:
Of course anything is possible in theory, but I think that would require shifting the existing keys and values in the array, and updating the entries in the hash table to point to their key in the new position, and probably other complications I haven't thought of.
Also list must shift the internal array. You don't have to update the hashtable. For example, when you pop an item, the key is not removed from the hashtable completely, but it's marked as dummy.

I think the main problem here is set, since set is not ordered. In theory, dict views could be an "ordered" set, so you can do:

d.keys()[5]
d.values()[1:]
d.items()[3] = "omg"

In any case, if people want to propose turning dicts into fully-fledged
reorderable, slicable sequences as well as mappings, they will probably need a PEP :-)
I'm just brainstorming. Not sure about the real use-case.
# Dicts and DataFrames

- Src: https://github.com/westurner/pythondictsanddataframes/blob/master/dicts_and_...
- Binder: https://mybinder.org/v2/gh/westurner/pythondictsanddataframes/master?filepat...
  - (interactive Jupyter Notebook hosted by https://mybinder.org/ )

## Question / Problem / Challenge

Question: Now that dicts are insertion-ordered (since Python 3.6), can we lookup items by integer-location?

- As of CPython 3.10: No; the public Python `dict` API neither:
  - (1) offers any method to access keys, values, or items by integer-location; nor
  - (2) exposes anything from the underlying C code like `dict._array` which could be used for such a method. `dict._array` would be considered an implementation detail that could be different in Python versions and implementations.

How could lookup of `dict` keys, values, and items *by integer-location* be implemented?

- This is the subject of this document.

## Background

- The CPython dict is an insertion-ordered Hashmap: https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
  - https://github.com/python/cpython/blob/master/Objects/dictobject.c
  - https://github.com/python/cpython/blob/master/Objects/odictobject.c
- The Pandas Series and DataFrames are insertion-ordered columnar data structures
  - https://pandas.pydata.org/pandas-docs/stable/reference/series.html
  - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.htm...
  - https://github.com/pandas-dev/pandas/blob/master/pandas/core/series.py
  - https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
  - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....
  - https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py

```python
import itertools
import pandas as pd
import pytest
import random

try:
    display  # IPython
except NameError:
    def display(*args):
        print(args)

pd.set_option('display.notebook_repr_html', False)
```

### CPython dict basics

```python
odict = {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}
odict = dict(a='A', b='B', c='C', d='D')
odict = dict((x, x.upper()) for x in 'abcd')  # generator expression
odict = {x: x.upper() for x in 'abcd'}        # dict comprehension
odict = dict.fromkeys('abcd')
[odict.__setitem__(x, x.upper()) for x in 'abcd']
display(odict)

assert list(odict.keys()) == list('abcd') == ['a', 'b', 'c', 'd']
assert random.seed(1) or list(odict.keys()) == random.seed(2**10) or list(odict.keys())
assert list(odict.values()) == list('ABCD')
assert list(odict.items()) == [('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D')]
```

    {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}

`itertools.islice(dict.items())` is suboptimal for a number of cases because we don't need to iterate through the items at the beginning: we could directly address the underlying array.

```python
# This would not call next() unnecessarily:
with pytest.raises(AttributeError):
    # 'dict' object has no attribute '_array'
    odict._array[3]  # _array[3]
```

## Possible Solutions

### No changes to CPython

#### Make an unnecessary copy of the whole dict in order to only take(3)

```python
assert list(odict.items())[3] == ('d', 'D')
```

#### Call `itertools.islice(dict.items())`

- Does this call `next()`, `next()`, `next()`, `next()` unnecessarily?
- `itertools.islice(dict.items())` is suboptimal for a number of cases because we don't need to iterate through the items at the beginning. Directly addressing the underlying array would be much faster but unlikely to happen because the underlying array is an implementation detail.
```python
list(itertools.islice(odict.items(), 0))
```

    []

```python
list(itertools.islice(odict.items(), 3))
```

    [('a', 'A'), ('b', 'B'), ('c', 'C')]

```python
list(itertools.islice(odict.items(), 3, 3+1))
```

    [('d', 'D')]

### Change CPython

#### Expose something like `dict._array`

- Again, the underlying array is a \[CPython,] implementation detail
- So, there must be methods that provide access to the [(key, item),] data that hide the implementation details.

#### Overload `dict.__getitem__` (`dict[]`)

`dict[]` (`dict.__getitem__`) is already used for lookup by key value, so `dict[3]` could either be lookup by value or lookup by integer-location: which would be dangerously ambiguous at runtime and frustratingly vague to review. (See also the note below regarding the fate of the Pandas `.ix.__getitem__` method.)

#### Make all iterators subscriptable: define `iterator.__getitem__`

`dict.items()[3]` fails with a `TypeError: 'dict_items' object is not subscriptable`: `dict.items()` returns a view (instead of an unnecessarily copied list like in Python 2) which is not subscriptable.

Could we define `iterator.__getitem__` such that:

```python
obj = list('abcd')
iter(obj)[3]     => islice(obj, 3, 3+1)
iter(obj)[3:4]   => islice(obj, 3, 4)
iter(obj)[0:4:2] => islice(obj, 0, 4, 2)
```

- This would require a change to the python grammar.
- This would result in implicit iteration/traversal, which may be destructive and opaquely resource-prohibitive.
- This would not break existing code.

#### Add `dict.getkey(index)` and `dict.getitem(index)`

```python
def getkey(self: dict, index: int):
    pass

def getitem(self: dict, index: int):
    pass
```

- This does not support slices
- This still has to unnecessarily iterate with `islice` without something like `dict._array`

#### Make `dict.keys()`, `dict.values()`, and `dict.items()` subscriptable

```python
obj = dict.fromkeys('abcd')
obj.keys()[3] => next(islice(obj, 3, 3+1))
```

- This would require a change to the python grammar.
- This would be a special case; and then we'd all be asking for subscriptable iterators, too.
- This would not break existing code.

```python
# iterator[3] does not call islice(iterator, 3):
with pytest.raises(TypeError):
    # 'dict_keys' object is not subscriptable
    odict.keys()[3]

with pytest.raises(TypeError):
    # 'dict_values' object is not subscriptable
    odict.values()[3]

with pytest.raises(TypeError):
    # 'dict_items' object is not subscriptable
    odict.items()[3]
```

#### Define `dict.keys.__getitem__` and handle slices

```python
obj = dict.fromkeys('abcd')
obj.keys[3] => obj._array[3]
```

- This would be backward incompatible because `dict.keys()` is a method and not a property. To not break all existing code, it would have to be:

```python
obj = dict.fromkeys('abcd')
obj.keys()[3] => obj._array[3]
```

  Which brings us back to the previous question.
- Would this be confusing to newcomers? Why don't other iterators work that way?

#### Copy the Pandas Series / DataFrame `.iloc` API?

##### How is the Pandas DataFrame API at all relevant to dicts?

- Pandas DataFrames (and Series, which have an index and one column; more like dicts) support key/value lookup just like dicts
- IIUC, OP is asking for lookup by integer-location; which is supported by `DataFrame.iloc` (and `Series.iloc`)
- Pandas already handles the presented use cases with an `.iloc` method that handles slices and tuples (and callables).
- Pandas [used to have `.ix`]( https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFr...), which was:

  > A primarily label-location based indexer, with integer position fallback.
  >
  > Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

- The Pandas API is now also implemented by Dask, Modin, CuDF. It may be the most widely-used DataFrame API.
- Granted, a DataFrame is not the same as an OrderedHashmap; because we often find that data is multi-columnar and we usually don't want to have to `lookup`/`seek()` to the n-th field of each hashmap value. But we do want indexed data with fast sequential reads and the option to return one or more items by integer-location.

##### Add an `.iloc` property with a `__getitem__` to `dict` that returns just the key (slice) or the `(key, item)` (slice)

```python
obj = dict.fromkeys('abcd')
obj.iloc[3] => obj._array[3][0]  # key-only?
# or
obj.iloc[3] => obj._array[3]     # (key, item)?
```

##### Add an `.iloc` property to the `.keys`, `.values`, and `.items` methods?

```python
obj.keys.iloc[3]   => obj._array[3][0]
obj.values.iloc[3] => obj._array[3][1]
obj.items.iloc[3]  => obj._array[3]
```

- Is there any precedent for adding a property to a method in the CPython standard library?
- It can be done with *slow* init-time binding and/or metaclassery.
  - See: `IlocIndexer` below

#### Add `.keys_iloc()`, `.values_iloc()`, and `.items_iloc()` to `dict`

- Does not require binding `IlocIndexer` to `.keys.iloc`, `.values.iloc`, and `.items.iloc` at `dict.__init__` time or metaclassery.
- Supports slices

```python
def _dict_view_islice(iterable, start: int=None, stop: int=None, step: int=None):
    # This implementation calls islice()
    # which, AFAIU, calls next() a bunch of times unnecessarily
    _slice = slice(start, stop, step) if not isinstance(start, slice) else start
    return itertools.islice(iterable, _slice.start, _slice.stop, _slice.step)

def keys_iloc(self: dict, start: int=None, stop: int=None, step: int=None):
    return _dict_view_islice(self, start, stop, step)

def values_iloc(self: dict, start: int=None, stop: int=None, step: int=None):
    return _dict_view_islice(self.values(), start, stop, step)

def items_iloc(self: dict, start: int=None, stop: int=None, step: int=None):
    return _dict_view_islice(self.items(), start, stop, step)

assert next(keys_iloc(odict, 3)) == 'd'
assert next(values_iloc(odict, 3)) == 'D'
assert next(items_iloc(odict, 3)) == ('d', 'D')
assert list(items_iloc(odict, 0, 4, 2)) == [('a', 'A'), ('c', 'C')]
assert list(items_iloc(odict, slice(0, 4, 2))) == [('a', 'A'), ('c', 'C')]

# ...

def _dict_view_ilookup(obj: dict, start: int=None, stop: int=None, step: int=None):
    # This implementation would access the underlying dict array values directly
    # (without calling iter() and next() on dict views)
    _slice = slice(start, stop, step) if not isinstance(start, slice) else start
    return obj._array[_slice]
```

### Support passing slices to methods so that `.keys(1:2)` or `.keys_iloc(1:2)` works

- AFAIU, this would require a change to the Python grammar.

### Add `start`, `stop`, and `step` arguments to `.keys()`, where `start` can optionally be a `slice()`

- This would not break any existing code, but alone would not support normal slice syntax.
```python
class IlocIndexer:
    def __init__(self, obj):
        self.obj = obj

    def __getitem__(self, obj):
        if isinstance(obj, int):
            _slice = slice(obj, obj+1)
        elif isinstance(obj, slice):
            _slice = obj
        else:
            _slice = slice(obj)
        return itertools.islice(self.obj(), _slice.start, _slice.stop, _slice.step)
        # return self.obj._array[_slice]


class Dict(dict):
    def keys(self):
        return super().keys()

    def values(self):
        return super().values()

    def items(self):
        return super().items()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keys.__func__.iloc = IlocIndexer(self.keys)
        self.values.__func__.iloc = IlocIndexer(self.values)
        self.items.__func__.iloc = IlocIndexer(self.items)


d = Dict.fromkeys('abcd')
d = Dict(odict)
assert list(d.keys()) == list('abcd')
assert list(d.keys.iloc[0]) == list('a')
assert list(d.keys.iloc[2:4]) == list('cd')
assert list(d.values()) == list('ABCD')
assert list(d.values.iloc[0]) == list('A')
assert list(d.values.iloc[2:4]) == list('CD')
assert list(d.items()) == [(x, x.upper()) for x in 'abcd']
assert list(d.items.iloc[0]) == [('a', 'A')]
assert list(d.items.iloc[2:4]) == [('c', 'C'), ('d', 'D')]
```

## If you give a mouse a cookie

**Once we've added the ability to lookup from (now ordered) dicts by integer-location, what else will people ask for?**

Probably features from `pandas.Series`, `pandas.DataFrame`, `pandas.MultiIndex`:

- Slices
- Multidimensional lookup
- Lookup of integer-locations

### Slices

```python
df = pd.DataFrame.from_dict(odict, orient='index')
display(df)
df.columns = [0]
assert list(df.iloc[2:4]) == [0]
df.columns = ['letter']
assert list(df.iloc[2:4]) == ['letter']
```

       0
    a  A
    b  B
    c  C
    d  D

### Multidimensional lookup: `.iloc[0, 1, 0]`

```python
df = pd.DataFrame.from_dict(odict, orient='index')
df.columns = ['letter']
display(df)
assert df.iloc[2, 0] == 'C'
assert df.iloc[3, 0] == 'D'
assert list(df.iloc[2:4, 0]) == ['C', 'D']
df = df.T
display(df)
assert list(df.iloc[0, 2:4]) == ['C', 'D']
```

      letter
    a      A
    b      B
    c      C
    d      D

            a  b  c  d
    letter  A  B  C  D

### Lookup of integer-locations

#### Python `list.index(value)`

```python
alist = list('abcd')
assert alist.index('d') == 3
```

#### Python `dict.get(key, default=None)`

```python
assert odict.get('d') == 'D'
```

Something like this would sort next to `.get()` when tab-completing:

```python
assert odict.getkeypos('d') == 3
```

#### Pandas

```python
df = pd.DataFrame.from_dict(odict, orient='index')
display(df)

# Lookup ordinal integer-location by index value:
assert df.index.get_loc('d') == 3
assert df.iloc[df.index.get_loc('d')].tolist() == ['D']

# Lookup index values by column value:
assert df[df[0] == 'D'].index.tolist() == ['d']
df.columns = ['letters']
assert df[df['letters'] == 'D'].index.tolist() == ['d'] == \
    df.query('letters == "D"').index.tolist()

# Lookup ordinal integer-location(s) of value:
assert [df.index.get_loc(idxval) for idxval in df[df['letters'] == 'D'].index.tolist()] == [3]

import numpy as np
assert list(np.where(df['letters'].values == 'D')[0]) == [3]
```

       0
    a  A
    b  B
    c  C
    d  D

```python
assert \
    df[df['letters'].eq('D')].index == \
    df.index[df['letters'].eq('D')] == \
    df.query('letters == "D"').index == \
    df[df['letters'] == 'D'].index == \
    df.eval('x = letters == "D"').query("x").index
```

### Why would I use Series or DataFrame here instead of dict?
**How do and why would I refactor code to use `Series` or `DataFrame` instead of `dict`?**

- Why:
  - performance & scalability (Cython, profiling, dask.distributed scales the DataFrame API to multiple processes and/or multiple machines with datasets larger than can fit into RAM)
  - Maintainability & security: nobody wants to figure out your poor-man's in-memory datastore implemented with only the standard library; months later you realize you need to optimize a query and something like `df.query` would take your team years to get partially working, so you just dump everything into a database and add SQL vulnerabilities and a whole dict/object layer, and all you needed was a join and merge that you *could* just do with coreutils join and merge (but that would lose type safety and unescaped newlines would then be as dangerous as unparametrized SQL queries). And then it's time to write docs for what we did here.
  - De/serialization (`to_numpy`, `to_parquet`, `to_json`, `to_csv`, `to_html`, `to_markdown`)
- How:
  - Add a dependency on an external library that needs to be compiled or installed and kept upgraded.
  - Load the data into a DataFrame
  - Read the docs and use the API; which supports lookup by value, by integer-position (`.iloc`), by complex chained queries, etc.

# Pandas DataFrame reference

- There are many excellent Pandas tutorials.
- There was a docs sprint.
- `Series` have an `.index` and no `.columns` (only one column; more like `dict`)
- `DataFrames` have an `.index` and a `.columns`
- There are a number of ways to load a dict into a DataFrame and lookup values:

```python
df = pd.DataFrame.from_records([odict])
display(df)
assert list(df.index) == [0]
assert list(df.columns) == list('abcd')
assert list(df.iloc[0]) == list('ABCD')
assert list(df.iloc[0].index) == list(df.columns) == list('abcd')
assert list(df.iloc[0]) == list(df.loc[0])
assert list(df.loc[0].index) == list(df.columns) == list('abcd')
assert list(df.loc[0]) == list('ABCD')
assert df.iloc[0]['a'] == 'A' == df.loc[0]['a']
```

       a  b  c  d
    0  A  B  C  D

```python
df = pd.DataFrame.from_records(list(odict.items()))
display(df)
assert list(df.index) == [0, 1, 2, 3]
assert list(df.columns) == [0, 1]
assert list(df.iloc[0]) == list(df.loc[0])
assert list(df.iloc[0]) == ['a', 'A']
assert list(df.iloc[0].index) == list(df.columns) == [0, 1]
```

       0  1
    0  a  A
    1  b  B
    2  c  C
    3  d  D

```python
with pytest.raises(ValueError):
    df = pd.DataFrame.from_dict(odict)
```

```python
df = pd.DataFrame.from_dict(odict.items())
display(df)
assert list(df.index) == [0, 1, 2, 3]
assert list(df.columns) == [0, 1]
assert df[0][0] == 'a' == df.iloc[0].iloc[0]
```

       0  1
    0  a  A
    1  b  B
    2  c  C
    3  d  D

```python
df = pd.DataFrame.from_dict(odict, orient='index')
display(df)
assert list(df.index) == list('abcd')
assert list(df.columns) == [0]
assert df.loc['a'][0] == 'A' == df.iloc[0][0]
```

       0
    a  A
    b  B
    c  C
    d  D

```python
df.columns = ['letter']
display(df)
assert df.loc['a']['letter'] == 'A' == df.iloc[0][0] == df.loc['a'].iloc[0] == df.iloc[0].loc['letter']
```

      letter
    a      A
    b      B
    c      C
    d      D

```python
df.index = ['red', 'green', 'blue', 'orange']
display(df)
```

           letter
    red         A
    green       B
    blue        C
    orange      D

```python
assert type(df.iloc[0][0]) == str
df.iloc[0][0] = dict.fromkeys('abc')
assert type(df.iloc[0][0]) == dict
display(df)
```

                                       letter
    red     {'a': None, 'b': None, 'c': None}
    green                                   B
    blue                                    C
    orange                                  D

On Fri, Jul 31, 2020 at 1:35 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Thu, 30 Jul 2020 at 23:42, Steven D'Aprano <steve@pearwood.info> wrote:
Of course anything is possible in theory, but I think that would require shifting the existing keys and values in the array, and updating the entries in the hash table to point to their key in the new position, and probably other complications I haven't thought of.
Also list must shift the internal array. You don't have to update the hashtable. For example, when you pop an item, the key is not removed from the hashtable completely, but it's marked as dummy.
I think the main problem here is set, since set is not ordered. In theory, dict views could be an "ordered" set, so you can do:
d.keys()[5]
d.values()[1:]
d.items()[3] = "omg"
In any case, if people want to propose turning dicts into fully-fledged
reorderable, slicable sequences as well as mappings, they will probably need a PEP :-)
I'm just brainstorming. Not sure about the real use-case.
On Fri, Jul 31, 2020, 2:48 AM Wes Turner <wes.turner@gmail.com> wrote:
# Dicts and DataFrames - Src: https://github.com/westurner/pythondictsanddataframes/blob/master/dicts_and_... - Binder: https://mybinder.org/v2/gh/westurner/pythondictsanddataframes/master?filepat... - (interactive Jupyter Notebook hosted by https://mybinder.org/ )
The punchline of Wes Turner's notebook (very well put together, thank you!) seems to partly be that if you find yourself wanting to work with the position of items in a dict, you might want to consider using a pandas.Series (with its .iloc method).

A difficulty that immediately came to mind with this advice is type hinting support. I was just googling yesterday for "how to type hint using pandas" and the only thing I found is to use pd.Series and pd.DataFrame directly. But those don't support type hinting comparable to:

Dict[str, float]

Or:

class Vector(TypedDict):
    i: float
    j: float

This is a big downside of the advice "just use pandas". Although I love using pandas and use it all the time.
So maybe we need to add dict.ordered() which returns a view on the items that is a Sequence rather than a set? Or ordereditems(), orderedkeys() and orderedvalues()? On Fri, Jul 31, 2020 at 05:29 Ricky Teachey <ricky@teachey.org> wrote:
On Fri, Jul 31, 2020, 2:48 AM Wes Turner <wes.turner@gmail.com> wrote:
# Dicts and DataFrames - Src: https://github.com/westurner/pythondictsanddataframes/blob/master/dicts_and_... - Binder: https://mybinder.org/v2/gh/westurner/pythondictsanddataframes/master?filepat... - (interactive Jupyter Notebook hosted by https://mybinder.org/ )
The punchline of Wes Turner's notebook (very well put together, thank you!) seems to partly be that if you find yourself wanting to work with the position of items in a dict, you might want to consider using a pandas.Series (with its .iloc method).
A difficulty that immediately came to mind with this advice is type hinting support. I was just googling yesterday for "how to type hint using pandas" and the only thing I found is to use pd.Series and pd.DataFrame directly.
But those don't support type hinting comparable to:
Dict[str, float]
Or:
class Vector(TypedDict):
    i: float
    j: float
This is a big downside of the advice "just use pandas". Although I love using pandas and use it all the time.
-- --Guido (mobile)
On Fri, Jul 31, 2020 at 10:31 AM Guido van Rossum <guido@python.org> wrote:
So maybe we need to add dict.ordered() which returns a view on the items that is a Sequence rather than a set? Or ordereditems(), orderedkeys() and orderedvalues()?
I for one would be +1 on this idea.

On the bike shed hue: oitems(), okeys(), and ovalues() would be much shorter, and shouldn't be too difficult to read once they get established (though I am sure many will differ on this point).

---
Ricky.

"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
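None of dict.ordered(), okeys(), or keyseq() exists; as a rough sketch of what such a Sequence view might look like, here is a hypothetical wrapper that implements the Sequence ABC over a dict's keys, using islice so a single lookup doesn't copy the whole dict:

```python
from collections.abc import Sequence
from itertools import islice

class KeySequenceView(Sequence):
    """Hypothetical read-only, order-preserving Sequence view of a dict's keys (sketch only)."""
    def __init__(self, d):
        self._d = d

    def __len__(self):
        return len(self._d)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # negative step is not supported here (islice limitation)
            return list(islice(self._d, *i.indices(len(self._d))))
        if i < 0:
            i += len(self._d)
        try:
            return next(islice(self._d, i, None))
        except (StopIteration, ValueError):
            raise IndexError(i) from None

d = {x: x.upper() for x in 'abcd'}
keys = KeySequenceView(d)
assert keys[0] == 'a' and keys[-1] == 'd'
assert keys[1:3] == ['b', 'c']
assert keys.index('c') == 2   # the Sequence mixin provides .index() and .count() for free
```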
+1 on (lazy) Sequence views optimized for sequential reads and random access (which hide the dict implementation, which varies across which platforms?)

I'm somewhat indifferent on the name:

# Sequence views names:

## method() -> Sequence
- orderedkeys(), orderedvalues(), ordereditems()
- okeys(), ovalues(), oitems()
- keys(), keysordered(), values(), valuesordered(), items(), itemsordered()
  - What does this do with tab-completion?

# Get key, value, item by integer-location method names:

## method(start: int, stop: int, step: int) -> Any
- keys_iloc(), values_iloc(), items_iloc()

## __getitem__(slice_or_int: Union[slice, int]) -> Any
- keys[0], values.iloc[1:4:2], items.iloc[1:4, 2:3]
- keys_iloc[], values_iloc[], items_iloc[]
- keysiloc[], valuesiloc[], itemsiloc[]
- keys.iloc[], values.iloc[], items.iloc[]
- keypos(),

## method(index: int) -> Any:
- getkey(), getvalue(), getitem()

# Get integer-location of a key, of a value:

## method(value) -> int, method(value) ->
- index(value)
  - this is compatible with list.index and str.index but incompatible with Series/Dataframe.index (which may or may not be irrelevant)
- orderedkeys.index(), orderedvalues.index()
- okeys.index(), ovalues.index()
- keys.ordered.index(), values.ordered.index()
- getkeypos(), getvaluepos()
- getkeyiloc(), getvalueiloc()

...

Thoughts:
- what did I miss?
- is there an opportunity to optimize performance of e.g. getvaluepos() / values.ordered.index()?
- performance considerations between islice(dict_view) and Sequence.__getitem__?

On Fri, Jul 31, 2020 at 11:17 AM Ricky Teachey <ricky@teachey.org> wrote:
On Fri, Jul 31, 2020 at 10:31 AM Guido van Rossum <guido@python.org> wrote:
So maybe we need to add dict.ordered() which returns a view on the items that is a Sequence rather than a set? Or ordereditems(), orderedkeys() and orderedvalues()?
I for one would be +1 on this idea.
On the bike shed hew: oitems(), okeys(), and ovalues() would be much shorter, and shouldn't be too difficult to read once they get established (though i am sure many will differ on this point).
--- Ricky.
"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
On Fri, Jul 31, 2020 at 11:55 AM Wes Turner <wes.turner@gmail.com> wrote:
+1 on (lazy) Sequence views optimized for sequential reads and random access (which hide the dict implementation which varies across which platforms?)
I'm somewhat indifferent on the name:
# Sequence views names:
## method() -> Sequence
- orderedkeys(), orderedvalues(), ordereditems()
- okeys(), ovalues(), oitems()
- keys(), keysordered(), values(), valuesordered(), items(), itemsordered()
  - What does this do with tab-completion?
Perhaps add these to the list of possibilities also....? keyseq(), valueseq(), itemseq() -- Ricky. "I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
keyseq(), valueseq(), itemseq()
Yeah, that's a good one On Fri, Jul 31, 2020, 12:18 PM Ricky Teachey <ricky@teachey.org> wrote:
On Fri, Jul 31, 2020 at 11:55 AM Wes Turner <wes.turner@gmail.com> wrote:
+1 on (lazy) Sequence views optimized for sequential reads and random access (which hide the dict implementation which varies across which platforms?)
I'm somewhat indifferent on the name:
# Sequence views names:
## method() -> Sequence
- orderedkeys(), orderedvalues(), ordereditems()
- okeys(), ovalues(), oitems()
- keys(), keysordered(), values(), valuesordered(), items(), itemsordered()
  - What does this do with tab-completion?
Perhaps add these to the list of possibilities also....?
keyseq(), valueseq(), itemseq()
-- Ricky.
"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
Is it not more simple to add some sequence methods to the dict views (if there's a real need)? If you do tuple(dict.keys()), you get the sequence of keys in the same insertion order of the dict. It seems to me that the duck already exists and quacks.
On Fri, Jul 31, 2020 at 3:52 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
Is it not more simple to add some sequence methods to the dict views (if there's a real need)? If you do tuple(dict.keys()), you get the sequence of keys in the same insertion order of the dict. It seems to me that the duck already exists and quacks.
The problem with that is these view objects are in fact NOT ducks (sequences). They are gooses (sets). Gooses don't quack. They honk. --- Ricky. "I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
On Sat, Aug 1, 2020 at 6:07 AM Ricky Teachey <ricky@teachey.org> wrote:
On Fri, Jul 31, 2020 at 3:52 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
Is it not more simple to add some sequence methods to the dict views (if there's a real need)? If you do tuple(dict.keys()), you get the sequence of keys in the same insertion order of the dict. It seems to me that the duck already exists and quacks.
The problem with that is these view objects are in fact NOT ducks (sequences). They are gooses (sets). Gooses don't quack. They honk.
They are guaranteed to iterate in the same order as the dict, though, unless I'm mistaken. There's been a weaker guarantee for a long time (that iteration order for keys/values/items is consistent and will not change without a dict mutation), and I can't find anywhere that it's now guaranteed to be the same as dict order, but I would be very surprised if list(d) was different from list(d.keys()). So, constructing a tuple or list from the keys or items WILL give you a sequence. Whether that's good enough is another question. For instance, it requires taking a full copy of the dict, even if all you want is a single element from it. ChrisA
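A concrete illustration of that trade-off, using only current Python: materializing the items view gives real indexing and slicing, at the cost of copying one reference per item:

```python
d = {x: x.upper() for x in 'abcd'}

items = tuple(d.items())     # O(n) copy, but now a real Sequence
assert items[1] == ('b', 'B')
assert items[-2:] == (('c', 'C'), ('d', 'D'))
assert list(d) == [k for k, _ in items]   # same order as the dict itself
```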
On Fri, 31 Jul 2020 at 22:14, Chris Angelico <rosuav@gmail.com> wrote:
So, constructing a tuple or list from the keys or items WILL give you a sequence.
Yes. Since now dicts are ordered by insertion, also keys, values and items are ordered the same way. It seems to me more simple to add some sequence methods to dict views, like subscript, slicing and index(), instead of creating other 3 methods and 3 data types. What I have not understood well is when you need to index a dict by position or slice it.
On Fri, Jul 31, 2020 at 2:56 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Fri, 31 Jul 2020 at 22:14, Chris Angelico <rosuav@gmail.com> wrote:
So, constructing a tuple or list from the keys or items WILL give you a sequence.
Yes. Since now dicts are ordered by insertion, also keys, values and items are ordered the same way.
It seems to me more simple to add some sequence methods to dict views, like subscript, slicing and index(), instead of creating other 3 methods and 3 data types.
Maybe it is common in numpy and pandas to keep adding operations to the same object until it breaks, but the key and items views already implement the Set ABC, and I'd rather refrain from having them *also* implement the Sequence ABC. For one thing, when they want to pass one of those views as an argument to another function *as a set or as a sequence*, this would force the user to show the reader what they meant, and this will make the code more readable. It would also clarify whether some other mapping supports the same operation (the presence of the new method would indicate it).
What I have not understood well is when you need to index a dict by position or slice it.
I'm guessing that indexing by 0, if it were possible, would be a convenient idiom to implement the "first item" operation that has been requested numerous times (at least for dicts). Slicing would be useful to get the first N items of a huge dict without materializing the full list of items as a list object, which brought Chris B to request this in the first place. The dict would have to be huge for this materialization to matter, but Chris B did encounter the situation. What I don't understand yet is how *frequently* the latter operation would be useful -- if it's infrequently, Chris B's own solution using islice() on the items() view looked pretty decent to me, and not that hard to come up with for someone who made it that far. For the former I expect that sooner or later someone will write a PEP and it will be accepted (assuming the PEP doesn't overreach). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Ok... I was wrong. The array of items has a NULL key and value if an item is deleted. Sorry for the confusion.

On Sat, 1 Aug 2020 at 01:09, Guido van Rossum <guido@python.org> wrote:
the key and items views already implement the Set ABC, and I'd rather refrain from having them *also* implement the Sequence ABC.
I'm not for or against the proposal, but maybe count() is not useful at all for keys and items. Furthermore, dict implements __reversed__, which is not a Mapping method but a Sequence method. I think it's useful to implement methods of other APIs without changing their names or implementing the full other API... if it's useful ^^ But maybe I'm missing something in the general picture.
On Fri, Jul 31, 2020 at 04:08:43PM -0700, Guido van Rossum wrote:
Maybe it is common in numpy and pandas to keep adding operations to the same object until it breaks, but the key and items views already implement the Set ABC, and I'd rather refrain from having them *also* implement the Sequence ABC.
+1 I'm not a huge fan of pandas' APIs, I don't think that "pandas does it this way" is something we ought to emulate.
I'm guessing that indexing by 0, if it were possible, would be a convenient idiom to implement the "first item" operation that has been requested numerous times (at least for dicts).
Indeed, that seems to be the only use-case which seems to be even remotely common. `dict.popitem` would do it, of course, but it also mutates the dict. The other simple solution is `next(iter(mydict.items()))`. The bottom line here seems to me, as far as I can tell, is that being able to fetch a key and/or value by index is of only marginal use.
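Both current idioms side by side (standard library only), including the default-argument form of next() that sidesteps the StopIteration concern raised in the reply below:

```python
d = {'a': 1, 'b': 2}

first = next(iter(d.items()))                 # ('a', 1); raises StopIteration if d is empty
first_or_none = next(iter(d.items()), None)   # a default instead of an exception for an empty dict
assert first == first_or_none == ('a', 1)
assert next(iter({}.items()), None) is None

# d.popitem() also returns a single item, but it mutates the dict and pops the
# *last* item (LIFO order), so it is not a read-only "first item" operation.
```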
Slicing would be useful to get the first N items of a huge dict without materializing the full list of items as a list object, which brought Chris B to request this in the first place.
The first request in this thread was from Hans: https://mail.python.org/archives/list/python-ideas@python.org/message/S7UMTW... He is using a dict to hold an array of columns indexed by name `{column_name: column}` and wanted to re-order and insert columns at arbitrary positions. -- Steven
On Fri, Jul 31, 2020 at 7:59 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jul 31, 2020 at 04:08:43PM -0700, Guido van Rossum wrote: [...]
I'm guessing that indexing by 0, if it were possible, would be a convenient idiom to implement the "first item" operation that has been requested numerous times (at least for dicts).
Indeed, that seems to be the only use-case which seems to be even remotely common. `dict.popitem` would do it, of course, but it also mutates the dict.
The other simple solution is `next(iter(mydict.items()))`.
That one always makes me uncomfortable, because the StopIteration it raises when the dict is empty might be misinterpreted. Basically I never want to call next() unless there's a try...except StopIteration: around it, and that makes this a lot less simple.
The bottom line here seems to me, as far as I can tell, is that being able to fetch a key and/or value by index is of only marginal use.
Agreed.
Slicing would be useful to get the first N items of a huge dict without materializing the full list of items as a list object, which brought Chris B to request this in the first place.
The first request in this thread was from Hans:
https://mail.python.org/archives/list/python-ideas@python.org/message/S7UMTW...
He is using a dict to hold an array of columns indexed by name `{column_name: column}` and wanted to re-order and insert columns at arbitrary positions.
A bare dict is just not the data structure for that problem. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Fri, Jul 31, 2020 at 08:08:58PM -0700, Guido van Rossum wrote:
The other simple solution is `next(iter(mydict.items()))`.
That one always makes me uncomfortable, because the StopIteration it raises when the dict is empty might be misinterpreted. Basically I never want to call next() unless there's a try...except StopIteration: around it, and that makes this a lot less simple.
Acknowledged. But there are ways to solve that which perhaps aren't as well known as they should be.

* Use a default: `next(iter(mydict.items()), MISSING)`
* Use a helper to convert StopIteration to something else.

Some years ago, someone (I think it was Nick Coghlan?) proposed a standard solution for this issue, a context manager + decorator function that guarded against a specific exception. Nothing much came of it, but I did experiment with the idea, and got something which you could use like this:

    with exception_guard(StopIteration):
        first = next(iter(mydict.items()))

or like this:

    safenext = exception_guard(StopIteration)(next)
    first = safenext(iter(mydict.items()))

I think that would be a good tool for the functools library, but I acknowledge that even if standard, it would be a little too obscure for most people thinking "I need the first key from this dict".

--
Steven
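The exception_guard helper itself isn't shown in the thread; a minimal sketch consistent with how it is described later in this thread (re-raises as RuntimeError by default, accepts catch=/throw= keywords, works as a context manager or a decorator) might look like this:

```python
from contextlib import ContextDecorator

class exception_guard(ContextDecorator):
    """Sketch of the guard described here: convert `catch` into `throw` (RuntimeError by default)."""
    def __init__(self, catch, throw=RuntimeError):
        self.catch = catch
        self.throw = throw

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None and issubclass(exc_type, self.catch):
            raise self.throw(f"guarded exception: {exc!r}") from exc
        return False  # any other exception propagates unchanged

mydict = {}
safenext = exception_guard(StopIteration)(next)   # decorator usage
try:
    first = safenext(iter(mydict.items()))
except RuntimeError:
    print("empty dict")

mynext = exception_guard(catch=StopIteration, throw=ValueError)(next)
```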
On Sat, Aug 1, 2020 at 1:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Some years ago, someone (I think it was Nick Coghlan?) proposed a standard solution for this issue, a context manager + decorator function that guarded against a specific exception. Nothing much came of it, but I did experiment with the idea, and got something which you could use like this:
with exception_guard(StopIteration): first = next(iter(mydict.items()))
My understanding of this is that 'first' is unassigned if StopIteration happens.
or like this:
safenext = exception_guard(StopIteration)(next) first = safenext(iter(mydict.items()))
My understanding of this is that I am confused. What does safenext return? ChrisA
On Sat, Aug 01, 2020 at 02:08:08PM +1000, Chris Angelico wrote:
On Sat, Aug 1, 2020 at 1:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Some years ago, someone (I think it was Nick Coghlan?) proposed a standard solution for this issue, a context manager + decorator function that guarded against a specific exception. Nothing much came of it, but I did experiment with the idea, and got something which you could use like this:
with exception_guard(StopIteration): first = next(iter(mydict.items()))
My understanding of this is that 'first' is unassigned if StopIteration happens.
Sorry for not being more explicit about what was going on. I was stuck in my own head and didn't consider that others might not recall the discussion from all those many years ago, mea culpa. The exception guard doesn't merely catch and discard exceptions. It re-raises with a new exception, RuntimeError by default.
or like this:
safenext = exception_guard(StopIteration)(next) first = safenext(iter(mydict.items()))
My understanding of this is that I am confused. What does safenext return?
Nothing; it raises RuntimeError. -- Steven
On Sat, Aug 1, 2020 at 3:50 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Aug 01, 2020 at 02:08:08PM +1000, Chris Angelico wrote:
On Sat, Aug 1, 2020 at 1:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
Some years ago, someone (I think it was Nick Coghlan?) proposed a standard solution for this issue, a context manager + decorator function that guarded against a specific exception. Nothing much came of it, but I did experiment with the idea, and got something which you could use like this:
with exception_guard(StopIteration): first = next(iter(mydict.items()))
My understanding of this is that 'first' is unassigned if StopIteration happens.
Sorry for not being more explicit about what was going on. I was stuck in my own head and didn't consider that others might not recall the discussion from all those many years ago, mea culpa.
The exception guard doesn't merely catch and discard exceptions. It re-raises with a new exception, RuntimeError by default.
or like this:
safenext = exception_guard(StopIteration)(next) first = safenext(iter(mydict.items()))
My understanding of this is that I am confused. What does safenext return?
Nothing; it raises RuntimeError.
Ahh, okay. So it's safe in the sense that it can't accidentally leak a confusing exception. Unfortunately a lot of people are going to assume that it means "will always give a useful return value". Definitely worth being very very clear about the semantics. And it's really not the ideal semantics here anyway. What you actually want is ValueError if the dict is empty. There's no easy way to spell that generically, so it'd want a helper function, which means it would probably do well as a method. ChrisA
On Sat, Aug 01, 2020 at 03:57:34PM +1000, Chris Angelico wrote:
Ahh, okay. So it's safe in the sense that it can't accidentally leak a confusing exception. Unfortunately a lot of people are going to assume that it means "will always give a useful return value". Definitely worth being very very clear about the semantics.
What you name the function is up to you :-)
And it's really not the ideal semantics here anyway. What you actually want is ValueError if the dict is empty.
Your wish is my command: mynext = exception_guard(catch=StopIteration, throw=ValueError)(next) -- Steven
On Sat, Aug 1, 2020 at 12:40 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jul 31, 2020 at 08:08:58PM -0700, Guido van Rossum wrote:
The other simple solution is `next(iter(mydict.items()))`.
That one always makes me uncomfortable, because the StopIteration it raises when the dict is empty might be misinterpreted. Basically I never want to call next() unless there's a try...except StopIteration: around it, and that makes this a lot less simple.
Acknowledged. But there are ways to solve that which perhaps aren't as well known as they should be.
* Use a default: `next(iter(mydict.items()), MISSING)`
* Use a helper to convert StopIteration to something else.
There is a most simple solution:

* `[first] = mydict.items()`, or `first, = mydict.items()`

Anyway, should we add some tools to itertools, instead of "itertools recipes"?

* `first(iterable, default=None)` -- same as `[first] = iterable`, but returns the default value instead of raising ValueError when the iterable is empty.
* `nth(iterable, n, default=None)`
* `consume(iterator, n=None)`

Regards,
--
Inada Naoki <songofacandy@gmail.com>
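For reference, first() ships in more-itertools, and nth()/consume() appear among the itertools documentation recipes; condensed recipe-style versions (not an actual itertools API) are only a line or two each:

```python
from itertools import islice

def first(iterable, default=None):
    """First element, or default if the iterable is empty (like more_itertools.first with a default)."""
    return next(iter(iterable), default)

def nth(iterable, n, default=None):
    """n-th element (0-based), or default if the iterable is too short."""
    return next(islice(iterable, n, None), default)

mydict = {'a': 1, 'b': 2, 'c': 3}
assert first(mydict.items()) == ('a', 1)
assert nth(mydict.items(), 2) == ('c', 3)
assert nth(mydict.items(), 99, default='missing') == 'missing'
assert first({}) is None
```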
[Steven D'Aprano <steve@pearwood.info>]
.... The other simple solution is `next(iter(mydict.items()))`.
[Guido]
That one always makes me uncomfortable, because the StopIteration it raises when the dict is empty might be misinterpreted. Basically I never want to call next() unless there's a try...except StopIteration: around it, and that makes this a lot less simple.
Last time this came up, this appeared to reach near-consensus:

"""
exactly what more-itertools has supplied for years already :-) If the iterable is empty/exhausted, by default ValueError is raised, but that can be overridden by also passing an optional argument to return instead (like dict.pop() in this respect). So, e.g.,

first([42]) returns 42
first([]) raises ValueError
first([], 42) and first([], default=42) return 42

I don't think it belongs in the builtins. It doesn't perfectly fit anywhere, but given its prior history in the more-itertools and itertoolz packages, Python's itertools package seems the least annoying ;-) home for it.
"""
On Fri, Jul 31, 2020 at 10:59 PM Steven D'Aprano <steve@pearwood.info> wrote:
[...]
The first request in this thread was from Hans:
https://mail.python.org/archives/list/python-ideas@python.org/message/S7UMTW...
He is using a dict to hold an array of columns indexed by name `{column_name: column}` and wanted to re-order and insert columns at arbitrary positions.
Pandas solves for columnar data. SQL is one source and destination for columnar data which Pandas supports. Pandas handles NULL/None/NaN more consistently than dict.get("key", None).

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#return...

    assert df.columns.tolist() == ['a', 'b', 'c']

    # this creates a copy
    df2 = df[['b', 'c', 'a']]

    # this doesn't create a copy
    df.reindex(columns=['b', 'c', 'a'])

    # this inserts a column after column b
    df.insert(df.columns.get_loc('b'), 'newcolumn', df['c'])
    df.insert(df.columns.tolist().index('b'), 'newcolumn2', df['c'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame....

If you need to reorder rows (or columns transposed (df.T)), you could select with .loc[[list, of, indexvalues]] or .iloc[[list, of, ints]]

http://www.datasciencemadesimple.com/reindex-python-change-order-column-rows...

... To accomplish the same with the Python standard library, you'd need to create a data structure that is an amalgamation of list and dict: a MutableMapping with a Sequence interface for at least .keys() (or .keyseq()). Is there already a "pure python" Dataframe?

Do whatever you want, but the Pandas DataFrame API is also available in Dask, Modin, and CuDF; for distributed and larger-than-RAM use.
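(As a rough illustration of what such an amalgamation might look like in pure Python -- the class name IndexedDict and the key_at()/insert() methods are invented here, and every positional operation is O(n):)

    from collections.abc import MutableMapping

    class IndexedDict(MutableMapping):
        """Hypothetical dict/list amalgam: access by key, plus positional access and insertion by index."""

        def __init__(self, *args, **kwargs):
            self._keys = []   # insertion-ordered keys
            self._data = {}   # key -> value
            self.update(dict(*args, **kwargs))

        def __getitem__(self, key):
            return self._data[key]

        def __setitem__(self, key, value):
            if key not in self._data:
                self._keys.append(key)
            self._data[key] = value

        def __delitem__(self, key):
            del self._data[key]
            self._keys.remove(key)          # O(n)

        def __iter__(self):
            return iter(self._keys)

        def __len__(self):
            return len(self._keys)

        # The positional extras a "column container" would want:
        def key_at(self, index):
            return self._keys[index]

        def insert(self, index, key, value):
            if key in self._data:
                raise ValueError(f"key {key!r} already present")
            self._keys.insert(index, key)   # O(n)
            self._data[key] = value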
On Fri, Jul 31, 2020 at 2:56 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
Yes. Since now dicts are ordered by insertion, also keys, values and items are ordered the same way.
Preservation of order was codified in version 3.7 (? at some point, anyway). And while it may not explicitly say that the views will present the same order, I haven't heard anyone complain about that in this discussion. So:

* dicts preserve order
* the dict_views preserve this same order

Some think that we can't call them "ordered", because somehow that implies other things, like the ability to insert, etc, but while I don't get that argument, fine, let's call them "order preserving".

So in terms of what is defined by the language, and what is technically possible, dict views could be indexed and sliced with clear semantics. Most other Sequence operations are not possible, so no, dicts are not Sequences, and the dict views aren't either. So we won't register them with the Sequence ABC, no problem.

So why not add this? From what I can tell from this thread, there are these reasons:

1) Because adding ANYTHING new is taking on a commitment to preserve it in the future, and it's more code to write, maintain, and document. So regardless of any other argument, it shouldn't be added without a reasonable use case(s) -- what's reasonable is subjective, of course.

2) Because it could be an "attractive nuisance": while dicts are order preserving, their internal structure is such that you cannot find the nth item without iterating through n items, making access O(n), rather than O(1) for access by key or access of the usual Sequences. Folks expect numerical indexing to be O(1), so it could be tricky. However, the only other way to get an n'th item is to copy everything into a Sequence, which is a lot slower, or to use islice or next() to do the order-N access by hand. So this is an attractive nuisance in the use case of wanting to take multiple items by index, in which case making a sequence first would be better than directly accessing the dict_view object.

3) dicts are not Sequences, they are Mappings, so they shouldn't have Sequence features. dict_views are Sets, not Sequences, so they shouldn't have Sequence features.

Point 3) I may have misrepresented; it was explained in a lot more detail, but I'm pretty sure that's what it comes down to. But I'm going to talk about my understanding of the point, which is pretty much how I wrote it. If I really did misrepresent it, then feel free to kibitz some more....

It seems like I have a different philosophy about typing and duck typing than some in this conversation. I appreciate the ABCs, and that they clearly nail down what types need to be in order to "BE" one of the ABCs -- and see how this is useful. But I don't see it as restrictive. An object is not a Sequence unless it fully conforms to the Sequence ABC, sure. But that doesn't mean an object can't have some, but not all, of the behavior of a Sequence. In this example, implementing integer indexing and slicing on dict_views would not make them a Sequence, but that doesn't mean you can't do it.

It's also the case that any type can BE a particular ABC, and still have other functionality. So, in this case, the dict_views are Sets, but we can add other things to them of course.

So on to "duck typing" -- the term comes from the aphorism: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably *is* a duck" (https://en.wikipedia.org/wiki/Duck_test).

So that pretty much maps to ABCs: the ABC for a duck specifies what a duck is -- it's something that looks, swims, and quacks like a duck.
But I've always taken a more flexible view on duck typing -- if I want to know if it's a duck, yes, I need the ABC. But I don't usually care if it's a duck -- I care if it does the one or two things I need it to do. So if I need a quacker, then anything that quacks like a duck is fine with me, even if it can't swim.

Bringing this back to this concrete example: one of the use cases brought up was being able to choose a random item from a dict. You can't pass a dict (or dict_views) to random.choice. But not because a dict isn't a Sequence, but because they aren't subscriptable:

    In [14]: random.choice(d.keys())
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-14-76483ebbc518> in <module>
    ----> 1 random.choice(d.keys())

    ~/miniconda3/envs/py3/lib/python3.8/random.py in choice(self, seq)
        289         except ValueError:
        290             raise IndexError('Cannot choose from an empty sequence') from None
    --> 291         return seq[i]
        292
        293     def shuffle(self, x, random=None):

    TypeError: 'dict_keys' object is not subscriptable

Everyone on this thread knows this, but to be explicit, anything with a length and that is subscriptable with integers between 0 and that length can be used with random.choice. I think this is the minimal object that works:

    In [23]: class Dummy:
        ...:     def __len__(self):
        ...:         return 1000
        ...:     def __getitem__(self, idx):
        ...:         return idx
        ...:

    In [24]: d = Dummy()

    In [25]: random.choice(d)
    Out[25]: 194

THIS is what I think is the key to Python's dynamic, duck typing, and it's been there from the beginning, and I really like it. There's been a lot of movement lately toward static typing, but I sure hope it doesn't get enforced in the standard library in places like these.

So: I really don't understand argument (3) above, it is at odds with what I like about Python.

So back to this feature request: I can understand argument (2) -- the attractive nuisance -- but I don't think it's a big deal. We simply can't avoid having less-than-optimal ways to do certain things; as long as you can still easily wrap list() or tuple() around the dict views when you need that, I don't see a problem.

Which brings us to (1) -- is this a useful enough feature to justify the churn? And the answer to that is probably not. As far as I can tell, there have been only two use cases presented:

* selecting a random key from a dict
* selecting an ordered subset of a dict (a slice)

If that's all we've come up with in this lengthy thread, it's probably not worth it -- after all, both of those can be efficiently accomplished with a little help from itertools.islice and/or next().

Which does bring us to a final point, and the reason why if it were only up to me, I'd probably add indexing to dict_views: I find the explicit use of iter() and next() and itertools.islice() pretty heavyweight, difficult, and, well, ugly, compared to indexing and slicing. Others disagree. But I think we all agree that those tools are less newbie-friendly -- but they should be learned at some point, so maybe that's OK (and there is always the wrap-it-with-a-list approach, which is pretty newbie friendly).

Sorry for the long post that concludes with "nope, we're not going to do this" :-)

What I have not understood well is when you need to index a dict by position or slice it.
see the two use cases above -- that's all we have, but they ARE real use cases, even if they're the only two that have been identified.

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
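(For the record, here is roughly what those two use cases look like with today's tools -- either copy into a list, or walk with islice/next; the non-copy versions are still O(n):)

    import random
    from itertools import islice

    d = {"a": 1, "b": 2, "c": 3, "d": 4}

    # Random key: copy the keys into a real sequence first...
    key = random.choice(list(d))

    # ...or walk to a random position without a full copy (still O(n)).
    key = next(islice(d, random.randrange(len(d)), None))

    # An ordered "slice" of the first two items, without copying the whole dict:
    first_two = dict(islice(d.items(), 2))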
On Fri, Jul 31, 2020 at 4:12 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Fri, Jul 31, 2020 at 2:56 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
Yes. Since now dicts are ordered by insertion, also keys, values and items are ordered the same way.
Preservation of order was codified in version 3.7 (? at some point, anyway). And while it may not explicitly say that the views will present the same order, I haven't heard anyone complain about that in this discussion.
It happened in the implementation in 3.6, and in 3.7 we added it as a guarantee to the language reference. But this means you can count on it since 3.6.
So: dicts preserve order the dict_views preserve this same order
Some think that we can't call them "ordered", because somehow that implies other things, like the ability to insert, etc, but while I don't get that argument, fine, let's call them "order preserving".
Meh, who cares. (The confusion is usually between "ordered" and "sorted" but we're past that here, so let's just keep saying "ordered".) So in terms of what is defined by the language, and what is technically
possible, dict views could be indexed and sliced with clear semantics. Most other Sequence operations are not possible, so no, dicts are not Sequences, and the dict views aren't either. So we won't register them with the Sequence ABC, no problem.
So why not add this? From what I can tell from this thread, there are these reasons:
1) Because adding ANYTHING new is taking on a commitment to preserve it in the future, and it's more code to write, maintain, and document. So regardless of any other argument, it shouldn't be added without a reasonable use case(s) -- what's reasonable is subjective, of course.
2) Because it could be an "attractive nuisance" while dicts are order preserving, their internal structure is such that you cannot find the nth item without iterating through n items, making access O(n), rather than O(1) for access by key or access of the usual Sequences -- Folks expect numerical indexing to be O(1), so it could be tricky. However, the only other way to get an n'th item is to copy everything into a Sequence, which is a lot slower, or to use islice or next() to do the order N access by hand. So this is an attractive nuisance in the use case of wanting to take multiple items by index, in which case, making a sequence first would be better than directly accessing the dict_view object.
If true this would be a huge argument against adding indexing and slicing (other than the special case starting with 0). However, I don't think it's true. The dict implementation (again, starting in 3.6) actually stores the list of keys in a compact array, in the insertion order. The hash table stores indices into this array. Moreover, in order to implement the ordered property, you pretty much have to do this (since the hash table *definitely* isn't going to be ordered :-). So indexing and slicing into this array would seem to be possible, unless I am missing something. I'm CC'ing Inada Naoki because it's his code -- maybe he can enlighten us.
3) dicts are not Sequences, they are Mappings, so they shouldn't have Sequence features. dict_views are Sets, not Sequences, so they shouldn't have Sequence features.
Right, I said this (with more words) in my last email, a few minutes ago. Point 3) I may have misrepresented it, it was explained in a lot more
detail, but I'm pretty sure that's what it comes down to. But I'm going to talk about my understanding of the point, which is pretty much how I wrote it. If I really did misrepresent it, then feel free to kibitz some more....
It seems like I have a different philosophy about typing and duck typing than some in this converstaion. I appreciate the ABCs, and that they clearly nail down what types need to be in order to "BE" one of the ABCs -- and see how this is useful. But I don't see it as restrictive. An object is not a Sequence unless it fully conforms to the Sequence ABC, sure. But that doesn't mean an object can't have some, but not all, of the behavior of a Sequence without having all the rest. In this example, implementing integer indexing and slicing on dict_views would not make them a Sequence, but that doesn't mean you can't do it. It's also the case that any type can BE a particular ABC, and still have other functionality. So, in this case, the dict_views are Sets, but we can add other things to them of course.
And indeed, here our philosophies diverge (see my last email).
So on to "duck typing" -- the term is comes from the aphorism: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably *is* a duck" (https://en.wikipedia.org/wiki/Duck_test).
So that pretty much maps to ABCs: the ABC for a duck specifies what a duck is -- it's something that looks, swims, and quacks like a duck.
But I've always taken a more flexible view on duck typing -- if I want to know if it's a duck, yes, I need the ABC. But I don't usually care if it's a duck -- I care if it does the one or two things I need it to do. So if I need a quacker, then anything that quacks like a duck is fine with me, even if it can't swim.
We all used to think this way, until we discovered that it's *really* hard to tell whether something's a mapping or a sequence if the only operation you care about is `__getitem__`, since they both support that, but with very different semantics. So people started checking for the presence of a `keys` method to tell these apart, and that's really quite appalling. Hence the ABCs. I don't want to get into a similar situation with Set and Sequence, ever. And even though currently they *don't* have overlapping operations, the concrete types `list` and `set` *do* run into this kind of thing for comparison operators -- lists (and tuples) compare itemwise until a difference is found, while sets use comparisons to implement subset/superset tests. The ice really is rather thin here. I don't have anything to add to the rest of your post, so I'll leave that out (but thanks for writing it). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Sat, 1 Aug 2020 at 00:32, Guido van Rossum <guido@python.org> wrote:
If true this would be a huge argument against adding indexing and slicing (other than the special case starting with 0). However, I don't think it's true. The dict implementation (again, starting in 3.6) actually stores the list of keys in a compact array, in the insertion order. The hash table stores indices into this array. Moreover, in order to implement the ordered property, you pretty much have to do this (since the hash table *definitely* isn't going to be ordered :-). So indexing and slicing into this array would seem to be possible, unless I am missing something. I'm CC"ing Inada Naoki because it's his code -- maybe he can enlighten us.
I’m not Inada Naoki :). So happy to be corrected, but I looked into this quite closely.

The dict keys array is compact only *until* you delete an item, at which point a hole is left in the array, which defeats direct lookups (either by indexing or via a view). Order preservation is still kept, as code that iterates over dictionaries knows to just skip the holes.

I’m unsure when/if compaction is called, however. So there may be a technique for forcing the keys to be compacted again. This would cause very unpredictable performance in the case of large dictionaries (a main use-case here).

Steve
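(A small illustration of the point, relying on the CPython implementation detail described above -- the hole is invisible to iteration but would break any naive entries[i] lookup:)

    d = dict.fromkeys("abcde")
    del d["b"]        # leaves a hole in the internal entries array (CPython detail)

    print(list(d))    # ['a', 'c', 'd', 'e'] -- iteration simply skips the hole

    # The second key by iteration order is now 'c', but internally it still sits
    # in entry slot 2, so indexing the entries array directly would hit the hole.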
This is not great news. A solution could be to make dict.ordered() force compaction -- if there are no deleted keys this would be a no-op. We'd have to be able to tell in constant time whether this is the case. But yeah, this would be a dangerous thing in the hands of folks below wizard level. Another solution could be to make dict.ordered() fail if there are deleted keys. But that's not a great experience either. All in all I retract this idea. On Fri, Jul 31, 2020 at 5:31 PM Stestagg <stestagg@gmail.com> wrote:
On Sat, 1 Aug 2020 at 00:32, Guido van Rossum <guido@python.org> wrote:
If true this would be a huge argument against adding indexing and slicing (other than the special case starting with 0). However, I don't think it's true. The dict implementation (again, starting in 3.6) actually stores the list of keys in a compact array, in the insertion order. The hash table stores indices into this array. Moreover, in order to implement the ordered property, you pretty much have to do this (since the hash table *definitely* isn't going to be ordered :-). So indexing and slicing into this array would seem to be possible, unless I am missing something. I'm CC"ing Inada Naoki because it's his code -- maybe he can enlighten us.
I’m not Inada Naoki :). So happy to be corrected, but I looked into this quite closely.
The dict keys is compact only *until* you delete an item, at which point, a hole is left in the array, which defeats direct lookups (either by indexing or via a view). Order preservation is still kept as code that iterates over dictionaries knows to just skip the holes.
I’m unsure when/if compaction is called however. So there may be a technique for forcing the keys to be compacted again. This would cause very unpredictable performance in the case of large dictionaries (a main use-case here)
Steve
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CWQMLX... Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Fri, Jul 31, 2020 at 8:44 PM Guido van Rossum <guido@python.org> wrote:
This is not great news. A solution could be to make dict.ordered() force compaction -- if there are no deleted keys this would be a no-op. We'd have to be able to tell in constant time whether this is the case. But yeah, this would be a dangerous thing in the hands of folks below wizard level. Another solution could be to make dict.ordered() fail if there are deleted keys. But that's not a great experience either.
All in all I retract this idea.
On Fri, Jul 31, 2020 at 5:31 PM Stestagg <stestagg@gmail.com> wrote:
On Sat, 1 Aug 2020 at 00:32, Guido van Rossum <guido@python.org> wrote:
If true this would be a huge argument against adding indexing and slicing (other than the special case starting with 0). However, I don't think it's true. The dict implementation (again, starting in 3.6) actually stores the list of keys in a compact array, in the insertion order. The hash table stores indices into this array. Moreover, in order to implement the ordered property, you pretty much have to do this (since the hash table *definitely* isn't going to be ordered :-). So indexing and slicing into this array would seem to be possible, unless I am missing something. I'm CC"ing Inada Naoki because it's his code -- maybe he can enlighten us.
I’m not Inada Naoki :). So happy to be corrected, but I looked into this quite closely.
The dict keys is compact only *until* you delete an item, at which point, a hole is left in the array, which defeats direct lookups (either by indexing or via a view). Order preservation is still kept as code that iterates over dictionaries knows to just skip the holes.
I’m unsure when/if compaction is called however. So there may be a technique for forcing the keys to be compacted again. This would cause very unpredictable performance in the case of large dictionaries (a main use-case here)
Steve
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CWQMLX... Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/> _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JY2ZJT... Code of Conduct: http://python.org/psf/codeofconduct/
Actually, I think the reverse traversal case is the worst case because: it's not possible to use negative subscripts with islice (because that would require making a full copy). This doesn't work:
islice(dict.keys(), -1, -5)
Reverse traversal did work in Python 2 but was foregone when making .keys() a view in Python 3 in order to avoid lulling users into making usually unnecessary copies.

If it's not possible to support (dict_view||dict_sequence_view).__getitem__(slice_or_index) without causing a performance regression in dict, which directly affects core performance, I'm not sure that this use case is worth it either. There's always the linked-list odict.

There are probably tests somewhere that show what happens when we delete a dict key and value? I'm still somewhat mystified by that, too.

On Fri, Jul 31, 2020 at 8:44 PM Guido van Rossum <guido@python.org> wrote:
This is not great news. A solution could be to make dict.ordered() force compaction -- if there are no deleted keys this would be a no-op. We'd have to be able to tell in constant time whether this is the case. But yeah, this would be a dangerous thing in the hands of folks below wizard level. Another solution could be to make dict.ordered() fail if there are deleted keys. But that's not a great experience either.
All in all I retract this idea.
On Fri, Jul 31, 2020 at 5:31 PM Stestagg <stestagg@gmail.com> wrote:
On Sat, 1 Aug 2020 at 00:32, Guido van Rossum <guido@python.org> wrote:
If true this would be a huge argument against adding indexing and slicing (other than the special case starting with 0). However, I don't think it's true. The dict implementation (again, starting in 3.6) actually stores the list of keys in a compact array, in the insertion order. The hash table stores indices into this array. Moreover, in order to implement the ordered property, you pretty much have to do this (since the hash table *definitely* isn't going to be ordered :-). So indexing and slicing into this array would seem to be possible, unless I am missing something. I'm CC"ing Inada Naoki because it's his code -- maybe he can enlighten us.
I’m not Inada Naoki :). So happy to be corrected, but I looked into this quite closely.
The dict keys is compact only *until* you delete an item, at which point, a hole is left in the array, which defeats direct lookups (either by indexing or via a view). Order preservation is still kept as code that iterates over dictionaries knows to just skip the holes.
I’m unsure when/if compaction is called however. So there may be a technique for forcing the keys to be compacted again. This would cause very unpredictable performance in the case of large dictionaries (a main use-case here)
Steve
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CWQMLX... Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/> _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JY2ZJT... Code of Conduct: http://python.org/psf/codeofconduct/
On Sat, Aug 1, 2020 at 10:04 AM Wes Turner <wes.turner@gmail.com> wrote:
Actually, I think the reverse traversal case is the worst case because: it's not possible to use negative subscripts with islice (because that would require making a full copy).
This doesn't work:
islice(dict.keys(), -1, -5)
Reverse traversal did work in Python 2 but was foregone when making .keys() a view in Python 3 in order to avoid lulling users into making usually unnecessary copies.
dict is reversible now. You can do `islice(dict, 0, 5)`. -- Inada Naoki <songofacandy@gmail.com>
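(Presumably the point is that reversed() -- supported by dicts and dict views since Python 3.8 -- covers the negative-index case without any copy, e.g.:)

    from itertools import islice

    d = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
    last_two_keys = list(islice(reversed(d), 2))            # ['e', 'd']
    last_two_items = list(islice(reversed(d.items()), 2))   # [('e', 5), ('d', 4)]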
We should be reading the source: https://github.com/python/cpython/blob/master/Objects/dictobject.c AFAIU, direct subscripting / addressing was not a use case in the design phase of the current dict? Could a __getitem__(slice_or_int_index) be implemented which just skips over the NULLs? Or would that be no faster than or exactly what islice does when next()'ing through? On Fri, Jul 31, 2020 at 9:14 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sat, Aug 1, 2020 at 10:04 AM Wes Turner <wes.turner@gmail.com> wrote:
Actually, I think the reverse traversal case is the worst case because:
it's not possible to use negative subscripts with islice (because that would require making a full copy).
This doesn't work:
islice(dict.keys(), -1, -5)
Reverse traversal did work in Python 2 but was foregone when making
.keys() a view in Python 3 in order to avoid lulling users into making usually unnecessary copies.
dict is reversible now. You can do `islice(dict, 0, 5)`.
-- Inada Naoki <songofacandy@gmail.com>
On Sat, Aug 1, 2020 at 10:19 AM Wes Turner <wes.turner@gmail.com> wrote:
We should be reading the source: https://github.com/python/cpython/blob/master/Objects/dictobject.c
AFAIU, direct subscripting / addressing was not a use case in the design phase of the current dict?
Could a __getitem__(slice_or_int_index) be implemented which just skips over the NULLs? Or would that be no faster than or exactly what islice does when next()'ing through?
There are two major points to optimize.

* Iterating over `next(islice(dict.items(), n, n+1))` will produce n temporary tuples.
* (CPython implementation detail) dict can detect if there is no hole. Index access is O(1) if there is no hole.

-- Inada Naoki <songofacandy@gmail.com>
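(On the first point: a hypothetical helper that walks the keys instead of the items view avoids building a tuple per step -- a sketch only, not an existing API:)

    from itertools import islice

    def nth_item(d, n):
        """Return the nth (key, value) pair of d in insertion order, in O(n)."""
        key = next(islice(d, n, None))   # raises StopIteration if n >= len(d)
        return key, d[key]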
On Sat, Aug 1, 2020 at 10:19 AM Wes Turner <wes.turner@gmail.com> wrote:
AFAIU, direct subscripting / addressing was not a use case in the design phase of the current dict?
Nope -- if it were, it would have presumably been implemented :-) But order-preserving wasn't really a design goal either, as I understand it, but rather a side effect of the implementation.

As I recall the conversation, in 3.7, when it was made "official", even then it was less about how useful it was than that people WILL use it, and will count on it, even if they are told by the docs that they shouldn't. So we might as well commit to it. And it is indeed handy now and again.

So the current conversation is the result of: once we have order-preserving dicts, maybe we can do a few other things with them -- rather than a use case driving the decision in the first place.

On Fri, Jul 31, 2020 at 6:35 PM Inada Naoki <songofacandy@gmail.com> wrote:
There are two major points to optimize.
* Iterating over `next(islice(dict.items(), n, n+1))` will produce n temporary tuples. * (CPython implementation detail) dict can detect if there is no hole. index access is O(1) if there is no hole.
Any thoughts on how much of a difference these might make? Particularly the first one. The second, of course, won't help when there are holes, which would make performance harder to predict.

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Sat, Aug 1, 2020 at 9:55 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Sat, 1 Aug 2020 at 02:30, Stestagg <stestagg@gmail.com> wrote:
The dict keys is compact only *until* you delete an item, at which point, a hole is left in the array
No, the array of items has no hole. The hole is inserted in the hashtable.
Yes, the array of items has holes. Otherwise, `del d[k]` becomes `O(n)`, or `del d[k]` won't preserve insertion order.

Please teach me if you know any algorithm which has no holes, O(1) deletion, preserves insertion order, and is as efficient and fast as an array.

-- Inada Naoki <songofacandy@gmail.com>
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com> wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
:) About the hole, I was thinking that in theory the problem can be circumvented using a modified version of lookdict. lookdict searches for a key and returns its position in the ma_keys array. I suppose it's possible to do the contrary: search for the index and return the key. What do you think (theoretically speaking)?
On Sat, Aug 1, 2020 at 2:28 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com> wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
I would think the goal here would be to re-order once in a while to remove the holes. But that would take time, of course, so you wouldn't want to do it on every deletion. But when?

One option: maybe too specialized, but it could re-pack the array when an indexing operation is made -- since that operation is O(N) anyway. And that would then address the issue of performance for multiple indexing operations -- if you made a bunch of indexing operations in a row without deleting (which would be the case, if this is an alternative to making a copy in a Sequence first), then the first one would repack the internal array (presumably faster than making a copy) and the rest would have O(1) access.

Given that this use case doesn't appear to be very important, I doubt it's worth it, but it seems it would be possible.

Another thought -- could the re-packing happen whenever the entire dict is iterated through? Though maybe there's no way to know when that's going to happen -- all you get are the individual calls for the next one, yes?
About the hole, I was thinking that in theory the problem can be circumvented using a modified version of lookdict.
lookdict searches for a key and returns its position in the ma_keys array. I suppose it's possible to do the contrary: search for the index and return the key. What do you think (theoretically speaking)?
but isn't searching for the index going to require iterating through the array until you find it? i.e. that O(N) operation we're trying to avoid? -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Sun, Aug 2, 2020 at 2:34 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Aug 1, 2020 at 2:28 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com> wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
I would think the goal here would be to re-order once in a while to remove the holes. But that would take time, of course, so you wouldn't want to do it on every deletion. But when?
One option: maybe too specialized, but it could re-pack the array when an indexing operation is made -- since that operation is O(N) anyway. And that would then address the issue of performance for multiple indexing operations -- if you made a bunch of indexing operation in a row without deleting (which would be the case, if this is an alternative to making a copy in a Sequence first), then the first one would repack the internal array (presumably faster than making a copy) and the rest would have O(1) access.
Repacking is mutation, and mutating a dict while iterating it breaks the iterator. But `d.items()[42]` doesn't look like mutation.
Given that this use case doesn't appear to be very important, I doubt it's worth it, but it seems it would be possible.
Another thought -- could the re-packing happen whenever the entire dict is iterated through? Though maybe there's no way to know when that's going to happen -- all you get are the individual calls for the next one, yes?
You are right. it couldn't. -- Inada Naoki <songofacandy@gmail.com>
Is there any reason that these features couldn't be added to OrderedDict (which is a linked list)? https://github.com/python/cpython/blob/master/Objects/odictobject.c On Sat, Aug 1, 2020, 9:13 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sun, Aug 2, 2020 at 2:34 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Aug 1, 2020 at 2:28 AM Marco Sulla <Marco.Sulla.Python@gmail.com>
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com>
wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
I would think the goal here would be to re-order once in a while to remove the holes. But that would take time, of course, so you wouldn't want to do it on every deletion. But when?
One option: maybe too specialized, but it could re-pack the array when an indexing operation is made -- since that operation is O(N) anyway. And
wrote: that would then address the issue of performance for multiple indexing operations -- if you made a bunch of indexing operation in a row without deleting (which would be the case, if this is an alternative to making a copy in a Sequence first), then the first one would repack the internal array (presumably faster than making a copy) and the rest would have O(1) access.
Repacking is mutation, and mutating dict while iterating it breaks the iterator. But `d.items()[42]` don't looks like mutation.
Given that this use case doesn't appear to be very important, I doubt it's worth it, but it seems it would be possible.
Another thought -- could the re-packing happen whenever the entire dict is iterated through? Though maybe there's no way to know when that's going to happen -- all you get are the individual calls for the next one, yes?
You are right. it couldn't.
-- Inada Naoki <songofacandy@gmail.com> _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/ED2GRW... Code of Conduct: http://python.org/psf/codeofconduct/
https://github.com/python/cpython/blob/master/Objects/odictobject.c :

"""
In order to preserve O(1) performance for node removal (finding nodes), we must do better than just looping through the linked-list. Here are options we've considered:

1. use a second dict to map keys to nodes (a la the pure Python version).
2. keep a simple hash table mirroring the order of dict's, mapping each key to the corresponding node in the linked-list.
3. use a version of shared keys (split dict) that allows non-unicode keys.
4. have the value stored for each key be a (value, node) pair, and adjust __getitem__(), get(), etc. accordingly.

The approach with the least performance impact (time and space) is #2, mirroring the key order of dict's dk_entries with an array of node pointers. While lookdict() and friends (dk_lookup) don't give us the index into the array, we make use of pointer arithmetic to get that index. **An alternative would be to refactor lookdict() to provide the index, explicitly exposing the implementation detail.** We could even just use a custom lookup function for OrderedDict that facilitates our need. However, both approaches are significantly more complicated than just using pointer arithmetic.

The catch with mirroring the hash table ordering is that we have to keep the ordering in sync through any dict resizes. However, that order only matters during node lookup. We can simply defer any potential resizing until we need to do a lookup.
"""

On Sat, Aug 1, 2020 at 10:06 PM Wes Turner <wes.turner@gmail.com> wrote:
Is there any reason that these features couldn't be added to OrderedDict (which is a linked list)? https://github.com/python/cpython/blob/master/Objects/odictobject.c
On Sat, Aug 1, 2020, 9:13 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sun, Aug 2, 2020 at 2:34 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Aug 1, 2020 at 2:28 AM Marco Sulla <
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com>
wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
I would think the goal here would be to re-order once in a while to remove the holes. But that would take time, of course, so you wouldn't want to do it on every deletion. But when?
One option: maybe too specialized, but it could re-pack the array when an indexing operation is made -- since that operation is O(N) anyway. And
Marco.Sulla.Python@gmail.com> wrote: that would then address the issue of performance for multiple indexing operations -- if you made a bunch of indexing operation in a row without deleting (which would be the case, if this is an alternative to making a copy in a Sequence first), then the first one would repack the internal array (presumably faster than making a copy) and the rest would have O(1) access.
Repacking is mutation, and mutating dict while iterating it breaks the iterator. But `d.items()[42]` don't looks like mutation.
Given that this use case doesn't appear to be very important, I doubt it's worth it, but it seems it would be possible.
Another thought -- could the re-packing happen whenever the entire dict is iterated through? Though maybe there's no way to know when that's going to happen -- all you get are the individual calls for the next one, yes?
You are right. it couldn't.
-- Inada Naoki <songofacandy@gmail.com> _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/ED2GRW... Code of Conduct: http://python.org/psf/codeofconduct/
On Sat, Aug 1, 2020 at 6:10 PM Inada Naoki <songofacandy@gmail.com> wrote:
Repacking is mutation, and mutating dict while iterating it breaks the iterator. But `d.items()[42]` don't looks like mutation.
Pardon my ignorance, but IS repacking a mutation? Clearly it's mutating the internal state, but the logical structure of the dict will not have changed.

Though I suppose if it's being iterated over, then the iterator is keeping an index into the internal array, which would change on repacking? Which means that it's worse than not looking like a mutation: it could make active iterators return results that are actually incorrect.

I have to think that this could be accommodated somehow, but with ugly special-case code, so yeah, not worth it. Though you could keep track of whether there are any active views (or even active iterators) and not repack in that case. I'm sure most dict iterators are most commonly used right away.

Is repacking ever currently done with dicts? If so, then how is this issue worked around?

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Tue, Aug 4, 2020 at 3:35 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Aug 1, 2020 at 6:10 PM Inada Naoki <songofacandy@gmail.com> wrote:
Repacking is mutation, and mutating dict while iterating it breaks the iterator. But `d.items()[42]` don't looks like mutation.
Pardon my ignorance, but IS repacking a mutation? Clearly it's mutating the internal state, but at the logical structure of the dict will not have changed.
You are totally right. Repacking mutates internal state, so it breaks iterators. But it doesn't look like mutation from the users' point of view. That is the main problem. We can not repack silently.
Though I suppose if it's being iterated over, then the iterator is keeping an index into the internal array, which would change on repacking?
Yes.
which means that it's worse than not looking like a mutation, but it could make active iterators return results that are actually incorrect.
In most cases, iterator can detect it and raise RuntimeError.
I have to think that this could be accommodated somehow, but with ugly special case code, so yeah, not worth it. Though you could keep track of if there are any active views (ir even active iterators) and not repack in that case. I'm sure most dict iterators are most commonly used right away.
Is repacking ever currently done with dicts? If so, then how is this issue worked around?
Repacking happens only on insertion resize: when inserting an item and there is no space left to insert it. dict.clear() also creates a clean, empty dict.

Currently, del/pop doesn't cause repacking. https://bugs.python.org/issue32623

Regards,
-- Inada Naoki <songofacandy@gmail.com>
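(This is easy to see from Python -- the exact sizes are a CPython implementation detail, but deleting keys leaves the table large, while rebuilding the dict compacts it:)

    import sys

    d = {i: None for i in range(1000)}
    for i in range(999):
        del d[i]

    print(len(d), sys.getsizeof(d))   # 1 key left, but the table hasn't shrunk

    d = dict(d)                       # rebuilding the dict repacks it
    print(len(d), sys.getsizeof(d))   # 1 key, and a small table again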
On Sat, 1 Aug 2020 at 19:34, Christopher Barker <pythonchb@gmail.com> wrote:
I would think the goal here would be to re-order once in a while to remove the holes. But that would take time, of course, so you wouldn't want to do it on every deletion. But when?
You should have a separate process that does this, so "normal" operations will not be blocked. The process could also cache objects and do garbage collection. A sort of domestic worker. Don't know if this is possible.
About the hole, I was thinking that in theory the problem can be circumvented using a modified version of lookdict.
lookdict searches for a key and returns its position in the ma_keys array. I suppose it's possible to do the contrary: search for the index and return the key. What do you think (theoretically speaking)?
but isn't searching for the index going to require iterating through the array until you find it? i.e. that O(N) operation we're trying to avoid?
O(n) means a*n + b. If a and b are small, the algorithm is fast for not-so-big n.

To sum up the alternatives:

1. rearrange the items array at positional lookup. As Inada said, it will mutate the dict while reading, which is quite unexpected
2. rearrange the items array at deletion. It will slow down deletion. Probably not an option... a NAO
3. no positional indexing. It seems the simplest solution. A number of alternatives are offered (by Tim Peters too)
4. something else.
I wrote some (better than the previously shared) benchmarks for this change a while ago. They are run on cPython with a patch applied that implements dict_views __getitem__() using a method similar to `lookdict` to perform indexing on keys/values/etc.

Irrespective of where in the api this logic should exist, the implementation won't be algorithmically different (I think, even with a `.ordered` view, as the view would have to cope with changes to the underlying dictionary over its lifetime, and external tracking of changes to dicts is not, afaik, feasible. Unlike for-loop constructs which are inherently scoped, I feel like you wouldn't get away with forbidding modifying a dict() if there's a view on keys/values/items still alive, as these things are first-class objects that can be stored/passed around).

Therefore, all index based lookups would have to use a variant of this logic (unless someone can come up with a magic O(1) solution ;) or explicit compaction is used (if anyone has a patch that adds tracking 'compactness' over the dict_keys, I can run the tests using it, to measure the impact -- however I'm not personally sure yet if the overheads of this more invasive change are justified just for enabling indexing).

The cPython patch can be found here: https://github.com/stestagg/dict_index/blob/master/changes.patch, and the benchmark results are linked below.

The tl/dr from my perspective is that these results make the change challenging to continue proposing without a better implementation than the obvious one. (I was weakly +1 before these results). Personally, I'm happy that the numbers give good evidence for this change being more complex than it at-first seems.

Some notes about the benchmarks: I've adapted an existing, not-related, test runner for this, so there may be some compromises. I've tried to be reasonable about capturing OK timing data, but the intent here isn't to spot single-% changes in performance, rather it's looking at significant changes in runtime performance over vastly varying sizes of dicts.

The repo including the test runner, patches and makefile is here: https://github.com/stestagg/dict_index and I'm accepting issues/PRs there if anyone feels that there's an omission or error that's worth correcting.

The numbers are raw, and *do not* have any interpretation layered on them. There are many snippets of code that are not best-practice or ideal ways of achieving things; this is as much because I wanted to see what the impact of these non-optimal patterns would be on common operations. Please take the time to understand the exact test (check the full source if you need) before making any meaningful decisions based on this data.

Graphs are, by default, plotted on log-log axes, so beware when just looking at the shapes of the lines that the real-world difference in run-time is much larger than the absolute line differences suggest. The solution that uses direct indexing is always coloured green in the graphs.

As the code involves a tight loop over a simple structure which is very CPU dependent (and because I can), I've run the benchmarks on a Raspberry pi4 (ARMv7l), and on an AMD pc:

ARM Results: https://stestagg.github.io/dict_index/pi4.html
PC Results: https://stestagg.github.io/dict_index/pc.html

Thanks

Steve

On Sat, Aug 1, 2020 at 10:25 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
On Sat, 1 Aug 2020 at 03:00, Inada Naoki <songofacandy@gmail.com> wrote:
Please teach me if you know any algorithm which has no hole, O(1) deletion, preserving insertion order, and efficient and fast as array.
:)
About the hole, I was thinking that in theory the problem can be circumvented using a modified version of lookdict. lookdict searches for a key and returns its position in the ma_keys array. I suppose it's possible to do the contrary: search for the index and return the key. What do you think (theoretically speaking)?
On Sat, Aug 01, 2020 at 07:25:42PM +0100, Stestagg wrote:
Irrespective of where in the api this logic should exist, the implementation won't be algorithmically different, (I think, even with a `.ordered` view, as the view would have to cope with changes to the underlying dictionary over its lifetime, and external tracking of changes to dicts is not, afaik, feasible. Unlike for-loop constructs which are inherently scoped, I feel like you wouldn't get away with forbidding modifying a dict() if there's a view on keys/values/items still alive, as these things are first-class objects that can be stored/passed around)
Forbidding mutation of the dict while a view exists is missing the point of having a view in the first place: updating the owning object should update the view as well. -- Steven
On Sat, 1 Aug 2020 at 20:25, Stestagg <stestagg@gmail.com> wrote:
I wrote some (better than the previously shared) benchmarks for this change a while ago.
I think that you could speed up the algorithm if you check if dict->ma_keys->dk_lookup == lookdict_unicode_nodummy. If so, the dict is a combined dict with only string keys (quite common), and no deletion was done before, so there's no hole in ma_keys.
On Fri, Jul 31, 2020 at 04:30:22PM -0700, Guido van Rossum wrote:
So: dicts preserve order the dict_views preserve this same order
Some think that we can't call them "ordered", because somehow that implies other things, like the ability to insert, etc, but while I don't get that argument, fine, let's call them "order preserving".
Meh, who cares. (The confusion is usually between "ordered" and "sorted" but we're past that here, so let's just keep saying "ordered".)
I care, and I think other people should care too, because calling dicts "ordered" with no qualifying "by insertion order" provably confuses users. See for example the original poster who started this thread, who imagined that because dicts are now "ordered" he should be able to reorder the keys and insert keys in arbitrary positions.

(That doesn't mean that every single reference to a dict's order needed to explicitly refer back to insertion order. I'm not unreasonably pedantic about this :-))

[Christopher Barker]
1) Because adding ANYTHING new is taking on a commitment to preserve it in the future, and it's more code to write, maintain, and document. So regardless of any other argument, it shouldn't be added without a reasonable use case(s) -- what's reasonable is subjective, of course.
Agreed. It is much harder to remove unneeded functions than to add it, so we should be conservative about adding new functions. We should beware bloat and API creep.
2) Because it could be an "attractive nuisance" while dicts are order preserving, their internal structure is such that you cannot find the nth item without iterating through n items, making access O(n), rather than O(1) for access by key or access of the usual Sequences -- Folks expect numerical indexing to be O(1), so it could be tricky. However, the only other way to get an n'th item is to copy everything into a Sequence, which is a lot slower, or to use islice or next() to do the order N access by hand. So this is an attractive nuisance in the use case of wanting to take multiple items by index, in which case, making a sequence first would be better than directly accessing the dict_view object.
I'm not entirely sure that's a strong argument against this. I never really bought the argument that having O(N) indexing was bad because repeated indexing would then be O(N**2). That's true, but for small enough N everything is fast, and even O(N**3) or worse can be "fast enough" for small N.

The builtins have plenty of other functions which are O(N), e.g. list.count, str.find etc. While one might build O(N**2) operations on top of those, typically people don't.

So I think that, provided the underlying indexing access is "fast enough", which I daresay it would be, we shouldn't care too much about the risk of people writing technically O(N**2) code like:

    for i in range(len(mydict)):
        key = mydict.getkey(i)  # get key by index

This is no worse than what we can already write using existing list or string methods. Beginners can and do write much worse code, and professionals will learn better.

-- Steven
On Sat, Aug 1, 2020 at 8:14 AM Christopher Barker <pythonchb@gmail.com> wrote:
If that's all we've come up with in this lengthy thread, it's probably not worth it -- after all, both of those can be efficiently accomplished with a little help from itertools.islice and/or next().
100% agree.
But I think we all agree that those tools are less newbie-friendly -- but they should be learned at some point, so maybe that's OK (and there is always the wrap it with a list approach, which is pretty newbie friendly)
I don't agree with it. It is somewhat newbie-friendly, but also somewhat newbie-unfriendly. Newbies will use random.sample(d.items()) or get-by-index without knowing it is O(n), even when they could create a list once and use it repeatedly.

When people learn islice, they will learn it is not an O(1) operation too. If dict views support direct indexing, it is very difficult to notice it is an O(n) operation. They will just think "Oh, Python is fucking slow!". So it is newbie-unfriendly at some point.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
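(i.e. the pattern we would want newbies to learn is to materialise the sequence once and reuse it:)

    import random

    d = {i: i * i for i in range(10_000)}

    keys = list(d)                                        # pay the O(n) copy once
    picks = [random.choice(keys) for _ in range(100)]     # each choice is then O(1)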
On Fri, Jul 31, 2020 at 7:34 AM Guido van Rossum <guido@python.org> wrote:
So maybe we need to add dict.ordered() which returns a view on the items that is a Sequence rather than a set? Or ordereditems(), orderedkeys() and orderedvalues()?
I'm still confused as to when "ordered" became synonymous with "Sequence" -- so wouldn't we want to call these dict.as_sequence() or something like that?

And is there a reason that the regular dict views couldn't be both a Set and a Sequence? Looking at the ABCs, I don't see a conflict -- __getitem__, index() and count() would need to be added, and Sets don't have any of those. (and count could be optimized to always return 0 or 1 for dict.keys() ;-) )

But anyway, naming aside, I'm still wondering whether we necessarily want the entire Sequence protocol. For the use cases at hand, isn't indexing and slicing enough?

Which brings us to the philosophy of duck typing. I wrote an earlier post about that -- so here's some follow up thoughts.

I suggested that I like the "if I only need it to quack, I don't care if it's a duck" approach -- I try to use the quack() method, and I'm happy if it works, and raise an Exception (or let whatever Exception is raised bubble up) if it doesn't. Guido pointed out that having a quack() method isn't enough -- it also needs to actually behave as you expect -- which is the nice thing about ABCs -- if you know something is a Sequence, you don't just know that you can index it, you know that indexing it will do what you expect.

Which brings us back to the random.choice() function. It's really simple, and uses exactly the approach I outlined above.

    def choice(self, seq):
        """Choose a random element from a non-empty sequence."""
        try:
            i = self._randbelow(len(seq))
        except ValueError:
            raise IndexError('Cannot choose from an empty sequence') from None
        return seq[i]

It checks the length of the object, picks a random index within that length, and then tries to use that index to get a random item. So anything with a __len__ and a __getitem__ that accepts integers will work. And this has worked "fine" for decades.

Should it be checking that seq is actually a sequence? I don't think so -- I like that I can pass in any object that's indexable by an integer. But there is a potential problem here -- all it does is try to pass an integer to __getitem__. So all Sequences should work. But Mappings also have a __getitem__, but with slightly different semantics -- a Sequence should accept an integer (or object with an __index__) in the range of its size, but a Mapping can accept any valid key. So for the most part, passing a Mapping to random.choice() fails as it should, with a KeyError. But if you happen to have a key that is an integer, it might succeed, but it would not be doing "the right thing" (unless the Mapping happened to be constructed exactly the right way -- but then it should probably just be a Sequence).

So: do we need a solution to this? I don't think so, it's simply the nature of dynamic typing as far as I'm concerned, but if we wanted it to be more robust, we could require (maybe only with a static type declaration) that the object passed in is a Sequence. But I think that would be a shame -- this function doesn't need a full Sequence, it only needs a Sized and __getitem__.

In fact, the ABCs are designed to accommodate much of this -- for example, the Sized ABC only requires one feature: __len__. And Contains only __contains__. As far as I know there are no built-ins (or commonly used third party) objects that are ONLY Sized, or ONLY Contains. In fact, at least in collections.abc, every ABC that has __contains__ also has __len__.
And I can't think of anything that could support "in" that didn't have a size -- which could be a failure of imagination on my part. But you could type check for Contains is all you wanted to do was know that you could use it with "in". So there are ABCs there simply to support a single method. Which means that we could solve the "problem" of random.choice with a "Getitemable" ABC. Ahh -- but here's the rub -- while the ABCs only require certain methods -- in fact, it's implied that they have particular behavior as well. And this is the problem at hand. Both Sequences and Mappings have a __getitem__, but they have somewhat different meanings, and that meaning is embedded in the ABC itself, rather than the method: Sequences will take an integer, and raise a IndexError if its out of range, and Mappings take any hashable, and will raise a KeyError if it's not there. So maybe what is needed is an Indexable ABC that implies the Sequence-like indexing behavior. Then if we added indexing to dict views, they would be an Indexable, but not a Sequence. -CHB
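A minimal sketch of what such an Indexable ABC might look like (hypothetical -- nothing like this exists in collections.abc today; the register() calls below are illustrative only):

from abc import ABCMeta, abstractmethod

class Indexable(metaclass=ABCMeta):
    """Promises Sequence-style integer indexing -- obj[i] for
    0 <= i < len(obj), raising IndexError otherwise -- without the
    rest of the Sequence API (index(), count(), slicing, ...)."""

    @abstractmethod
    def __len__(self):
        raise NotImplementedError

    @abstractmethod
    def __getitem__(self, index):
        raise NotImplementedError

# Registration would be explicit, since a structural check on method names
# alone cannot tell Sequence-style __getitem__ from Mapping-style __getitem__:
Indexable.register(list)
Indexable.register(tuple)
Indexable.register(range)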
On Fri, Jul 31, 2020 at 05:29 Ricky Teachey <ricky@teachey.org> wrote:
On Fri, Jul 31, 2020, 2:48 AM Wes Turner <wes.turner@gmail.com> wrote:
# Dicts and DataFrames - Src: https://github.com/westurner/pythondictsanddataframes/blob/master/dicts_and_... - Binder: https://mybinder.org/v2/gh/westurner/pythondictsanddataframes/master?filepat... - (interactive Jupyter Notebook hosted by https://mybinder.org/ )
The punchline of Wes Turner's notebook (very well put together, thank you!) seems to partly be that if you find yourself wanting to work with the position of items in a dict, you might want to consider using a pandas.Series (with its .iloc method).
A difficulty that immediately came to mind with this advice is type hinting support. I was just googling yesterday for "how to type hint using pandas" and the only thing I found is to use pd.Series and pd.DataFrame directly.
But those don't support type hinting comparable to:
Dict[str, float]
Or:
class Vector(TypedDict):
    i: float
    j: float
This is a big downside of the advice "just use pandas". Although I love using pandas and use it all the time.
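A small illustration of that contrast (the function names are hypothetical, for illustration only -- pandas itself does not let you parametrize the element type in these annotations, as far as I know):

from typing import Dict
import pandas as pd

def norm_from_dict(v: Dict[str, float]) -> float:
    # The value type (float) is part of the annotation and is checkable.
    return sum(x * x for x in v.values()) ** 0.5

def norm_from_series(v: pd.Series) -> float:
    # Only "this is a Series" can be expressed; the element dtype cannot.
    return float((v ** 2).sum()) ** 0.5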
-- --Guido (mobile)
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Yeah, it is totally doable to refactor the collection ABCs to have something in between `Collection` and `Sequence` that just supports `__getitem__`.

But I would take Marco's research (and Inada's musings) seriously -- we don't actually want to support `__getitem__`, because of the unpredictable performance characteristics.

I'm no longer in favor of adding .ordered() -- I think it's better to add something to itertools, for example first() to get the first item (see Tim Peters' post), and something related to get the first N items.

On Sat, Aug 1, 2020 at 12:28 PM Christopher Barker <pythonchb@gmail.com> wrote:
[...]
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
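A stdlib-only sketch of the kind of helpers being suggested (the names first() and take() are assumptions here -- take() mirrors the recipe in the itertools documentation; neither currently lives in itertools itself):

from itertools import islice

_MISSING = object()

def first(iterable, default=_MISSING):
    """Return the first item of any iterable (e.g. the first key of a dict)."""
    for item in iterable:
        return item
    if default is _MISSING:
        raise ValueError("first(): empty iterable and no default given")
    return default

def take(n, iterable):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))

d = {'a': 1, 'b': 2, 'c': 3}
assert first(d) == 'a'
assert first(d.items()) == ('a', 1)
assert take(2, d.values()) == [1, 2]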
first()
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.firs...
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.first

last()
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.last
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.last

take()
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.take
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.take

tail()
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.tail
https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.tail

...

pluck(ind, seqs, default='__no__default__')
plucks an element or several elements from each item in a sequence. https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.pluck
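For example, applied to a dict (assuming more-itertools is installed; the calls below follow its documented API to the best of my knowledge):

from more_itertools import first, last, take, tail

d = {'a': 1, 'b': 2, 'c': 3}
first(d.items())      # ('a', 1)
last(d)               # 'c'
take(2, d.values())   # [1, 2]
list(tail(2, d))      # ['b', 'c']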
On Sat, Aug 1, 2020, 4:59 PM Guido van Rossum <guido@python.org> wrote:
Yeah, it is totally doable to refactor the collection ABCs to have something in between `Collection` and `Sequence` that just supports `__getitem__`.
But I would take Marco's research (and Inada's musings) seriously -- we don't actually want to support `__getitem__`, because of the unpredictable performance characteristics.
I'm no longer in favor of adding .ordered() -- I think it's better to add something to itertools, for example first() to get the first item (see Tim Peters' post), and something related to get the first N items.
[...]
On Sun, Jul 12, 2020 at 4:43 AM Christopher Barker <pythonchb@gmail.com> wrote:
The existing dictionary memory layout doesn't support direct indexing (without stepping), so this functionality is not being added as a requirement.
But it does make it much more efficient if the stepping is done inside the dict object, by code that knows its internal structure -- both because it can be in C, and because it can be done without any additional references or copying. Yes, it's all O(n), but with a very different constant.
It's just a difference in proportional constants. If the performance difference is really important, dict views should get `d.items().tolist()` (a replacement for `list(d.items())`) before random indexing -- it would be used far more often. Currently, list(iterable) doesn't have any specialized algorithm for dict views. (As far as I know, it doesn't have a specialized algorithm even for dict itself.)
If random.choice should support non-sequence ordered containers, just propose that for random.choice.
That would indeed solve the usability issue, and so may be a good idea,
The problem here is that there is no way for random.choice to efficiently work with generic Mappings. This whole discussion started because now that dicts preserve order, there is both a logical reason and a practical implementation for indexing. But if that is not exposed, then neither random.choice() nor any other function can take advantage of it.
Ditto. Iterating the internal structure directly is not so important. And there is only a small performance difference between the current dict and the previous dict implementation for iteration. I suppose the new dict implementation is only 20~40% faster, when the dict is clean (no item has been removed yet). So I don't think preserving order is a good reason to support indexing while `random.choice` is the only use case.
Which would lead to adding a random_choice protocol -- but THAT sure seems like overkill. (OK, you could have the builtin random.choice check for an actual dict, and then use custom code to make a random selection, but that would really be a micro-optimization!)
I already wrote sample code using `itertools.islice()`. It works for all containers with a len() and an iterator. No need to add a protocol.
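A minimal sketch of that islice-based approach (a reconstruction under the stated assumptions, not Inada's actual code): it steps O(n) in the worst case but avoids building a throwaway list.

import random
from itertools import islice

def random_item(container):
    """Return a random element from any container with __len__ and __iter__
    (a dict view, a set, ...) without copying it into a list."""
    n = len(container)
    if n == 0:
        raise IndexError('cannot choose from an empty container')
    i = random.randrange(n)
    return next(islice(iter(container), i, None))

d = {'a': 1, 'b': 2, 'c': 3}
key, value = random_item(d.items())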
If a feature is useful, and doesn't conflict with another feature, then we can add it.
I believe this is a bad idea. It leads software to become huge, unmaintainable, and unusable. A relatively higher bar must be set for adding a feature to a builtin type than for adding a third-party package on PyPI. Regards, -- Inada Naoki <songofacandy@gmail.com>
On Fri, Jul 10, 2020 at 06:06:03PM +0100, Stestagg wrote:
I don't believe that this feature would steepen the language learning curve however, but actually help to shallow it slightly (Explained more below)
It steepens it because it's yet another feature to learn. Is it dicts or dict views that support indexing? What's the difference between dict[1] and dict.keys()[1]? These things need to be taught. Remember that there is a cohort of Python programmers who have never, ever been exposed to any version of Python where dict.keys returned a list. They do not have your expectation "but it used to work!". And that cohort is only going to increase.
What making dict_* types a Sequence will do is make this code (as written) behave:

1. like it used to do
2. like most people seem to expect it to.
The "used to" part is easy to fix: call list. That's what dict.items etc used to implicitly do. Now it's explicit. As for the second part, I agree -- I was surprised that it didn't work, but I fear that's because numpy is, in many places, awfully un-Pythonic. numpy's design seems to me to be more closely inspired by a hodge-podge of mismatched Fortran and C libraries than Python, which is hardly surprising.
Currently numpy does something that I consider unexpected (I'm sure, given your previous responses, you'd disagree with this,
Heh, no, I consider numpy to be doing the wrong thing here. I think this exposes numpy's pre-iterator origins:

>>> np.array(iter([1, 2, 3, 4]))
array(<list_iterator object at 0x7f3d7d0e1be0>, dtype=object)

but I'm sure that numpy fans will argue that's a good thing: if you want to pass an iterator to array, explicitly converting it to a sequence is the right thing to do. [...]
I suspect that even if dict items were indexable, Raymond Hettinger would not be happy with random.sample on dict views.
I don't know why? I can understand deprecating sets here as they're unordered, so the results when seed() has been called are not consistent. I don't see why Raymond would object to allowing sampling an ordered container, one from which the results will be reproducible.
I wish to back away from predicting what Raymond would say. When I wrote that in the wee hours of the morning (local time) it seemed perfectly obvious to me why he would dislike it, but after a good sleep I now find myself completely unable to recall why that was the case. So let me backpedal furiously, withdraw that remark, and let Raymond speak for himself :-) [...]
This is the proposal, I want to make these things Sequences.
But they can't be Sequences, since they are already Sets. They would have to be a hybrid of the two, and that, I feel, comes with more baggage than just being one or the other. It's not that I react to the very idea with horror, but it's not a clean design to mix Sequence and Set (OrderedSet?), so without a good reason for it, I'd rather hold off. Not so much "Never!!!" as "Not yet."

Besides:

- to make views a Sequence, they would have to also support count and index methods, so this proposal doesn't go far enough;

- and in Python 2, they weren't merely Sequences, they were full-blown actual lists, which supported slicing, in-place sorting, and various other methods.

So this proposal doesn't come close to either returning to the Python 2 days or offering a Sequence API. It's just a hybrid that falls short of being either a sequence or a list.
...not that we can jump to the 350th key without stepping through the previous 349 keys.
The existing dictionary memory layout doesn't support direct indexing (without stepping), so this functionality is not being added as a requirement.
Oh, I'm sorry, I misunderstood. -- Steven
On Thu, Jul 9, 2020, at 13:26, Stestagg wrote:
Obviously, Python is a general-purpose, Turing-complete language, so each of these options can be written in other ways. But it would be nice if the simple, readable versions also worked :D
The idea that there are future, unspecified changes to dicts() that may or may not be hampered by allowing indexing sounds like FUD to me, unless there are concrete references?
Does the current implementation support indexing? It is certainly possible in principle to preserve ordering without indexing, for example if a linked list is used to support the ordering, or if items are stored in an array where deleted items leave holes permanently until the dict is resized. I don't know how dict works, but I am not sure how you would support indexing while also allowing deletion to be O(1). A specific random key function would be narrower in scope than this, and for anyone who wants full sequence support, perhaps that could be added to OrderedDict.
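A toy illustration of why deletion-with-holes and O(1) positional indexing pull in different directions (a deliberately simplified model, not CPython's actual dict layout):

# A dense, insertion-ordered entries array where deletion leaves a hole:
entries = [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
entries[1] = None                          # "delete" 'b' in O(1), order kept
live = [e for e in entries if e is not None]
assert live[1] == ('c', 3)                 # the 2nd live entry is no longer at
                                           # array index 1: finding it is O(n)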
Christopher Barker writes:
d.keys()[-1] vs list(d.keys())[-1]
Should be compared with `next(reversed(d.keys()))`, or `next(reversed(d))`.
Same point - the idea is to have indexing syntax.
I'm a-gonna make you REAALLY MAAD. You can have it. d.popitem()[0] :-)
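The catch, of course, is that d.popitem() removes the item it returns, whereas next(reversed(d)) only peeks at the last key. A quick illustration:

d = {'a': 1, 'b': 2, 'c': 3}
assert next(reversed(d)) == 'c'    # peeks at the last key, d is unchanged
assert d.popitem() == ('c', 3)     # returns the last item *and removes it*
assert 'c' not in d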
On Wed, Jul 08, 2020 at 05:44:00PM +0100, Stestagg wrote:
So, use-case discussions aside, you get some pretty convincing performance wins even on small-ish dictionaries if direct indexing is used.
This reminds me of an anecdote my wife tells me about the days when she was in a band in the USA. She and the roadies were travelling from a gig in one city to another when she noticed a highway sign: "Aren't we travelling the wrong way?" To which the roadie driving answered "Who cares, we're making fantastic time!"

Ah, the sixties. Who cares if there are no practical use-cases for this feature, it's really fast! *wink*

Assuming your timing tests are accurate (a lot of people don't time their code well, and consequently get very inaccurate numbers) and representative (timing is very sensitive to the combination of Python version, OS, hardware, and load on the local machine), your results are micro-benchmarks, the least helpful benchmarks.

For a dict with 15000 items, let us accept that your suggested:

    mydict.items()[0]

is 5000 times faster than:

    list(mydict.items())[0]

Conceded! But if you're only calling it *once*, who cares? It's still only 3 milliseconds. My (hypothetical) script takes half a second to run, I'm not going to notice 3ms.

What if I call it a thousand times? Then I ought to make a list *once* and index the list repeatedly, not keep making and throwing away the list, which gives me constant-time random access to the items:

    L = list(mydict.items())
    for i in indices:
        L[i]

and I expect that list access is probably going to be faster than mydict.items()[i] access. So now the cost of building the list is amortized over a thousand calls, and becomes more or less invisible.

In my opinion, the reasons why this proposal is not compelling include:

1. lack of a strong use-case;

2. inelegant semantic design that adds sequence-like behaviour to views which are (imperfectly) designed to be set-like;

3. and concern that adding this to views would constrain future performance improvements to the underlying dict.

(There may be others.) Unless I have missed any others, we've only seen three use-cases:

(a) The original post wanted a full sequence API for dictionaries, with the ability to insert keys at a specific index, not just get the N-th key. Not going to happen.

(b) You've suggested "get any item", but that's probably better written as `next(iter(mydict.items()))`.

(c) And `random.choice(mydict.items())`, which seems to be lacking any *concrete* use-case -- under what circumstances would we want this and care about its performance *enough* to add this to builtin dict views? (It seems more like toy code we might write to illustrate the use of indexed access than something with a concrete use-case behind it.)

If there was a really strong use-case, points 2 and 3 could be overruled. But lacking a good use-case, the conservative safe choice here is to keep the status quo and reject the proposal.

So I think that if you really want to champion this proposal, you would be better off looking for concrete, practical use-cases and macro-benchmarks, not micro.

-- Steven
On Thu, Jul 9, 2020 at 9:16 PM Steven D'Aprano <steve@pearwood.info> wrote:
Unless I have missed any others, we've only seen three use-cases:
(a) The original post wanted a full sequence API for dictionaries, with the ability to insert keys at a specific index, not just get the N-th key. Not going to happen.
(b) You've suggested "get any item", but that's probably better written as `next(iter(mydict.items()))`.
(c) And `random.choice(mydict.items())`, which seems to be lacking any *concrete* use-case -- under what circumstances would we want this and care about its performance *enough* to add this to builtin dict views?
Getting a random element from a dict (not an arbitrary one but a random one) definitely does have a use-case. I've wanted it at times. Usually if I'm doing this in Python, I just pay the price and build the list, but as proof that it's a logical and useful operation, here's Pike's general random function:

http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/random.ht...

If given a number, it picks a random number. If given an array, it picks a random element. And if given a mapping (dictionary), it returns (key,value), without first converting to a flat list.

I don't think the use-case is strong enough, since there are many possible variants you might want (imagine using collections.Counter to define a weighted average - it'd be handy to say "pick a random element, treating this as a bag/multiset"), so it's best to just custom write it when you need it. But if it's needed, it would want to be on the dictionary itself, not on the views, IMO.

ChrisA
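For the Counter/multiset variant mentioned above, the custom write-up is already short with the stdlib -- random.choices() accepts per-element weights (a sketch, not a proposal for dict itself):

import random
from collections import Counter

bag = Counter(red=5, green=3, blue=1)
# Draw one element as if from the multiset, i.e. weighted by its count.
elem = random.choices(list(bag), weights=list(bag.values()), k=1)[0]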
On 09.07.20 14:25, Chris Angelico wrote:
On Thu, Jul 9, 2020 at 9:16 PM Steven D'Aprano <steve@pearwood.info> wrote:
Unless I have missed any others, we've only seen three use-cases:
(a) The original post wanted a full sequence API for dictionaries, with the ability to insert keys at a specific index, not just get the N-th key. Not going to happen.
(b) You've suggested "get any item", but that's probably better written as `next(iter(mydict.items()))`.
(c) And `random.choice(mydict.items())`, which seems to be lacking any *concrete* use-case -- under what circumstances would we want this and care about its performance *enough* to add this to builtin dict views?
Getting a random element from a dict (not an arbitrary one but a random one) definitely does have a use-case. I've wanted it at times.
This doesn't seem to be an argument for modifying dict views, since the actual "issue" lies with `random.choice` and the way it can (or cannot) interact with the containers it's given. Even when implemented on dict views, this still won't work for sets, for example. And it's definitely reasonable to ask "pick a random element from this set" (in terms of real-world language). But from the perspective of the set API there is no way to pick specific elements, and hence it must either rely on conversion to a sequence or rely on implementation details of the set data structure. So if the use case is to get a random element from a dict, the discussion should rather be centered around `random.choice` and also respect other non-sequence containers.
On Thu, Jul 09, 2020 at 10:25:46PM +1000, Chris Angelico wrote:
Getting a random element from a dict (not an arbitrary one but a random one) definitely does have a use-case. I've wanted it at times. Usually if I'm doing this in Python, I just pay the price and build the list, but as proof that it's a logical and useful operation, here's Pike's general random function:
http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/random.ht...
If given a number, it picks a random number. If given an array, it picks a random element. And if given a mapping (dictionary), it returns (key,value), without first converting to a flat list.
That's an existence proof that somebody else implemented the function, not a use-case. Getting a random key/item pair directly from a dict is a mechanism, not an explanation of why you need a random key-item pair.

Analogy: "Intercal's random function accepts a module as argument and returns the value of a random global variable from that module." (In real life it doesn't, so far as I know, but it should. If not Intercal, perhaps Malbolge, or another such esoteric language designed to be impractical and frustrating.)

I'll admit that I can't actually think of a reason why one should not want a random key/value pair, but that's not the same as being able to think of a reason why one *would* want a random key/value pair. Perhaps I just lack imagination. But as you say:
I don't think the use-case is strong enough, since there are many possible variants you might want (imagine using collections.Counter to define a weighted average - it'd be handy to say "pick a random element, treating this as a bag/multiset"), so it's best to just custom write it when you need it.
-- Steven
On Fri, Jul 10, 2020 at 7:20 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jul 09, 2020 at 10:25:46PM +1000, Chris Angelico wrote:
Getting a random element from a dict (not an arbitrary one but a random one) definitely does have a use-case. I've wanted it at times. Usually if I'm doing this in Python, I just pay the price and build the list, but as proof that it's a logical and useful operation, here's Pike's general random function:
http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/random.ht...
If given a number, it picks a random number. If given an array, it picks a random element. And if given a mapping (dictionary), it returns (key,value), without first converting to a flat list.
That's an existence proof that somebody else implemented the function, not a use-case. Getting a random key/item pair directly from a dict is a mechanism, not an explanation of why you need a random key-item pair.
And immediately above that part, I said that I had made use of this, and had used it in Python by listifying the dict first. Okay, so I didn't actually dig up the code where I'd done this, but that's a use case. I have *actually done this*. In both languages. ChrisA
On Fri, Jul 10, 2020 at 1:53 PM Chris Angelico <rosuav@gmail.com> wrote:
And immediately above that part, I said that I had made use of this, and had used it in Python by listifying the dict first. Okay, so I didn't actually dig up the code where I'd done this, but that's a use case. I have *actually done this*. In both languages.
Do you pick a random item repeatedly from the same dictionary? If so, you can create a list once in Python. Or do you pick random items from various dictionaries? If so, your application must create many dictionaries, not only pick random items. Unless the random-picking part is the bottleneck, micro-optimization of this part is not important. That's why the higher-level use case is needed to design a useful benchmark. Regards, -- Inada Naoki <songofacandy@gmail.com>
On Fri, Jul 10, 2020 at 3:41 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Fri, Jul 10, 2020 at 1:53 PM Chris Angelico <rosuav@gmail.com> wrote:
And immediately above that part, I said that I had made use of this, and had used it in Python by listifying the dict first. Okay, so I didn't actually dig up the code where I'd done this, but that's a use case. I have *actually done this*. In both languages.
Do you pick random item repeatedly from same dictionary? If so, you can create a list once in Python.
I would pick repeatedly from the same dictionary but it might be mutated in between. So the list would have to be reconstructed fresh every time. ChrisA
On Fri, Jul 10, 2020 at 2:56 PM Chris Angelico <rosuav@gmail.com> wrote:
but it might be mutated in between. So the list would have to be reconstructed fresh every time.
Do you think the use case is common enough to add a feature to a builtin type? I cannot imagine it is. Anyway, if you want to benchmark that use case, the benchmark should include the time to mutate the dict too. Otherwise, the performance benefit is exaggerated and no one will trust the benchmark. Regards, -- Inada Naoki <songofacandy@gmail.com>
On Fri, Jul 10, 2020 at 4:04 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Fri, Jul 10, 2020 at 2:56 PM Chris Angelico <rosuav@gmail.com> wrote:
but it might be mutated in between. So the list would have to be reconstructed fresh every time.
Do you think the use case is common enough to add a feature to a builtin type? I cannot imagine it is.
Go back and reread my post, I already answered this and several other questions :) ChrisA
Oh, sorry. I read that you did "random pick from dict", but I hadn't caught whether your use case was "random pick and mutate dict" or not. On Fri, Jul 10, 2020 at 3:08 PM Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Jul 10, 2020 at 4:04 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Fri, Jul 10, 2020 at 2:56 PM Chris Angelico <rosuav@gmail.com> wrote:
but it might be mutated in between. So the list would have to be reconstructed fresh every time.
Do you think the use case is common enough to add a feature to a builtin type? I cannot imagine it is.
Go back and reread my post, I already answered this and several other questions :)
ChrisA
-- Inada Naoki <songofacandy@gmail.com>
On Fri, 10 Jul 2020 at 06:56, Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Jul 10, 2020 at 3:41 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Fri, Jul 10, 2020 at 1:53 PM Chris Angelico <rosuav@gmail.com> wrote:
And immediately above that part, I said that I had made use of this, and had used it in Python by listifying the dict first. Okay, so I didn't actually dig up the code where I'd done this, but that's a use case. I have *actually done this*. In both languages.
Do you pick random item repeatedly from same dictionary? If so, you can create a list once in Python.
I would pick repeatedly from the same dictionary but it might be mutated in between. So the list would have to be reconstructed fresh every time.
Maybe there could be a function in the random module for making a random choice from an arbitrary (finite) iterable:

# rnd.py
from random import randrange

def choice_iter(iterable):
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if not randrange(n):
            value = candidate
    return value

You could use that to get a choice from a dict, set etc. For a large dict/set it will be slow compared to converting to a list because it's calling randrange once per item. If you have a good reason not to build the list though then you could use it. For example to get a random line from a possibly large text file:
>>> from rnd import choice_iter
>>> with open('rnd.py') as fin:
...     line = choice_iter(fin)
...
>>> line
'    except StopIteration:\n'
It's not hard to generalise the function above to something that e.g. makes a selection of k values from the iterable in a single pass. Here is a version that works without replacement:

def sample_iter(population, k):
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Too few items")
        values.append(value)
    for n, candidate in enumerate(iterator, k+1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values  # maybe shuffle first?

--
Oscar
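Applied to the dict case under discussion (a usage sketch, assuming sample_iter is defined as above):

d = {'ann': 3, 'bob': 7, 'cai': 5}
pairs = sample_iter(d.items(), 2)   # two distinct (key, value) pairs, one pass
keys = sample_iter(d, 2)            # or sample keys only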