itertools addition: getitem()
I'd like to propose the following addition to itertools: a function itertools.getitem() which is basically equivalent to the following Python code:

    _default = object()

    def getitem(iterable, index, default=_default):
        try:
            return list(iterable)[index]
        except IndexError:
            if default is _default:
                raise
            return default

but without materializing the complete list. Negative indexes are supported too (this requires additional temporary storage for abs(index) objects).

The patch is available at http://bugs.python.org/1749857

Servus,
Walter
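The phrase "without materializing the complete list" can be illustrated with a minimal pure-Python sketch. This is not the C patch from the tracker; it is an assumption about one way to get the described behaviour in present-day Python, using itertools.islice() for non-negative indexes and a bounded collections.deque as the "temporary storage for abs(index) objects". The sentinel name _missing is made up for the sketch.

```python
from collections import deque
from itertools import islice

_missing = object()  # hypothetical sentinel, not taken from the patch


def getitem(iterable, index, default=_missing):
    """Return the item at position `index` without building a full list."""
    it = iter(iterable)
    if index >= 0:
        # Skip `index` items, then hand back the next one if it exists.
        for item in islice(it, index, index + 1):
            return item
    else:
        # Keep only the last abs(index) items while consuming the iterator.
        window = deque(it, maxlen=-index)
        if len(window) == -index:
            return window[0]
    if default is _missing:
        raise IndexError("index out of range")
    return default


third = getitem(iter(range(10)), 3)               # 3
second_to_last = getitem(iter(range(10)), -2)     # 8
fallback = getitem(iter([]), 0, None)             # None
```

Note that the negative-index branch has to consume the whole iterator, so it never returns for an infinite one.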
How important is it to have the default in this API? __getitem__()
doesn't have a default; instead, there's a separate API get() that
provides a default (and I find defaulting to None more manageable than
the "_default = object()" pattern).
--Guido
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
> How important is it to have the default in this API? __getitem__() doesn't have a default; instead, there's a separate API get() that provides a default (and I find defaulting to None more manageable than the "_default = object()" pattern).

getattr() has a default too, while __getattr__ hasn't...

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On 7/8/07, Georg Brandl wrote:
> Guido van Rossum wrote:
> > How important is it to have the default in this API? __getitem__() doesn't have a default; instead, there's a separate API get() that provides a default (and I find defaulting to None more manageable than the "_default = object()" pattern).
>
> getattr() has a default too, while __getattr__ hasn't...

Fair enough. But I still want to hear of a practical use case for the default here.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
> On 7/8/07, Georg Brandl wrote:
> > Guido van Rossum wrote:
> > > How important is it to have the default in this API? __getitem__() doesn't have a default; instead, there's a separate API get() that provides a default (and I find defaulting to None more manageable than the "_default = object()" pattern).

Of course it isn't implemented this way in the C version.

> > getattr() has a default too, while __getattr__ hasn't...
>
> Fair enough.
>
> But I still want to hear of a practical use case for the default here.

In most cases

    foo = getitem(iterable, 0, None)
    if foo is not None:
        ...

is simpler than:

    try:
        foo = getitem(iterable, 0)
    except IndexError:
        pass
    else:
        ...

Here is a use case from one of my "import XML into the database" scripts:

    compid = getitem(root[ns.Company_company_id], 0, None)
    if compid:
        compid = int(compid)

The expression root[ns.company_id] returns an iterator that produces all children of the root node that are of the element type company_id. If there is a company_id, its content will be turned into an int; if not, None will be used.

Servus,
Walter
On 7/8/07, Walter Dörwald wrote:
> > But I still want to hear of a practical use case for the default here.
>
> In most cases
>
>     foo = getitem(iterable, 0, None)
>     if foo is not None:
>         ...
>
> is simpler than:
>
>     try:
>         foo = getitem(iterable, 0)
>     except IndexError:
>         pass
>     else:
>         ...
>
> Here is a use case from one of my "import XML into the database" scripts:
>
>     compid = getitem(root[ns.Company_company_id], 0, None)
>     if compid:
>         compid = int(compid)
>
> The expression root[ns.company_id] returns an iterator that produces all children of the root node that are of the element type company_id. If there is a company_id its content will be turned into an int, if not None will be used.

Ahem. I hope you have a better use case for getitem() than that (regardless of the default issue). I find it clearer to write that as

    try:
        compid = root[ns.company_id].next()
    except StopIteration:
        compid = None
    else:
        compid = int(compid)

While this is more lines, it doesn't require one to know about getitem() on an iterator. This is the same reason why setdefault() was a mistake -- it's too obscure to invent a compact spelling for it since the compact spelling has to be learned or looked up.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
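For readers on current Python: the .next() method used above became the built-in next() (available since Python 2.6), which also accepts a default. The following is only a sketch of how the same "first child or None" logic might be spelled today. The list standing in for root[ns.company_id] is an assumption made so the snippet is self-contained.

```python
# Stand-in for the iterator returned by root[ns.company_id] (assumed data).
children = iter(["42"])

# next() with a default replaces the try/except StopIteration dance.
compid = next(children, None)
if compid is not None:
    compid = int(compid)

# With no matching child the default is returned instead of raising.
no_children = iter([])
missing = next(no_children, None)
```

Whether this counts as the kind of compact spelling the message above argues against is a judgment call; it is shown only as the closest built-in equivalent.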
On 7/8/07, Guido van Rossum wrote:
> Ahem. I hope you have a better use case for getitem() than that (regardless of the default issue). I find it clearer to write that as
>
>     try:
>         compid = root[ns.company_id].next()
>     except StopIteration:
>         compid = None
>     else:
>         compid = int(compid)
>
> While this is more lines, it doesn't require one to know about getitem() on an iterator. This is the same reason why setdefault() was a mistake -- it's too obscure to invent a compact spelling for it since the compact spelling has to be learned or looked up.

Apropos of this discussion, I've occasionally wanted a faster version of the following:

    _nothing = object()

    def nth_next(seq, n, default=_nothing):
        '''
        Return the n'th next element for seq, if it exists.

        If default is specified, it is returned when the sequence is too
        short. Otherwise StopIteration is raised.
        '''
        try:
            for i in xrange(n-1):
                seq.next()
            return seq.next()
        except StopIteration:
            if default is _nothing:
                raise
            return default

The nice thing about this function is that it solves several problems in one: extraction of the n'th next element, testing for a minimum sequence length given a sentinel value, and just skipping n elements. It also leaves the sequence in a useful and predictable state, which is not true of the Python-version getitem code.

While cute, I can't say if it is worthy of being an itertool function.

Also vaguely apropos:

    def ilen(seq):
        'Return the length of the hopefully finite sequence'
        n = 0
        for x in seq:
            n += 1
        return n

Why? Because I find myself implementing it in virtually every project. Maybe I'm just an outlier, but many algorithms I implement need to consume iterators (for side-effects, obviously) and it is sometimes nice to know exactly how many elements were consumed.

~Kevin
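The "useful and predictable state" point is easier to see in a runnable form. Below is a hedged Python 3 port of nth_next() as posted above (xrange and .next() replaced by range and the built-in next(); otherwise the logic is unchanged), followed by a small demonstration. The sample strings are made up for illustration.

```python
_nothing = object()


def nth_next(iterator, n, default=_nothing):
    """Return the n'th next element of `iterator`, counting from 1."""
    try:
        # Discard n-1 elements, then return the following one.
        for _ in range(n - 1):
            next(iterator)
        return next(iterator)
    except StopIteration:
        if default is _nothing:
            raise
        return default


letters = iter("abcdefg")
third = nth_next(letters, 3)        # 'c'
following = next(letters)           # 'd': the iterator resumes right after

# Minimum-length test via a sentinel default.
too_short = nth_next(iter("ab"), 5, default=None)   # None
```

After the call, exactly n elements have been consumed, which is what makes repeated calls on the same iterator easy to reason about.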
On 7/8/07, Kevin Jacobs wrote:
> Also vaguely apropos:
>
>     def ilen(seq):
>         'Return the length of the hopefully finite sequence'
>         n = 0
>         for x in seq:
>             n += 1
>         return n

Also known as::

    sum(1 for _ in iterable)

That's always been simple enough that I didn't feel a need for an ilen() function.

STeVe
--
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity. --- Bucky Katt, Get Fuzzy
Guido van Rossum wrote:
> On 7/8/07, Walter Dörwald wrote:
> > > But I still want to hear of a practical use case for the default here.
> >
> > In most cases
> >
> >     foo = getitem(iterable, 0, None)
> >     if foo is not None:
> >         ...
> >
> > is simpler than:
> >
> >     try:
> >         foo = getitem(iterable, 0)
> >     except IndexError:
> >         pass
> >     else:
> >         ...
> >
> > Here is a use case from one of my "import XML into the database" scripts:
> >
> >     compid = getitem(root[ns.Company_company_id], 0, None)
> >     if compid:
> >         compid = int(compid)
> >
> > The expression root[ns.company_id] returns an iterator that produces all children of the root node that are of the element type company_id. If there is a company_id its content will be turned into an int, if not None will be used.
>
> Ahem. I hope you have a better use case for getitem() than that (regardless of the default issue). I find it clearer to write that as
>
>     try:
>         compid = root[ns.company_id].next()
>     except StopIteration:
>         compid = None
>     else:
>         compid = int(compid)
>
> While this is more lines, it doesn't require one to know about getitem() on an iterator. This is the same reason why setdefault() was a mistake -- it's too obscure to invent a compact spelling for it since the compact spelling has to be learned or looked up.

Well, I have used (a Python version of) this getitem() function to implement a library that can match a CSS3 expression against an XML tree. For implementing the nth-child(), nth-last-child(), nth-of-type() and nth-last-of-type() pseudo classes (see http://www.w3.org/TR/css3-selectors/#structural-pseudos) getitem() was very useful.

Servus,
Walter
[Walter Dörwald]
> I'd like to propose the following addition to itertools: A function itertools.getitem() which is basically equivalent to the following python code:
>
>     _default = object()
>
>     def getitem(iterable, index, default=_default):
>         try:
>             return list(iterable)[index]
>         except IndexError:
>             if default is _default:
>                 raise
>             return default
>
> but without materializing the complete list. Negative indexes are supported too (this requires additional temporary storage for abs(index) objects).

Why not use the existing islice() function?

    x = list(islice(iterable, i, i+1)) or default

Also, as a practical matter, I think it is a bad idea to introduce __getitem__ style access to itertools because the starting point moves with each consecutive access:

    # access items 0, 2, 5, 9, 14, 20, ...
    for i in range(10):
        print getitem(iterable, i)

Worse, this behavior changes depending on whether the iterable is re-iterable (a string would yield consecutive items while a generator would skip around as shown above).

Besides being a bug factory, I think the getitem proposal would tend to steer people down the wrong road, away from more natural solutions to problems involving iterators. A basic step in learning the language is to differentiate between sequences and general iterators -- we should not conflate the two.

Raymond
Raymond Hettinger wrote:
> [Walter Dörwald]
> > I'd like to propose the following addition to itertools: A function itertools.getitem() which is basically equivalent to the following python code:
> >
> >     _default = object()
> >
> >     def getitem(iterable, index, default=_default):
> >         try:
> >             return list(iterable)[index]
> >         except IndexError:
> >             if default is _default:
> >                 raise
> >             return default
> >
> > but without materializing the complete list. Negative indexes are supported too (this requires additional temporary storage for abs(index) objects).
>
> Why not use the existing islice() function?
>
>     x = list(islice(iterable, i, i+1)) or default

This doesn't work, because it produces a list:

    >>> list(islice(xrange(10), 2, 3)) or 42
    [2]

The following would work:

    x = (list(islice(iterable, i, i+1)) or [default])[0]

However, islice() doesn't support negative indexes; getitem() does.

> Also, as a practical matter, I think it is a bad idea to introduce __getitem__ style access to itertools because the starting point moves with each consecutive access:
>
>     # access items 0, 2, 5, 9, 14, 20, ...
>     for i in range(10):
>         print getitem(iterable, i)
>
> Worse, this behavior changes depending on whether the iterable is re-iterable (a string would yield consecutive items while a generator would skip around as shown above).

islice() has the same "problem":

    >>> from itertools import *
    >>> iterable = iter(xrange(100))
    >>> for i in range(10):
    ...     print list(islice(iterable, i, i+1))
    [0]
    [2]
    [5]
    [9]
    [14]
    [20]
    [27]
    [35]
    [44]
    [54]

    >>> iterable = xrange(100)
    >>> for i in range(10):
    ...     print list(islice(iterable, i, i+1))
    [0]
    [1]
    [2]
    [3]
    [4]
    [5]
    [6]
    [7]
    [8]
    [9]

> Besides being a bug factory, I think the getitem proposal would tend to steer people down the wrong road, away from more natural solutions to problems involving iterators.

I don't think that

    (list(islice(iterable, i, i+1)) or [default])[0]

is more natural than

    getitem(iterable, i, default)

> A basic step in learning the language is to differentiate between sequences and general iterators -- we should not conflate the two.

Servus,
Walter
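The islice() workaround discussed in the message above can be checked in a short Python 3 sketch (range in place of xrange). The helper name item_via_islice is hypothetical and exists only for this illustration.

```python
from itertools import islice


def item_via_islice(iterable, i, default=None):
    """Hypothetical wrapper around the one-liner from the message above."""
    # The one-element list is replaced by [default] when the slice is empty,
    # and the single element is then unwrapped with [0].
    return (list(islice(iterable, i, i + 1)) or [default])[0]


naive = list(islice(range(10), 2, 3)) or 42    # [2]: a list, not the item
fixed = item_via_islice(range(10), 2, 42)      # 2
missing = item_via_islice(range(3), 7, 42)     # 42

# islice() rejects negative indexes outright.
try:
    islice(range(10), -1, None)
except ValueError:
    negative_supported = False
else:
    negative_supported = True
```

This only confirms the two technical claims made in the reply (the `or default` form yields a list, and islice() has no negative indexing); it does not settle the design question.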
On 7/9/07, Raymond Hettinger wrote:
> Also, as a practical matter, I think it is a bad idea to introduce __getitem__ style access to itertools because the starting point moves with each consecutive access:
>
>     # access items 0, 2, 5, 9, 14, 20, ...
>     for i in range(10):
>         print getitem(iterable, i)
>
> Worse, this behavior changes depending on whether the iterable is re-iterable (a string would yield consecutive items while a generator would skip around as shown above).
>
> Besides being a bug factory, I think the getitem proposal would tend to steer people down the wrong road, away from more natural solutions to problems involving iterators. A basic step in learning the language is to differentiate between sequences and general iterators -- we should not conflate the two.

But doesn't the very same argument also apply against islice(), which you just offered as an alternative?

PS. If Walter is also at EuroPython, maybe you two could discuss this in person?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
> On 7/9/07, Raymond Hettinger wrote:
> > Also, as a practical matter, I think it is a bad idea to introduce __getitem__ style access to itertools because the starting point moves with each consecutive access:
> >
> >     # access items 0, 2, 5, 9, 14, 20, ...
> >     for i in range(10):
> >         print getitem(iterable, i)
> >
> > Worse, this behavior changes depending on whether the iterable is re-iterable (a string would yield consecutive items while a generator would skip around as shown above).
> >
> > Besides being a bug factory, I think the getitem proposal would tend to steer people down the wrong road, away from more natural solutions to problems involving iterators. A basic step in learning the language is to differentiate between sequences and general iterators -- we should not conflate the two.
>
> But doesn't the very same argument also apply against islice(), which you just offered as an alternative?

Exactly.

> PS. If Walter is also at EuroPython, maybe you two could discuss this in person?

Sorry, I won't be at EuroPython.

Servus,
Walter
From: "Guido van Rossum"
> But doesn't the very same argument also apply against islice(), which you just offered as an alternative?

Not really. The use cases for islice() typically do not involve repeated slices of an iterator unless it is slicing off the front few elements on each pass. In contrast, getitem() is all about grabbing something other than the frontmost element and seems to be intended for repeated calls on the same iterator.

And its support for negative indices seems somewhat weird in the context of general purpose iterators: getitem(genprimes(), -1).

I'll study Walter's use case but my instincts say that adding getitem() will do more harm than good.

Raymond
Raymond Hettinger wrote:
> From: "Guido van Rossum"
> > But doesn't the very same argument also apply against islice(), which you just offered as an alternative?
>
> Not really. The use cases for islice() typically do not involve repeated slices of an iterator unless it is slicing off the front few elements on each pass. In contrast, getitem() is all about grabbing something other than the frontmost element and seems to be intended for repeated calls on the same iterator.

That wouldn't make sense as getitem() consumes the iterator! ;)

But seriously: perhaps the name getitem() is misleading? What about item() or pickitem()?

> And its support for negative indices seems somewhat weird in the context of general purpose iterators: getitem(genprimes(), -1).

This does indeed make as much sense as sum(itertools.count()).

> I'll study Walter's use case but my instincts say that adding getitem() will do more harm than good.

Here's the function in use (somewhat invisibly, as it's used by the walknode() method). This gets the oldest news from Python's homepage:

    >>> from ll.xist import parsers, xfind
    >>> from ll.xist.ns import html
    >>> e = parsers.parseURL("http://www.python.org", tidy=True)
    >>> print e.walknode(html.h2 & xfind.hasclass("news"))[-1]
    Google Adds Python Support to Google Calendar Developer's Guide

Get the first comment line from a python file:

    >>> getitem((line for line in open("Lib/codecs.py") if line.startswith("#")), 0)
    '### Registry and builtin stateless codec functions\n'

Create a new unused identifier:

    >>> def candidates(base):
    ...     yield base
    ...     for suffix in count(2):
    ...         yield "%s%d" % (base, suffix)
    ...
    >>> usedids = set(("foo", "bar"))
    >>> getitem((i for i in candidates("foo") if i not in usedids), 0)
    'foo2'

Servus,
Walter
On 09/07/2007 21.23, Walter Dörwald wrote:
>     >>> from ll.xist import parsers, xfind
>     >>> from ll.xist.ns import html
>     >>> e = parsers.parseURL("http://www.python.org", tidy=True)
>     >>> print e.walknode(html.h2 & xfind.hasclass("news"))[-1]
>     Google Adds Python Support to Google Calendar Developer's Guide
>
> Get the first comment line from a python file:
>
>     >>> getitem((line for line in open("Lib/codecs.py") if line.startswith("#")), 0)
>     '### Registry and builtin stateless codec functions\n'
>
> Create a new unused identifier:
>
>     >>> def candidates(base):
>     ...     yield base
>     ...     for suffix in count(2):
>     ...         yield "%s%d" % (base, suffix)
>     ...
>     >>> usedids = set(("foo", "bar"))
>     >>> getitem((i for i in candidates("foo") if i not in usedids), 0)
>     'foo2'

You keep posting examples where you call your getitem() function with "0" as index, or -1.

getitem(it, 0) already exists and it's spelled it.next(). getitem(it, -1) might be useful in fact, and it might be spelled last(it) (or it.last()). Then one may want to add first() for symmetry, but that's it:

    first(i for i in candidates("foo") if i not in usedids)
    last(line for line in open("Lib/codecs.py") if line[0] == '#')

Are there real-world use cases for getitem(it, n) with n not in (0, -1)? I share Raymond's feelings on this. And by the way, if you wonder, I have these exact feelings as well for islice... :)

--
Giovanni Bajo
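The first()/last() spelling suggested above does not exist in itertools; the following is a minimal sketch of what such helpers might look like in current Python. Raising ValueError on an empty iterable is an assumption of this sketch (the message does not say what should happen), and the sample lines stand in for the contents of a file.

```python
from collections import deque
from itertools import count


def first(iterable):
    """Hypothetical helper: return the first item of `iterable`."""
    for item in iterable:
        return item
    raise ValueError("first() of an empty iterable")


def last(iterable):
    """Hypothetical helper: return the last item, consuming the iterable."""
    tail = deque(iterable, maxlen=1)   # keeps only the final element
    if not tail:
        raise ValueError("last() of an empty iterable")
    return tail[0]


def candidates(base):
    # Same generator as in the quoted example: base, base2, base3, ...
    yield base
    for suffix in count(2):
        yield "%s%d" % (base, suffix)


usedids = {"foo", "bar"}
new_id = first(i for i in candidates("foo") if i not in usedids)   # 'foo2'

lines = ["# one\n", "code\n", "# two\n"]   # assumed sample data
last_comment = last(line for line in lines if line[0] == "#")      # '# two\n'
```

As with any "last" operation on an iterator, last() only terminates for finite input.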
Giovanni Bajo wrote:
> On 09/07/2007 21.23, Walter Dörwald wrote:
> >     >>> from ll.xist import parsers, xfind
> >     >>> from ll.xist.ns import html
> >     >>> e = parsers.parseURL("http://www.python.org", tidy=True)
> >     >>> print e.walknode(html.h2 & xfind.hasclass("news"))[-1]
> >     Google Adds Python Support to Google Calendar Developer's Guide
> >
> > Get the first comment line from a python file:
> >
> >     >>> getitem((line for line in open("Lib/codecs.py") if line.startswith("#")), 0)
> >     '### Registry and builtin stateless codec functions\n'
> >
> > Create a new unused identifier:
> >
> >     >>> def candidates(base):
> >     ...     yield base
> >     ...     for suffix in count(2):
> >     ...         yield "%s%d" % (base, suffix)
> >     ...
> >     >>> usedids = set(("foo", "bar"))
> >     >>> getitem((i for i in candidates("foo") if i not in usedids), 0)
> >     'foo2'
>
> You keep posting examples where you call your getitem() function with "0" as index, or -1.
>
> getitem(it, 0) already exists and it's spelled it.next(). getitem(it, -1) might be useful in fact, and it might be spelled last(it) (or it.last()). Then one may want to add first() for symmetry, but that's it:
>
>     first(i for i in candidates("foo") if i not in usedids)
>     last(line for line in open("Lib/codecs.py") if line[0] == '#')
>
> Are there real-world use cases for getitem(it, n) with n not in (0, -1)? I share Raymond's feelings on this. And by the way, if you wonder, I have these exact feelings as well for islice... :)

It is useful for screen scraping HTML. Suppose you have the following HTML table:

    <table>
    <tr><td>01.01.2007</td><td>12.34</td><td>Foo</td></tr>
    <tr><td>13.01.2007</td><td>23.45</td><td>Bar</td></tr>
    <tr><td>04.02.2007</td><td>45.56</td><td>Baz</td></tr>
    <tr><td>27.02.2007</td><td>56.78</td><td>Spam</td></tr>
    <tr><td>17.03.2007</td><td>67.89</td><td>Eggs</td></tr>
    <tr><td> </td><td>164.51</td><td>Total</td></tr>
    <tr><td> </td><td>(incl. VAT)</td><td></td></tr>
    </table>

To extract the total sum, you want the second column from the second to last row, i.e. something like:

    row = getitem((r for r in table if r.name == "tr"), -2)
    col = getitem((c for c in row if c.name == "td"), 1)

Servus,
Walter
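The table example can be tried without the proposed function. The sketch below is an assumption-laden stand-in: it parses the table with the standard library's xml.etree.ElementTree (not the XIST tree used in the message, so elements expose .tag rather than .name) and spells the two getitem() calls with a bounded deque and islice().

```python
import xml.etree.ElementTree as ET
from collections import deque
from itertools import islice

HTML = """<table>
<tr><td>01.01.2007</td><td>12.34</td><td>Foo</td></tr>
<tr><td>13.01.2007</td><td>23.45</td><td>Bar</td></tr>
<tr><td>04.02.2007</td><td>45.56</td><td>Baz</td></tr>
<tr><td>27.02.2007</td><td>56.78</td><td>Spam</td></tr>
<tr><td>17.03.2007</td><td>67.89</td><td>Eggs</td></tr>
<tr><td> </td><td>164.51</td><td>Total</td></tr>
<tr><td> </td><td>(incl. VAT)</td><td></td></tr>
</table>"""

table = ET.fromstring(HTML)

# Equivalent of getitem(rows, -2): keep the last two rows, take the older one.
rows = (r for r in table if r.tag == "tr")
row = deque(rows, maxlen=2)[0]

# Equivalent of getitem(cols, 1): skip one cell, take the next.
cols = (c for c in row if c.tag == "td")
col = next(islice(cols, 1, None))

total = col.text   # '164.51'
```

The two spellings differ for the positive and the negative index, which is arguably the point made in the message: a single getitem() would cover both.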
participants (7)

- Georg Brandl
- Giovanni Bajo
- Guido van Rossum
- Kevin Jacobs <jacobs@bioinformed.com>
- Raymond Hettinger
- Steven Bethard
- Walter Dörwald