findfirst() from findalliter(), and first
These are the implementations I added to my library: ```python def findalliter(pattern, string, flags=0): ''' like finditer(), but with return values like findall() implementation taken from cpython/Modules/_sre.c/findall() ''' for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield g else: yield m.group() def findfirst(pattern, string, flags=0, default=_undefined): """ Avoids using the inefficient findall(...)[0], or first(findall(...)) """ return first(findalliter(pattern, string, flags=flags), default=default) ``` Fon *first()*, maybe calling it *take_one()* will clear up misunderstandings about it's semantics. -- Juancarlo *Añez*
On Dec 27, 2019, at 09:00, Juancarlo Añez <apalala@gmail.com> wrote:
for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield g else: yield m.group()
I don’t think this does the same thing as findall in every case. For example, for capture groups that don’t participate in the match, you’ll get tuples like ('spam', None, '42'), when I’m pretty sure findall always has strings no matter what. I’m not sure exactly what the rule is for how it does that: maybe it’s just the same thing as m.groups(default='') would give you? At any rate, it seems like this isn’t as trivial to port from C as it looked, so this needs solid unit tests. (Maybe the ones for findall are already good enough if you just adapt them?) That’s also a great argument that it should be added to the re module, so people don’t have to try to figure out how to port C code to Python and then test the hell out of it just to get something that’s only missing in the first place for historical/naming reasons.
Good point, Andrew! I missed including the unit tests I wrote. I will now, at the bottom. For reference, this is the implementation of *findall()*: static PyObject * _sre_SRE_Pattern_findall_impl(PatternObject *self, PyObject *string, Py_ssize_t pos, Py_ssize_t endpos) /*[clinic end generated code: output=f4966baceea60aca input=5b6a4ee799741563]*/ { SRE_STATE state; PyObject* list; Py_ssize_t status; Py_ssize_t i, b, e; if (!state_init(&state, self, string, pos, endpos)) return NULL; list = PyList_New(0); if (!list) { state_fini(&state); return NULL; } while (state.start <= state.end) { PyObject* item; state_reset(&state); state.ptr = state.start; status = sre_search(&state, PatternObject_GetCode(self)); if (PyErr_Occurred()) goto error; if (status <= 0) { if (status == 0) break; pattern_error(status); goto error; } /* don't bother to build a match object */ switch (self->groups) { case 0: b = STATE_OFFSET(&state, state.start); e = STATE_OFFSET(&state, state.ptr); item = getslice(state.isbytes, state.beginning, string, b, e); if (!item) goto error; break; case 1: item = state_getslice(&state, 1, string, 1); if (!item) goto error; break; default: item = PyTuple_New(self->groups); if (!item) goto error; for (i = 0; i < self->groups; i++) { PyObject* o = state_getslice(&state, i+1, string, 1); if (!o) { Py_DECREF(item); goto error; } PyTuple_SET_ITEM(item, i, o); } break; } status = PyList_Append(list, item); Py_DECREF(item); if (status < 0) goto error; state.must_advance = (state.ptr == state.start); state.start = state.ptr; } state_fini(&state); return list; error: Py_DECREF(list); state_fini(&state); return NULL; } As far as I understand it, my implementation of *findalliter()* matches the semantics in the *switch* statement. This is my unit test. As you noted, it's not complete as it lacks a case for unmatched groups. def test_findfirst(): def check(pattern, string): try: g = re.findall(pattern, string) assert findfirst(pattern, string) == g[0], '%s %s' % (pattern, string) assert list(findalliter(pattern, string)) == g, '%s %s' % (pattern, string) except ValueError: assert len(g) == 0 with pytest.raises(ValueError): findfirst(pattern, string) s = 'xxxaxxxxbxxx' check('a', s) check('(a)', s) check('(a|b)', s) check('a.*b', s) check('(a).*(b)', s) # no match check('y', s) assert findfirst('y', s, default='THE DEFAULT') == 'THE DEFAULT' base = s for i in range(2, 10): s = base * i g = re.findall('a', s) assert len(g) > 1 check('a', s) check('(a)', s) check('(a|b)', s) check('a.*b', s) check('(a).*(b)', s) check('y', s) You're right about *findall()* *always* returning srings: [ins] In [1]: import re [ins] In [2]: re.findall(r'(a).*(b)?.*(c)', 'axxxxxxxxxxxxc') Out[2]: [('a', '', 'c')] [ins] In [3]: This is the matching implementation: for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield tuple(s if s else '' for s in g) else: yield m.group() Updated unit test: def test_findfirst(): def check(pattern, string): try: g = re.findall(pattern, string) assert findfirst(pattern, string) == g[0], '%s %s' % (pattern, string) assert list(findalliter(pattern, string)) == g, '%s %s' % (pattern, string) except ValueError: assert len(g) == 0 with pytest.raises(ValueError): findfirst(pattern, string) def all_checks(s): check('a', s) check('(a)', s) check('(a|b)', s) check('a.*b', s) check('(a).*(b)', s) check('(a).*(c)?.*(b)', s) # non-matching group # no match check('y', s) check('(y)?.*(z)?.*(q)', s) # empty matches check('(a).*(y?).*(b)', s) assert findfirst('y', s, default='THE DEFAULT') == 'THE DEFAULT' s = 'xxxaxxxxbxxx' all_checks(s) base = s for i in range(2, 10): s = base * i g = re.findall('a', s) assert len(g) == i all_checks(s) Cheers, On Fri, Dec 27, 2019 at 5:37 PM Andrew Barnert <abarnert@yahoo.com> wrote:
On Dec 27, 2019, at 09:00, Juancarlo Añez <apalala@gmail.com> wrote:
for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield g else: yield m.group()
I don’t think this does the same thing as findall in every case. For example, for capture groups that don’t participate in the match, you’ll get tuples like ('spam', None, '42'), when I’m pretty sure findall always has strings no matter what. I’m not sure exactly what the rule is for how it does that: maybe it’s just the same thing as m.groups(default='') would give you?
At any rate, it seems like this isn’t as trivial to port from C as it looked, so this needs solid unit tests. (Maybe the ones for findall are already good enough if you just adapt them?)
That’s also a great argument that it should be added to the re module, so people don’t have to try to figure out how to port C code to Python and then test the hell out of it just to get something that’s only missing in the first place for historical/naming reasons.
-- Juancarlo *Añez*
On Dec 28, 2019, at 10:12, Juancarlo Añez <apalala@gmail.com> wrote: As far as I understand it, my implementation of findalliter() matches the semantics in the switch statement.
There’s nothing outside the switch statement that converts PyNone values to empty strings, so whatever the difference is, it must be inside the switch statement. And, even though your control flow is the same, there are two obvious ways in which you aren’t using the same input as the C code, so the semantics aren’t going to be the same. The C code pulls values out of the pattern’s internal state without building a match object. My guess is that this is where the difference is—either building the match object, or somewhere inside the groups method, unmatched groups get converted into something else, which the groups method then replaces with its default parameter, which defaults to None, while the C state_getslice function is just pulling a 0-length string out of the input. But without diving into the code that’s just a guess. The C code also switches on the number of groups in the pattern (which I think is exposed from the compiled pattern object?), not the number of results in the current match. I’d guess that’s guaranteed to always be the same even in weird cases like nested groups, so isn’t relevant here, but again that’s just a guess.
This is the matching implementation:
for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield tuple(s if s else '' for s in g) else: yield m.group()
Why not just call groups(default='') instead of calling groups() to replace them with None and then using a genexpr to convert that None to ''? More importantly, you can’t return '', you have to return '' or b'' depending on the type of the input string, using the same rule (whatever it is) that findall and the rest of the module use. (I think that’s worked out at compile time and exposed on the compiler pattern object, but I’m not sure.) And even using a default with groups assumes I guessed right about the problem, and that it’s the only difference in behavior. If not, it may still be a hack that only sometimes gets the right answer and just nobody’s thought up a test case otherwise. I think you really do need to go through either the C code or the docs to make sure there aren’t any other edge cases.
Updated unit test:
Are there tests for findall and/or finditer in the stdlib test suite with wide coverage that you could adapt to compare list(findalliter) vs. findall or something?
All your observations are on the mark, Andrew, so I went ahead and coded the patch against *cpython*. This is the (draft) pull request. https://github.com/python/cpython/pull/17735 I did not write new tests for *findalliter() *or *findfirst()*, because the correctness of *findalliter()* is implied by having *findall()* reimplemented in terms of it, and the correctness of *findfirst()* depends entirely on the correctness of *first()*, and *first()* is an ongoing discussion. Later I may try implementing *findalliter()* in C (move the core of *findall()* to within the outer of *finditer()*?) Cheers, On Sat, Dec 28, 2019 at 4:03 PM Andrew Barnert <abarnert@yahoo.com> wrote:
On Dec 28, 2019, at 10:12, Juancarlo Añez <apalala@gmail.com> wrote:
As far as I understand it, my implementation of *findalliter()* matches the semantics in the *switch* statement.
There’s nothing outside the switch statement that converts PyNone values to empty strings, so whatever the difference is, it must be inside the switch statement. And, even though your control flow is the same, there are two obvious ways in which you aren’t using the same input as the C code, so the semantics aren’t going to be the same.
The C code pulls values out of the pattern’s internal state without building a match object. My guess is that this is where the difference is—either building the match object, or somewhere inside the groups method, unmatched groups get converted into something else, which the groups method then replaces with its default parameter, which defaults to None, while the C state_getslice function is just pulling a 0-length string out of the input. But without diving into the code that’s just a guess.
The C code also switches on the number of groups in the pattern (which I think is exposed from the compiled pattern object?), not the number of results in the current match. I’d guess that’s guaranteed to always be the same even in weird cases like nested groups, so isn’t relevant here, but again that’s just a guess.
This is the matching implementation:
for m in re.finditer(pattern, string, flags=flags): g = m.groups() if len(g) == 1: yield g[0] elif g: yield tuple(s if s else '' for s in g) else: yield m.group()
Why not just call groups(default='') instead of calling groups() to replace them with None and then using a genexpr to convert that None to ''?
More importantly, you can’t return '', you have to return '' or b'' depending on the type of the input string, using the same rule (whatever it is) that findall and the rest of the module use. (I think that’s worked out at compile time and exposed on the compiler pattern object, but I’m not sure.)
And even using a default with groups assumes I guessed right about the problem, and that it’s the only difference in behavior. If not, it may still be a hack that only sometimes gets the right answer and just nobody’s thought up a test case otherwise. I think you really do need to go through either the C code or the docs to make sure there aren’t any other edge cases.
Updated unit test:
Are there tests for findall and/or finditer in the stdlib test suite with wide coverage that you could adapt to compare list(findalliter) vs. findall or something?
-- Juancarlo *Añez*
On Fri, Dec 27, 2019, at 11:58, Juancarlo Añez wrote:
Fon *first()*, maybe calling it *take_one()* will clear up misunderstandings about it's semantics.
It may be worth mentioning that take [arbitrary n] is already present in itertools recipes. Though that gets into the debates about what should be an itertools recipe, an itertools function, or a builtin function, and we didn't really do anything productive with the last round of that.
participants (3)
-
Andrew Barnert
-
Juancarlo Añez
-
Random832