Mailman 3 Fwd: re.findfirst() - Python-ideas

Fwd: re.findfirst()

Juancarlo Añez

3 Dec 2019

5:53 a.m.

There are many ugly recipes about to handle the common use case that could be handled by:

    def findfirst(regex, text, default=None, flags=0):
        return next(finditer(regex, text, flags=flags), default)

The matching and return value semantics would be the same as those of *re.findall()*, but the search would stop on the first match, or return "default" if there wasn't a match (*findfirst()* will return a tuple when more than one group is matched).

Typically programmers will use:

    matched = re.findall(regex, text)[0]

which is inefficient, and incorrect when there is no match.

Typically, the pattern for the use case will be:

    m = re.search(regex, text)
    if m:
        matched = m.groups()
    else:
        matched = default

nowadays:

    matched = m.groups() if (m := re.search(regex, text)) else default

The semantics of *findall()* are nicer, and so would be those of *findfirst()*.

--
Juancarlo *Añez*
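To make the trade-off concrete, here is a minimal sketch of the two existing idioms side by side. The pattern and the sample strings are made up for illustration and do not come from the post.

```python
import re

PATTERN = r"(\w+)@(\w+)"  # illustrative pattern with two capture groups

# findall()[0] builds the complete list of matches just to take the first one.
first = re.findall(PATTERN, "alice@example bob@example")[0]

# It also raises IndexError when nothing matches.
try:
    re.findall(PATTERN, "no address here")[0]
    failed = False
except IndexError:
    failed = True

# The search()-based pattern copes with the no-match case,
# at the cost of going through a Match object.
m = re.search(PATTERN, "no address here")
matched = m.groups() if m else None
```

Note that `next()` only accepts its default positionally, so a helper built on `next(finditer(...), default)` has to pass it that way.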

Steven D'Aprano

3 Dec

6:06 a.m.

I'm sorry Juancarlo, it's not clear to me what *precisely* your proposal is. Are you asking for "findfirst" to be a builtin? A regex helper function? A method on regex objects? Something else? -- Steven

Paul Moore

6:43 a.m.

On Tue, 3 Dec 2019 at 12:07, Steven D'Aprano <steve@pearwood.info> wrote:

> I'm sorry Juancarlo, it's not clear to me what *precisely* your proposal is. Are you asking for "findfirst" to be a builtin? A regex helper function? A method on regex objects? Something else?

My impression is that he was asking for a re.findfirst(...) function to give a more discoverable name to the next(re.finditer(...)) idiom.

As a single example of defining a dedicated function to replace a one-liner, I think it's marginal at best (although discoverability *is* important here). But IMO it is true that using next(some_iterator) to mean "get the first value returned" is something that's needed relatively frequently, but often overlooked by people. I'm not sure there's a good solution, though - adding an alias first() for "next() when used to get the first element" is probably overkill, and apart from dedicated syntax, it would be hard to find something much shorter than next().

Maybe it's just an education issue, people aren't sufficiently familiar with the idiom?

Paul

Oscar Benjamin

7:46 p.m.

On Tue, 3 Dec 2019 at 12:48, Paul Moore <p.f.moore@gmail.com> wrote:

> My impression is that he was asking for a re.findfirst(...) function to give a more discoverable name to the next(re.finditer(...)) idiom.
>
> As a single example of defining a dedicated function to replace a one-liner, I think it's marginal at best (although discoverability *is* important here). But IMO it is true that using next(some_iterator) to mean "get the first value returned" is something that's needed relatively frequently, but often overlooked by people. I'm not sure there's a good solution, though - adding an alias first() for "next() when used to get the first element" is probably overkill, and apart from dedicated syntax, it would be hard to find something much shorter than next().
>
> Maybe it's just an education issue, people aren't sufficiently familiar with the idiom?

What exactly is the idiom here?

Using bare next is not a good idea because it leaks StopIteration which can have awkward side effects. So are you suggesting something like

    result = next(re.finditer(...), None)
    if result is None:
        # raise or something
    else:
        # use result

I would be in favour of adding an alternative to next that raises a different exception when the result isn't found.

--
Oscar
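One possible shape for such an alternative, as a rough sketch only: the name `first` and the choice of ValueError are assumptions made here, not something the thread settles on.

```python
import re


def first(iterable):
    """Return the first item of iterable, or raise ValueError if it is empty.

    Hypothetical helper; it is not part of the standard library.
    """
    try:
        return next(iter(iterable))
    except StopIteration:
        # Translate StopIteration so it cannot leak into a surrounding
        # generator or loop and be mistaken for normal exhaustion.
        raise ValueError("first() called on an empty iterable") from None


match = first(re.finditer(r"\d+", "abc 123 def 456"))
number = match.group(0)
```

Because the helper never lets StopIteration escape, a failed lookup inside a generator surfaces as an ordinary ValueError at the call site.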

Random832

8:07 p.m.

On Tue, Dec 3, 2019, at 20:46, Oscar Benjamin wrote:

> What exactly is the idiom here?
>
> Using bare next is not a good idea because it leaks StopIteration which can have awkward side effects. So are you suggesting something like
>
>     result = next(re.finditer(...), None)
>     if result is None:
>         # raise or something
>     else:
>         # use result
>
> I would be in favour of adding an alternative to next that raises a different exception when the result isn't found.

result, *_ = re.finditer() raises ValueError. Perhaps a way to prevent the values from being consumed and a list constructed would be useful - maybe add a "result, * =" syntax?

C# has First, Single, FirstOrDefault, and SingleOrDefault methods [the OrDefault versions return null/zero instead of raising an exception, and single raises if there are multiple items]

These can be envisioned roughly as

    def first(it):
        x, *_ = it
        return x

    def first_or_default(it):
        x, *_ = [*it] or [None]
        return x

    def single(it):
        x, = it
        return x

    def single_or_default(it):
        x, = [*it] or [None]
        return x
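The unpacking versions above consume the whole iterator. As a sketch of how the same helpers might avoid that, the following variants pull at most one or two items; they are hypothetical illustrations, not the C# semantics in full and not an existing Python API.

```python
from itertools import islice


def first_or_default(iterable, default=None):
    """Return the first item, or default if the iterable is empty."""
    # The two-argument form of next() stops after one item.
    return next(iter(iterable), default)


def single(iterable):
    """Return the only item; raise ValueError for zero items or several."""
    # Two items are enough to tell "exactly one" from "more than one".
    items = list(islice(iterable, 2))
    if len(items) != 1:
        problem = "no items" if not items else "more than one item"
        raise ValueError("single() got " + problem)
    return items[0]
```

With these, `first_or_default(gen)` leaves the rest of a generator untouched, which the `x, *_ = it` spelling does not.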

Andrew Barnert

9:27 p.m.

On Dec 3, 2019, at 17:47, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

> What exactly is the idiom here?

The OP’s proposal was for a findfirst function that takes a mandatory default value. So presumably the idiom is just next with a default value:

    def findfirst(pattern, text, default):
        return next(finditer(pattern, text), default)

And this “first or default” thing is pretty common, not just restricted to the OP’s use case, and next already handles it perfectly, but apparently not enough people know about it.

> Using bare next is not a good idea because it leaks StopIteration which can have awkward side effects.

Not the two-argument form.

> So are you suggesting something like
>
>     result = next(re.finditer(...), None)
>     if result is None:
>         # raise or something
>     else:
>         # use result

Using None as a sentinel would work here, but as a generic idiom it’s not a good habit, because plenty of other functions can validly iterate None.

> I would be in favour of adding an alternative to next that raises a different exception when the result isn't found.

If you need that, it’s pretty trivial to write yourself. If you think other people need it and don’t know how to write it, why not submit it to more-itertools and/or toolz? If it gets enough uptake, you can always suggest adding it to itertools, or even modifying next to take a keyword argument or something.

Serhiy Storchaka

4 Dec

2:31 a.m.

04.12.19 03:46, Oscar Benjamin wrote:

>     result = next(re.finditer(...), None)
>     if result is None:
>         # raise or something
>     else:
>         # use result

`next(re.finditer(...), None)` is a weird way of writing `re.search(...)`. `next(re.finditer(...), defaults)` is the same as `re.search(...) or defaults`.
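A quick way to check this claim on a sample input, as a sketch (the pattern, text, and default value are illustrative assumptions):

```python
import re

pattern, text = r"\d+", "abc 123 def 456"

via_finditer = next(re.finditer(pattern, text), None)
via_search = re.search(pattern, text)

# Match objects are not compared by value, so compare what they matched.
same_span = via_finditer.span() == via_search.span()
same_text = via_finditer.group(0) == via_search.group(0)

# With no match, both spellings fall back to the default.
default = "n/a"
no_match_finditer = next(re.finditer(pattern, "abc"), default)
no_match_search = re.search(pattern, "abc") or default
```

The `or` spelling relies on a successful Match object always being truthy.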

Guido van Rossum

10:05 a.m.

On Wed, Dec 4, 2019 at 12:34 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

> `next(re.finditer(...), None)` is a weird way of writing `re.search(...)`.
>
> `next(re.finditer(...), defaults)` is the same as `re.search(...) or defaults`.

Not so fast. re.search() returns a Match object, while re.finditer() and re.findall() return strings. For people who are just interested in strings, the Match object is just a distraction.

I think I am +1 on adding re.findfirst() as proposed by the OP.

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Random832

10:50 a.m.

On Wed, Dec 4, 2019, at 11:05, Guido van Rossum wrote:

> Not so fast. re.search() returns a Match object, while re.finditer() and re.findall() return strings. For people who are just interested in strings, the Match object is just a distraction. I think I am +1 on adding re.findfirst() as proposed by the OP.

Er, findall returns strings, but finditer returns match objects.

And as a side note, PEP 505 would allow the case of wanting the string or None to be written as re.search(...)?.group(0).

(Since findall returns a list, it can be written as ....findall(...)[0], which is much better than next(iter(....findall(...))).)

Guido van Rossum

10:59 a.m.

On Wed, Dec 4, 2019 at 8:54 AM Random832 <random832@fastmail.com> wrote:

> On Wed, Dec 4, 2019, at 11:05, Guido van Rossum wrote:
>
> > Not so fast. re.search() returns a Match object, while re.finditer() and re.findall() return strings. For people who are just interested in strings, the Match object is just a distraction. I think I am +1 on adding re.findfirst() as proposed by the OP.
>
> Er, findall returns strings, but finditer returns match objects.

Sorry, my bad. Strike finditer() then, the point about findall() being different stands.

> And as a side note, PEP 505 would allow the case of wanting the string or None to be written as re.search(...)?.group(0).

Still not a particularly discoverable solution. (Given that regular expressions are often used by less sophisticated users.)

> (Since findall returns a list, it can be written as ....findall(...)[0], which is much better than next(iter(....findall(...))).)

Sure, but because both of these fail if there are no matches, we can't use them in general, and findfirst() is meant to address that by having a default rather than failing in that case.

--
--Guido van Rossum (python.org/~guido)

Serhiy Storchaka

11:13 a.m.

04.12.19 18:05, Guido van Rossum wrote:

> On Wed, Dec 4, 2019 at 12:34 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
>
> > `next(re.finditer(...), None)` is a weird way of writing `re.search(...)`.
> >
> > `next(re.finditer(...), defaults)` is the same as `re.search(...) or defaults`.
>
> Not so fast. re.search() returns a Match object, while re.finditer() and re.findall() return strings. For people who are just interested in strings, the Match object is just a distraction.

re.finditer() yields Match objects. re.findall() returns a list of strings or tuples. re.finditer() is more fundamental; re.findall() can be implemented using re.finditer():

    def findall(pattern, string):
        p = re.compile(pattern)
        result = []
        for m in p.finditer(string):
            if p.groups == 0:
                result.append(m.group(0))
            elif p.groups == 1:
                result.append(m.group(1))
            else:
                result.append(m.groups())
        return result

I suppose re.findall() is an older interface.

On the other hand, re.finditer() is roughly equivalent to the following code:

    def finditer(pattern, string):
        p = re.compile(pattern)
        pos = 0
        while True:
            m = p.search(string, pos=pos)
            if m is None:
                break
            yield m
            pos = m.end()

Actually it is a little more complex because of handling zero-width matches. Currently re.search() does not support the required option, so the real finditer() cannot be implemented in Python using only search(). But this is irrelevant to the first item: `next(re.finditer(...), None)` is always equal to `re.search(...)`.

> I think I am +1 on adding re.findfirst() as proposed by the OP.

It is not clear what it should return. A Match object, a string, a tuple, whatever? What should it return if no match is found -- None, an empty string, an empty tuple, an error? I suppose that different users can have different needs. It is not practical to provide functions for all combinations; it is easy to write a function for your needs using re.search(). We can only add some recipes in the documentation.

The concrete user code can be a little bit simpler (a one-liner) if we provide an empty match object. For example:

    (re.search(pattern, string) or EmptyMatch).groups()

Guido van Rossum

12:01 p.m.

On Wed, Dec 4, 2019 at 9:18 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

> [Guido]
> > I think I am +1 on adding re.findfirst() as proposed by the OP.
>
> It is not clear what it should return. A Match object, a string, a tuple, whatever? What should it return if no match is found -- None, an empty string, an empty tuple, an error? I suppose that different users can have different needs. It is not practical to provide functions for all combinations; it is easy to write a function for your needs using re.search(). We can only add some recipes in the documentation.
>
> The concrete user code can be a little bit simpler (a one-liner) if we provide an empty match object. For example:
>
>     (re.search(pattern, string) or EmptyMatch).groups()

Still pretty obscure. I propose that re.findfirst(...) should return the same thing as re.findall(...)[0] *if the findall() returns a non-empty list*, and otherwise it should return a default. The default defaults to None but can be set by passing default=... to the re.findfirst() call. For simple cases (no capturing groups, or a single one) that will return a string or the default value; if there are multiple capturing groups it will return a tuple of strings or the default. If the user always wants a tuple they can do so by specifying an appropriate tuple as default value; I don't propose to try and match the shape of the tuple on a successful match.

--
--Guido van Rossum (python.org/~guido)
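Read literally, that description could be sketched as follows. This is a hypothetical helper written for illustration, not code from the re module; the parameter order follows the OP's signature, and filling unmatched groups with "" is an assumption made to mirror what findall() reports.

```python
import re


def findfirst(pattern, string, default=None, flags=0):
    """Return what re.findall(...)[0] would return, or default if no match."""
    m = re.search(pattern, string, flags)
    if m is None:
        return default
    # findall() reports groups that did not participate as "" rather than None.
    groups = m.groups("")
    if not groups:
        return m.group(0)   # no capture groups: the whole match
    if len(groups) == 1:
        return groups[0]    # one capture group: that group's text
    return groups           # several capture groups: a tuple of strings


whole = findfirst("..", "abcd")
one_group = findfirst("(.).", "abcd")
two_groups = findfirst("((.).)", "abcd")
missing = findfirst("x", "abcd", default=())
```

Passing `default=()` is the "always a tuple" usage mentioned above; the helper makes no attempt to match the shape of a successful result.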

Serhiy Storchaka

12:43 p.m.

04.12.19 20:01, Guido van Rossum wrote:

> Still pretty obscure. I propose that re.findfirst(...) should return the same thing as re.findall(...)[0] *if the findall() returns a non-empty list*, and otherwise it should return a default. The default defaults to None but can be set by passing default=... to the re.findfirst() call. For simple cases (no capturing groups, or a single one) that will return a string or the default value; if there are multiple capturing groups it will return a tuple of strings or the default. If the user always wants a tuple they can do so by specifying an appropriate tuple as default value; I don't propose to try and match the shape of the tuple on a successful match.

I think this is too fast. We have a single request for this feature, and we do not know whether this behavior is what the OP wants, in what context such a function would be used, and how common such code is. It may be that using it is suboptimal, and the code can be simpler or more efficient if it uses search(), finditer() or other existing functions.

I do not want to add yet another function for a special case to the re module. It is already too complex.

In contrast, there were good reasons for adding fullmatch(), because it provides functionality which is difficult to implement with other functions. Actually some uses of match() and search() can be replaced with fullmatch(), and this can even fix possible bugs. I am currently working on a big patch for this (perhaps I will split it into several issues).
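One kind of bug that fullmatch() can avoid, as an assumed example (the message does not name a specific one, and the pattern and inputs here are illustrative):

```python
import re

# "$" also matches just before a trailing newline, so match() with an
# anchored pattern still accepts this input.
loose = re.match(r"\d+$", "123\n")

# fullmatch() requires the entire string to match, so the newline is rejected.
strict = re.fullmatch(r"\d+", "123\n")

# Input without the stray newline is accepted by fullmatch() as expected.
clean = re.fullmatch(r"\d+", "123")
```

Here `loose` is a Match object while `strict` is None, which is the difference that matters when validating input.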

Guido van Rossum

1:01 p.m.

Fair enough. I’ll let the OP defend his use case.

-- --Guido (mobile)

Serhiy Storchaka

3 Dec

11:02 a.m.

03.12.19 13:53, Juancarlo Añez wrote:

> There are many ugly recipes about to handle the common use case that could be handled by:
>
>     def findfirst(regex, text, default=None, flags=0):
>         return next(finditer(regex, text, flags=flags), default)

Oh, this is the strangest use of finditer() that I have seen! If finditer() is virtually an iterative search() (with some peculiarities for zero-width matches, which are irrelevant to the first search()), why not just use search()?

> Typically, the pattern for the use case will be:
>
>     m = re.search(regex, text)
>     if m:
>         matched = m.groups()
>     else:
>         matched = default
>
> nowadays:
>
>     matched = m.groups() if (m := re.search(regex, text)) else default
>
> The semantics of *findall()* are nicer, and so would be those of *findfirst()*.

Actually the semantics of the above code are different from the semantics of `re.findall(regex, text)[0]`. findall() yields strings if the pattern contains fewer than 2 capture groups and tuples if it contains more than 1 capture group.

    >>> re.findall('..', 'abcd')
    ['ab', 'cd']
    >>> re.findall('(.).', 'abcd')
    ['a', 'c']
    >>> re.findall('((.).)', 'abcd')
    [('ab', 'a'), ('cd', 'c')]

It is not clear what behavior you need. And I suppose that other users of findfirst() may need different behavior. search() is more powerful and allows you to get what you need.

It could be easier if search() returned an empty Match object instead of None when it does not find anything. However, that ship has sailed: changing search() would break code that checks `match is None`. But you can create your own Match-like object and use it to simplify expressions:

    matched = (m or MyMatch(default)).groups()
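A minimal sketch of what such a Match-like object might look like. The message only names `MyMatch`; the class body, the pattern, and the default value below are assumptions for illustration.

```python
import re


class MyMatch:
    """Hypothetical stand-in for a Match object; supports only groups()."""

    def __init__(self, default):
        self._default = default

    def groups(self):
        # Return the caller-supplied default in place of real capture groups.
        return self._default


default = ("", "")
pattern = r"(\w+)=(\w+)"  # illustrative pattern with two capture groups

m = re.search(pattern, "no pairs here")
matched = (m or MyMatch(default)).groups()

m = re.search(pattern, "key=value")
found = (m or MyMatch(default)).groups()
```

The expression works because a real Match object is always truthy, so the stand-in is only used when search() returns None.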

Andrew Barnert

12:38 p.m.

On Dec 3, 2019, at 09:05, Serhiy Storchaka <storchaka@gmail.com> wrote:

> Actually the semantics of the above code are different from the semantics of `re.findall(regex, text)[0]`. findall() yields strings if the pattern contains fewer than 2 capture groups and tuples if it contains more than 1 capture group.
>
>     >>> re.findall('..', 'abcd')
>     ['ab', 'cd']
>     >>> re.findall('(.).', 'abcd')
>     ['a', 'c']
>     >>> re.findall('((.).)', 'abcd')
>     [('ab', 'a'), ('cd', 'c')]
>
> It is not clear what behavior you need. And I suppose that other users of findfirst() may need different behavior. search() is more powerful and allows you to get what you need.

I think the point is that there are cases (like interactive exploration in the REPL) where findall and finditer are more convenient despite being less powerful, and in fact they’re more convenient because of this weird inconsistency. You know that your particular regexp has no capture groups and you just want the matched strings. Or you know that it does have capture groups, and you want the matched tuples of strings. Either way, you don’t need the match object, and having to deal with that (and, worse, with a match object or None) is just extra code that gets in the way (and that you can get wrong).

And I think the OP is right that there would be similar convenience uses for findfirst, if not even more of them. And it makes more sense to build that findfirst around findall or finditer than around search.

But, given how easy it is to build this on finditer, and that it uses a general pattern that works for every similar case rather than something specific to regexp, I agree that nothing needs to be done (except maybe to educate people better about next).
