Fwd: Re: Fwd: re.findfirst()

On Wed, Dec 4, 2019 at 3:02 PM Guido van Rossum <guido@python.org> wrote:
Fair enough. I’ll let the OP defend his use case.
The OP thinks that the case for wanting just the string for a first regex match, or a verifiable default if there is no match, is way too common, that the advice on the web is not very good (it should be "write a findfirst() using next() over finditer()"), and that novices default to using findall(..)[0], which is troublesome.

The proposed implementation of a findfirst() would handle many common cases, and be friendly to newcomers (why do I need to deal with a Match object?), especially if the semantics are those of *findall()*:

    next(iter(findall(...)), default)

BTW, a common function in extensions to *itertools* is *first()*:

    def first(seq, default=None):
        return next(iter(seq), default)

That function, *first()*, would also be a nice addition in *itertools*, and *findfirst()* could be implemented using it. *first()* avoids most use cases needing to check if a sequence or iterator is empty before using a default value.

MHO is that *first()* deals with so many common cases that it should be a builtin.

Note that the case for *findfirst()* is weaker if *first()* is available. Yet *findfirst()* solves the bigger problem.

-- 
Juancarlo *Añez*
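For readers who want to try the idea, here is a minimal runnable sketch of first() with a few calls. The sample inputs are made up for illustration. Note that the built-in next() only accepts its default positionally, so the default is passed without a keyword.

```python
import re


def first(iterable, default=None):
    """Return the first item of iterable, or default if it is empty."""
    # next() takes its default positionally; next(it, default=x) raises TypeError.
    return next(iter(iterable), default)


print(first([3, 1, 2]))
print(first([], default="n/a"))
# Combined with findall(), as proposed above (illustrative input only).
print(first(re.findall(r"\d+", "no digits here"), default=""))
```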

On Dec 5, 2019, at 08:53, Juancarlo Añez <apalala@gmail.com> wrote:
I think this was proposed for itertools and rejected. I don’t remember why, but generally there’s resistance to adding anything that you could write yourself (and are unlikely to get wrong) on top of itertools and builtins, unless it needs to loop and yield itself (in which case it might need the performance boost of iterating in C instead of Python), because that’s what the recipes are for. And I suppose if you see the recipe for nth you don’t learn anything from the recipe for first.

But people seem more open to recipes being “everything useful” rather than only “everything useful that also teaches an important idea”, and the recipe docs even link to more-itertools for people looking to use them out of the box (and first is in more-itertools).

Also, I think it’s pretty clear that people often don’t think of first when they need it, so even if they could write it if they thought of it, they don’t because they don’t.

So maybe it’s worth at least adding first as a recipe, even if people don’t think it’s worth adding to the module itself?

(Personally, I use first if I’ve already imported more-itertools for something else, but otherwise I just call next(iter(…)).)

On Dec 5, 2019, at 08:53, Juancarlo Añez <apalala@gmail.com> wrote:
The proposed implementation of a findfirst() would handle many common cases, and be friendly to newcomers (why do I need to deal with a Match object?), specially if the semantics are those of findall():
next(iter(findall(...)), default)
The problem with using findall instead of finditer or search is that it scans the whole document rather than just until the first match, and it builds a potentially huge list just to throw it away. It’s pretty common that one or both of those will be a serious performance issue. Imagine asking to find the first double consonant in the OED and it takes a minute to run and pins a gigabyte of memory. It’s unfortunate that these functions aren’t better matched. Why is there a simple-semantics find-everything and a match-semantics find-iteratively and find-one? But I don’t think adding a simple-semantics find-one that works by inefficiently finding all is the right solution. And if the point of proposing first is that novices will figure out how to write first(findall(…)) so we don’t need to add findfirst, then I think we need findfirst even more, because novices shouldn’t learn that bad idea.
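To make the cost difference concrete, here is a small sketch contrasting the eager and lazy approaches. The sample text and the consonant pattern are assumptions chosen for illustration, not taken from anyone's real code. Note also that with one capturing group findall() returns the group rather than the whole match, which is part of the mismatch between these functions.

```python
import re

# Hypothetical stand-in for a large document: many early matches, so findall()
# has a long list to build while search() can stop almost immediately.
text = "aa bb " * 10_000 + "llama"
double_consonant = re.compile(r"([b-df-hj-np-tv-z])\1")

# Eager: scans all of text and materialises every match before [0] is applied.
all_matches = double_consonant.findall(text)
first_via_findall = all_matches[0]  # the captured group, not the whole match

# Lazy: search() returns as soon as the first match is found.
found = double_consonant.search(text)
first_via_search = found.group(0) if found else None

# Lazy: finditer() yields matches one at a time; next() pulls only the first.
first_via_finditer = next(
    (m.group(0) for m in double_consonant.finditer(text)), None
)

print(len(all_matches), first_via_findall, first_via_search, first_via_finditer)
```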

05.12.19 21:07, Guido van Rossum wrote:
The case for findfirst() becomes stronger! There seem plenty of ways to get this wrong.
I write several functions every day. There are many ways to get this wrong. But I do not propose to include all these functions in the stdlib. If I want to include even a single function, I try to find several examples that would benefit from adding this function to the stdlib. If I find fewer examples than I expected, I withdraw my idea. We still do not know a use case for findfirst. If the OP would show his code and several examples in others' code, this could be an argument for the usefulness of this feature.

Serhiy Storchaka wrote:
I'm not sure about the OP's exact use case, but using GitHub's code search for .py files that match "first re.findall" shows a decent amount of code that uses the format ``re.findall()[0]``. It would be nice if GitHub's search properly supported symbols and regular expressions, but this presents a decent number of examples. See https://github.com/search?l=Python&q=first+re.findall&type=Code.

I also spent some time looking for a few specific examples, since there were a number of false positives in the above results. Note that I didn't look much into the actual purpose of the code or judge it based on quality, I was just looking for anything that seemed remotely practical and contained something along the lines of ``re.findall()[0]``. Several of the links below contain multiple lines where findfirst would likely be a better alternative, but I only included one permalink per code file.

https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393...
https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6...
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6...
https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c...
https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e...

I'm sure there are far more examples and perhaps some more "realistic" ones, I only went through the first few pages of results.

On Thu, Dec 5, 2019 at 3:08 PM Serhiy Storchaka <storchaka@gmail.com> wrote:

To overcome GitHub's search limitations, one can use Chrome's codesearch or the public GitHub dataset available on BigQuery (note: it's only a sample from 2012 if I'm not mistaken).

https://cs.chromium.org/search/?q=lang:py+re%5C.findall%5C(.*%5C)%5C%5B0%5C%5D&sq=package:chromium&type=cs returns 5 results, while the following query:

    SELECT COUNT(*)
    FROM (
      SELECT c.id id, c.content content, f.repo_name repo_name, f.path path
      FROM `bigquery-public-data.github_repos.sample_files` f
      JOIN (
        SELECT * FROM `bigquery-public-data.github_repos.sample_contents`
      ) c ON f.id = c.id
      WHERE ENDS_WITH(f.path, ".py")
        AND REGEXP_CONTAINS(c.content, "re\\.findall\\(.*\\)\\[0\\]")
    )

returns 84 entries.

On Thu, Dec 5, 2019 at 6:51 PM Kyle Stanley <aeros167@gmail.com> wrote:
-- Sebastian Kreft

05.12.19 23:47, Kyle Stanley wrote:
Thank you Kyle for your investigation!
https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393...
It is easy to rewrite it using re.search().

- input_processor=MapCompose(lambda x: re.findall(r'pointDRI = ([0-9]+)', x)[0], eval),
+ input_processor=MapCompose(lambda x: re.search(r'pointDRI = ([0-9]+)', x).group(1), eval),

I also wonder if it is worth replacing eval with the more efficient and safer int.
https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6...
It is the same code differently formatted.
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- official_name = name_re.findall(town)[0].title()
+ official_name = name_re.search(town).group().title()
https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6...
- first_1_results = re.findall(first_1,all_list9)[0]
+ first_1_results = re.search(first_1,all_list9).group(1)
https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c...
It is a complex example which performs multiple searches with different regular expressions. It can all be replaced with a single, more efficient regular expression.

- if re.search('^(\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner )[0]
+ m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?', parcel.owner)
+ if m:
+     last, first, middle = m.groups()
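A quick way to see what the combined expression does is to run it on a few owner strings. This is a sketch only: the owner names below are invented, split_owner is a hypothetical helper, and the pattern is the one from the replacement above written as a raw string. One visible difference from the original chain is that middle comes back as None when it is absent, instead of being left unassigned.

```python
import re

# The single pattern from the rewrite above, as a raw string.
OWNER = re.compile(r"(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?")


def split_owner(owner):
    """Return (last, first, middle) or None if the string does not fit."""
    m = OWNER.fullmatch(owner)
    if m is None:
        return None
    return m.groups()  # middle is None when there is no middle name


for owner in ["SMITH JOHN", "SMITH JOHN Q", "SMITH JOHN & MARY",
              "SMITH JOHN Q & MARY ANN", "SMITH"]:
    print(repr(owner), "->", split_owner(owner))
```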
https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e...
This is the only example which checks whether findall() returns an empty list. It calls findall() twice! Fortunately it can be easily optimized using the fact that Match objects support subscripting. I used group() above because it is more explicit and works in older Python.

- self.item.first_tutor_name = REGPX_A.findall(value)[0] if REGPX_A.findall(value) else ''
+ self.item.first_tutor_name = (REGPX_A.search(value) or [''])[0]

It seems that in most cases the author simply does not know about re.search(). Adding re.findfirst() will not fix this.

Serhiy Storchaka wrote:
Thank you Kyle for your investigation!
No problem, this seemed like an interesting feature proposal and I was personally curious about the potential use cases. Thanks for the detailed analysis, I learned a few new things from it. (:

Serhiy Storchaka wrote:
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
This pattern seems to be common across most of the above examples (minus the last two), specifically replacing ``re.findall()[0]`` with ``re.search().group(1)`` when there are subgroups within the regex or ``re.search().group()`` without subgroups.

Serhiy Storchaka wrote:
It seems that in most cases the author just do not know about re.search(). Adding re.findfirst() will not fix this.
That's definitely possible, but it might be just as likely that they saw re.findall() as being simpler to use compared to re.search(). Although it has worse performance by a substantial amount when parsing decent amounts of text (assuming the first match isn't at the end), ``re.findall()[0]`` *consistently* returns the first string that was matched, as long as no subgroups were used. This allows them to circumvent the usage of match objects entirely, which makes it a bit easier to learn, especially for those who are less familiar with OOP, or are already familiar with other popular flavors of regex (such as JS).

I'll admit this is mostly speculation, but I think there's an especially large number of re users (compared to other modules) who aren't necessarily developers, and might just be someone who wants to write a script to quickly parse some documents. These types of users are the ones who would likely benefit the most from the proposed re.findfirst(), particularly if it directly returns a string as Guido is suggesting.

I think at the end of the day, the critical question to answer is this:

*Do we want to add a new helper function that's easy to use, consistent, and provides good performance for finding the first match, even if the functionality already exists within the module?*

Personally, I lean a bit more towards "yes", but I think that "no" would also be a reasonable response. From my perspective, a significant reason why Python is appealing to so many users that aren't professional developers is that it's much easier to pick up the basics. Python allows users to write a quick script with *decent* performance without having to learn too much, compared to most other mainstream programming languages. IMO, the addition of an re.findfirst() helps to reinforce that reason.
Another option to consider might be adding a boolean parameter to re.search() that changes the behavior to directly return a string instead of a match object, similar to re.findall() when there are not multiple subgroups. For example:
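A minimal sketch of how such a parameter might behave. re.search() has no such parameter, so this is written as a hypothetical wrapper named search_sketch, and only the parameter name match_obj comes from this message. It assumes the findall() conventions: whole match with no groups, the group itself with one group, a tuple with several, and '' for groups that did not participate. With no match it returns None, where ``findall()[0]`` would raise IndexError.

```python
import re


def search_sketch(pattern, string, flags=0, match_obj=True):
    """Hypothetical wrapper: re.search() plus a match_obj switch."""
    m = re.search(pattern, string, flags)
    if match_obj or m is None:
        return m  # default behaviour: Match object or None
    # findall() reports non-participating groups as '', so mirror that here.
    groups = m.groups(default="")
    if not groups:
        return m.group(0)   # no capture groups: the whole match
    if len(groups) == 1:
        return groups[0]    # one capture group: that group's text
    return groups           # several capture groups: a tuple


print(search_sketch(r"\d+", "order 66 of 99", match_obj=False))
print(search_sketch(r"(\d+) of (\d+)", "order 66 of 99", match_obj=False))
print(search_sketch(r"\d+", "order 66 of 99"))
```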
The above would have the same exact return value as ``pattern.findall()[0]``, but it's more efficient since it would only parse the text until the first match is found, and it doesn't need to create a list. For backwards compatibility, this parameter would default to True. Feel free to change the name if you like the idea, "match_obj" was simply the first one that came to my head.

The cons of this solution are that it might excessively overload re.search(), and that it might not be as noticeable or easy to find as the addition of a new function. But, it could provide the same functionality as the proposed re.findfirst(), without adding an entirely new function for behavior that already exists.

On Fri, Dec 6, 2019 at 2:47 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

06.12.19 23:20, Kyle Stanley wrote:
My concern is that this will add complexity to the module documentation, which is already too complex. re.findfirst() has more complex semantics (with no capture groups it returns this, with one capture group it returns that, and in other cases it returns something of yet another type) than re.search(), which just returns a match object or None. This will increase the chance that the user misses the appropriate function and uses suboptimal code like findall()[0].

re.finditer() is a more modern and powerful function than re.findall(). The latter may even be deprecated in the future.

In the future we may add a few more functions/methods: re.rmatch() (like re.match(), but matches at the end of the string instead of the start), re.rsearch() (searches from the end), re.rfinditer() (iterates in the reversed order). Unlike findfirst(), they will implement features that cannot be easily expressed using existing functions.
Oh, no, this is the worst idea!

Serhiy Storchaka wrote:
re.finditer() is a more modern and powerful function than re.findall(). The latter may even be deprecated in the future.
Hmm, perhaps another consideration then would be to think of improvements to make to the existing documentation, particularly with including some code examples or expanding upon the docs for re.finditer() to make its usage more clear. Personally, it took me quite a while to understand its role in the module (as someone who does not use it on a frequent basis). Code examples should of course be used sparingly, but I think re.finditer() could benefit from at least one. Especially considering that far less complex functions in the module have several examples. See https://docs.python.org/3.8/library/re.html#re.finditer. Serhiy Storchaka wrote:
Oh, no, this is the worst idea!
Yeah, after having some time to reflect on that idea a bit more I don't think it would work. That would just end up adding confusion to re.search(), ultimately defeating the purpose of the parameter in the first place. It would be too drastic of a change in behavior for a single parameter to make. Thanks for the honesty though, not all of my ideas are good ones. But, if I can come up with something half-decent every once in a while I think it's worth throwing them out there. (: On Sat, Dec 7, 2019 at 2:56 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

Code examples should of course be used sparingly, but I think re.finditer() could benefit from at least one
Clarification: I see that there's an example of it being used in https://docs.python.org/3.8/library/re.html#finding-all-adverbs-and-their-po... and one more complex example with https://docs.python.org/3.8/library/re.html#writing-a-tokenizer. I was specifically referring to including a basic example directly within https://docs.python.org/3.8/library/re.html#re.finditer, similar to the section for https://docs.python.org/3.8/library/re.html#re.split or https://docs.python.org/3.8/library/re.html#re.sub. Alternatively: creating a new section under https://docs.python.org/3.8/library/re.html#regular-expression-examples, titled "Finding the first match", where it briefly explains the difference in behavior between using re.findall()[0] and re.finditer().group(1) (or re.finditer.group() when there's not a subgroup). Based on the discussions in this thread and code examples, this seems to be rather commonly misunderstood. On Sat, Dec 7, 2019 at 7:29 AM Kyle Stanley <aeros167@gmail.com> wrote:

On Dec 7, 2019, at 04:51, Kyle Stanley <aeros167@gmail.com> wrote: Alternatively: creating a new section under https://docs.python.org/3.8/library/re.html#regular-expression-examples, titled "Finding the first match", where it briefly explains the difference in behavior between using re.findall()[0] and re.finditer().group(1) (or re.finditer.group() when there's not a subgroup).
Hold on, what is finditer().group(1) supposed to mean here? You’d need next(finditer()).group(1) or next(m.group(1) for m in finditer()) or something. But if you just want the first match, why are you using either findall or finditer instead of just search? Isn’t that exactly the confusion this thread was hoping to resolve, rather than forcing even more novices to deal with it by pushing them into it in a section named “Finding the first match”? Also (when there are subgroups), surely the relevant difference is either between findall()[0][0] and next(finditer()).group(1), which both return the first group of the first match, or between findall()[0] and next(finditer()).groups(), which both return a tuple of groups of the first match, not between findall()[0] and next(finditer()).group(1), which return a tuple vs. just the first one?
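The pairings described here are easy to check interactively. The following sketch uses a made-up key=value string and a two-group pattern, both chosen purely for illustration.

```python
import re

text = "k1=v1 k2=v2"                 # illustrative input
pattern = re.compile(r"(\w+)=(\w+)")  # two capture groups

# Tuple of groups of the first match, spelled two ways.
tuple_via_findall = pattern.findall(text)[0]
tuple_via_finditer = next(pattern.finditer(text)).groups()

# First group of the first match, spelled three ways.
group_via_findall = pattern.findall(text)[0][0]
group_via_finditer = next(pattern.finditer(text)).group(1)
group_via_search = pattern.search(text).group(1)

print(tuple_via_findall, tuple_via_finditer)
print(group_via_findall, group_via_finditer, group_via_search)
```

Only search() and finditer() stop after the first match; both findall() spellings build the full list first.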

That was a mistake, I intended to write re.search().group(1), not re.finditer().group(1). I clarified this in another reply to the thread about an hour ago. Sorry for the confusion, I wrote the reply after being up for a while and got re.finditer() and re.search() mixed up in my head. You're correct.

On Sat, Dec 7, 2019 at 7:57 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Thu, Dec 5, 2019 at 6:16 PM Juancarlo Añez <apalala@gmail.com> wrote:
Um, finditer() yields Match objects, and IIUC findfirst() should return a string, or a tuple of groups if there's more than one group. So the actual implementation would be a bit more involved. Something like this, to match findall() better:

    for match in re.finditer(pattern, text, flags=flags):
        # Only act on first match
        groups = match.groups()
        if not groups:
            return match.group(0)  # Whole match
        if len(groups) == 1:
            return groups[0]  # One group
        return groups
    # No match, use default
    return default

Alternatively, replace the first line with this:

    match = re.search(pattern, text, flags=flags)
    if match is not None:

(There are apparently subtle differences between re.search() and re.findall() -- not sure if they matter in this case.)

And if the point of proposing first is that novices will figure out how to write first(findall(…)) so we don’t need to add findfirst, then I think we need findfirst even more, because novices shouldn’t learn that bad idea.
Yes, my point exactly.
I posted another thread to argue in favor of *first()*, independently of *findfirst().*
Also agreed, I've observed that as a common pattern. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
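For anyone who wants to experiment, the outline in this message can be wrapped into a complete function. This is a sketch, not an agreed design: the signature (including the keyword name default) is an assumption, and the body follows the finditer() variant given above.

```python
import re


def findfirst(pattern, string, flags=0, default=None):
    """Sketch of the proposed findfirst(); the signature is an assumption."""
    for match in re.finditer(pattern, string, flags=flags):
        # Only the first match is ever examined; every branch returns.
        groups = match.groups()
        if not groups:
            return match.group(0)   # no capture groups: whole match
        if len(groups) == 1:
            return groups[0]        # one capture group: its text
        return groups               # several capture groups: tuple
    return default                  # no match at all


print(findfirst(r"\d+", "a 12 b 34"))
print(findfirst(r"(\w)=(\w)", "a=b c=d"))
print(findfirst(r"\d+", "no digits", default=""))
```

One place where this sketch still differs from ``findall()[0]``: a group that did not participate in the match comes back as None here, while findall() reports it as an empty string.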

06.12.19 04:31, Guido van Rossum wrote:
(There are apparently subtle differences between re.search() and re.findall() -- not sure if they matter in this case.)
There are no differences.

Also, analyzing examples from GitHub, in most cases the pattern contains no group or a single group, so the code can be written as (if there are no groups)

    result = (re.search(pattern, string) or [default])[0]

or (if there is a single group)

    result = (re.search(pattern, string) or ['', default])[1]

And since most code does not handle the case when the pattern is not found anyway, it can be simplified even more.
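The two one-liners rely on Match objects supporting subscripting (m[0] is m.group(0), m[1] is m.group(1)) and on a Match always being truthy, so the list after `or` is used only when search() returns None. A small sketch with hypothetical helper names first_whole and first_group:

```python
import re


def first_whole(pattern, string, default=None):
    # No capture groups: index 0 is the whole match, or default on no match.
    return (re.search(pattern, string) or [default])[0]


def first_group(pattern, string, default=None):
    # One capture group: index 1 is that group, or default on no match.
    # The '' at index 0 is only a placeholder so that default sits at index 1.
    return (re.search(pattern, string) or ["", default])[1]


print(first_whole(r"\d+", "a 12 b 34"))
print(first_whole(r"\d+", "no digits", "n/a"))
print(first_group(r"id=(\d+)", "id=7 id=8"))
print(first_group(r"id=(\d+)", "nothing here", "0"))
```

If the pattern matches but the group itself is optional and did not participate, first_group() returns None rather than the default.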

My glitch. In my mind *finditer()* returned what *findall()* does, but it returns *Match* objects. The implementation based on *search()* seems appropriate.

I just looked in *_sre.c*, and *findall()* uses *search()* and is quite optimized.

It seems that the good implementation would be to write a *findalliter()* using the current *findall()* code, and implement *findall()* and *findfirst()* by calling that.

On Thu, Dec 5, 2019 at 10:31 PM Guido van Rossum <guido@python.org> wrote:
-- Juancarlo *Añez*

On Thu, Dec 5, 2019, at 12:25, Andrew Barnert via Python-ideas wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?

On Dec 6, 2019, at 09:51, Random832 <random832@fastmail.com> wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?
That’s a clever idea, and it might work.

For iteration, the only question is what it returns when there’s only one capture group. If you do that with the findall entries you’ll get a tuple of the characters in the string, rather than a single-element tuple. I don’t think that’s behavior anyone would actually want for tuple(match) if we were designing the whole re module API from scratch. But would it be too inconsistent if you didn’t do it that way?

For string, str(match) already works, and sometimes provides useful debugging info. At the REPL this is probably no big deal (it’s easier to dump the repr than the str anyway), but what about logs? For example, I’ve got a parse error on a request, and my logs tell me the last successful match was <_sre.SRE_Match object; span=(21137, 21142), match='alpha'>, so I know to look around 21137 characters into the request to find the problem. After upgrading Python, the logs would just say alpha, which wouldn’t help me. I’d have to go change the code to log %r instead of %s (or, maybe, stop being so hacky and explicitly log the span and groups, and also log where the failed search started rather than guessing from the previous one, and make the parser give useful errors in the first place, etc.) before I could debug future requests.

You’re not supposed to even rely on repr being consistent across Python implementations and versions, much less on str being developer- rather than user-friendly, but sometimes people do, and sometimes we all have to deal with their code. I don’t think this is a huge objection, but it is worth figuring out how often and how badly people would be affected.

On 2019-12-06 18:24, Andrew Barnert via Python-ideas wrote:
1. Match objects are also returned by re.match, and you wouldn't expect that to look for more matches.
2. What would tuple(re.search(...)) do? Wouldn't it do the same as tuple(re.findall(...))?
3. a, b = re.search(...) would fail if it didn't return exactly 2 matches, and it would keep looking after the second match for a third match because that's how assigning from an iterator currently works - it's iterated until it's exhausted.

On Fri, Dec 6, 2019, at 14:50, MRAB wrote:
I'm not sure what you meant by looking for more matches, though I suspect it's because, as below, I wasn't clear with what I meant by iterable semantics.
2. What would tuple(re.search(...)) do? Wouldn't it do the same as tuple(re.findall(...))?
I intended the tuple [well, the iterable semantics that would allow the tuple call to succeed] to return m.groups(), i.e. the same tuple as re.findall()[0] does when the re contains capturing groups. Sorry for not making that clear enough.
3. a, b = re.search(...) would fail if it didn't return exactly 2 matches
2 capturing groups, not 2 matches. Again, sorry for not making that explicitly clear.
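The clarified semantics can be tried out with a thin wrapper, since the built-in Match type cannot be changed from Python code. This is a sketch under that assumption: FriendlyMatch is a hypothetical name, str() gives group(0), and iteration walks over groups().

```python
import re


class FriendlyMatch:
    """Hypothetical wrapper giving a Match str and iterable semantics."""

    def __init__(self, match):
        self._match = match

    def __str__(self):
        # str(m) is the whole matched text, i.e. m.group(0).
        return self._match.group(0)

    def __iter__(self):
        # Iterating yields the capturing groups, i.e. m.groups().
        return iter(self._match.groups())


m = FriendlyMatch(re.search(r"(\w+)@(\w+)", "mail bob@example now"))
user, host = m          # unpacks the 2 capturing groups, not 2 matches
print(str(m), tuple(m), user, host)
```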

The semantics are not the same as those of:

    re.findall(...)[0]

Subscripting a Match object will yield the match for a single group, which is always a string, while the first element in the list returned by *findall()* will be a tuple if several groups matched.

As others have pointed out, there is an asymmetry in the library regarding Match-return and string/tuple-return functions, and that leads to *findall(...)[0]*.

-- 
Juancarlo *Añez*

06.12.19 19:49, Random832 wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?
What are the semantics of these operations?

06.12.19 23:21, Random832 wrote:
This is incompatible with subscripting. match[0] returns match.group(0), not match.groups()[0]. This idea was already discussed and was rejected as ambiguous. https://bugs.python.org/issue9529
def __str__(self): return self.group(0)
If you can use group() and groups(), which return exactly what you need, why do you want to use str() and tuple()?

On Sat, Dec 7, 2019, at 01:43, Serhiy Storchaka wrote:
This is incompatible with subscripting. match[0] returns match.group(0), not match.groups()[0].
And dict[0] returns the value whose key is 0, not the first key of the dictionary. set[0] does not work at all. There is no general guarantee of consistency between iteration and subscripting.
Recall that this thread is about proposing a new redundant method for regexes, on the apparent theory that match objects are too hard to use and so people need a method that just returns a string or a tuple instead of a match object. I could just as well ask, if you can use re.search(...).group(0) why do you want re.findfirst(...)?

On Dec 5, 2019, at 08:53, Juancarlo Añez <apalala@gmail.com> wrote:
I think this was proposed for itertools and rejected. I don’t remember why, but generally there’s resistance to adding anything that you could write yourself (and are unlikely to get wrong) on top of itertools and builtins, unless it needs to loop and yield itself (in which case it might need the performance boost of iterating in C instead of Python), because that’s what the recipes are for. And I suppose if you see the recipe for nth you don’t learn anything from the recipe for first. But people seem more open to recipes being “everything useful” rather than only “everything useful that also teaches an important idea”, and the recipe docs even link to more-itertools for people looking to use them out of the box (and first is in more-itertools). Also, I think it’s pretty clear that people often don’t think of first when they need it, so even if they could write it if they thought of it, they don’t because they don’t. So maybe it’s worth at least adding first as a recipe, even if people don’t think it’s worth adding to the module itself? (Personally, I use first if I’ve already imported more-itertools for something else, but otherwise I just next Iter.)

On Dec 5, 2019, at 08:53, Juancarlo Añez <apalala@gmail.com> wrote:
The proposed implementation of a findfirst() would handle many common cases, and be friendly to newcomers (why do I need to deal with a Match object?), specially if the semantics are those of findall():
next(iter(findall(...)), default=default)
The problem with using findall instead of finditer or search is that it scans the whole document rather than just until the first match, and it builds a potentially huge list just to throw it away. It’s pretty common that one or both of those will be a serious performance issue. Imagine asking to find the first double consonant in the OED and it takes a minute to run and pins a gigabyte of memory. It’s unfortunate that these functions aren’t better matched. Why is there a simple-semantics find-everything and a match-semantics find-iteratively and find-one? But I don’t think adding a simple-semantics find-one that works by inefficiently finding all is the right solution. And if the point of proposing first is that novices will figure out how to write first(findall(…)) so we don’t need to add findfirst, then I think we need findfirst even more, because novices shouldn’t learn that bad idea.

05.12.19 21:07, Guido van Rossum пише:
The case for findfirst() becomes stronger! There seem plenty of ways to get this wrong.
I write several functions every day. There are many ways to get this wrong. But I do not propose to include all these functions in the stdlib. If I want to include even a single function, I try to find several examples that would benefit from adding this function in the stdlib. If I found less examples than I expected I withdraw my idea. We still do not know a use case for findfirst. If the OP would show his code and several examples in others code this could be an argument for usefulness of this feature.

Serhiy Storchaka wrote:
I'm not sure about the OP's exact use case, but using GitHub's code search for .py files that match with "first re.findall" shows a decent amount of code that uses the format ``re.findall()[0]``. It would be nice if GitHub's search properly supported symbols and regular expressions, but this presents a decent number of examples. See https://github.com/search?l=Python&q=first+re.findall&type=Code. I also spent some time looking for a few specific examples, since there were a number of false positives in the above results. Note that I didn't look much into the actual purpose of the code or judge it based on quality, I was just looking for anything that seemed remotely practical and contained something along the lines of ``re.findall()[0]``. Several of the links below contain multiple lines where findfirst would likely be a better alternative, but I only included one permalink per code file. https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393... https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6... https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab... https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab... https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6... https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c... https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e... I'm sure there are far more examples and perhaps some more "realistic" ones, I only went through the first few pages of results. On Thu, Dec 5, 2019 at 3:08 PM Serhiy Storchaka <storchaka@gmail.com> wrote:

To overcome GitHub's search limitations, one can use Chrome's codesearch or the public github dataset available on bigquery (note: it's only a sample from 2012 if I'm not mistaken).
https://cs.chromium.org/search/?q=lang:py+re%5C.findall%5C(.*%5C)%5C%5B0%5C%5D&sq=package:chromium&type=cs returns 5 results, while the following query:
    SELECT COUNT(*)
    FROM (
      SELECT c.id id, c.content content, f.repo_name repo_name, f.path path
      FROM `bigquery-public-data.github_repos.sample_files` f
      JOIN (
        SELECT * FROM `bigquery-public-data.github_repos.sample_contents`
      ) c ON f.id = c.id
      WHERE ENDS_WITH(f.path, ".py")
        AND REGEXP_CONTAINS(c.content, "re\\.findall\\(.*\\)\\[0\\]")
    )
returns 84 entries.
On Thu, Dec 5, 2019 at 6:51 PM Kyle Stanley <aeros167@gmail.com> wrote:
-- Sebastian Kreft

05.12.19 23:47, Kyle Stanley wrote:
Thank you Kyle for your investigation!
https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393...
It is easy to rewrite it using re.search().
- input_processor=MapCompose(lambda x: re.findall(r'pointDRI = ([0-9]+)', x)[0], eval),
+ input_processor=MapCompose(lambda x: re.search(r'pointDRI = ([0-9]+)', x).group(1), eval),
I also wonder whether it is worth replacing eval with the more efficient and safer int.
https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6...
It is the same code differently formatted.
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- official_name = name_re.findall(town)[0].title()
+ official_name = name_re.search(town).group().title()
https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6...
- first_1_results = re.findall(first_1,all_list9)[0]
+ first_1_results = re.search(first_1,all_list9).group(1)
https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c...
It is a complex example which performs multiple searches with different regular expressions. It can all be replaced with a single, more efficient regular expression.
- if re.search('^(\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner )[0]
+ m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?', parcel.owner)
+ if m:
+     last, first, middle = m.groups()
https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e...
This is the only example which checks if findall() returns an empty list. It calls findall() twice! Fortunately it can be easily optimized using the fact that Match objects support subscripting. I used group() above because it is more explicit and works in older Python.
- self.item.first_tutor_name = REGPX_A.findall(value)[0] if REGPX_A.findall(value) else ''
+ self.item.first_tutor_name = (REGPX_A.search(value) or [''])[0]
It seems that in most cases the author just does not know about re.search(). Adding re.findfirst() will not fix this.
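A small sketch of why the rewrites above are drop-in replacements when a match exists, and how the two idioms differ when it does not. The sample text and the helper name `failure_kind` are made up for illustration; only `re.findall()` and `re.search()` come from the thread.

```python
import re

text = "pointDRI = 42; pointDRI = 7"
pattern = r"pointDRI = ([0-9]+)"

# Both expressions give the first captured group when a match exists.
via_findall = re.findall(pattern, text)[0]      # scans the whole string, builds a list
via_search = re.search(pattern, text).group(1)  # stops at the first match


def failure_kind(func):
    """Return the name of the exception raised by func(), or None (illustrative helper)."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# With no match, the two idioms fail in different ways.
findall_failure = failure_kind(lambda: re.findall(pattern, "no data")[0])
search_failure = failure_kind(lambda: re.search(pattern, "no data").group(1))
```

Neither form handles the missing-match case by itself, which is consistent with the observation that most of the collected code does not handle it either.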

Serhiy Storchaka wrote:
Thank you Kyle for your investigation!
No problem, this seemed like an interesting feature proposal and I was personally curious about the potential use cases. Thanks for the detailed analysis, I learned a few new things from it. (: Serhiy Storchaka wrote:
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
This pattern seems to be common across most of the above examples (minus the last two), specifically replacing ``re.findall()[0]`` with ``re.search().group(1)`` when there are subgroups within the regex or ``re.search().group()`` without subgroups. Serhiy Storchaka wrote:
It seems that in most cases the author just does not know about re.search(). Adding re.findfirst() will not fix this.
That's definitely possible, but it might be just as likely that they saw re.findall() as being simpler to use than re.search(). Although it has substantially worse performance when parsing decent amounts of text (assuming the first match isn't at the end), ``re.findall()[0]`` *consistently* returns the first string that was matched, as long as no subgroups were used. This allows them to circumvent the usage of match objects entirely, which makes it a bit easier to learn, especially for those who are less familiar with OOP or are already familiar with other popular flavors of regex (such as JS).
I'll admit this is mostly speculation, but I think there's an especially large number of re users (compared to other modules) that aren't necessarily developers, and might just be someone who wants to write a script to quickly parse some documents. These types of users are the ones who would likely benefit the most from the proposed re.findfirst(), particularly if it directly returns a string as Guido is suggesting.
I think at the end of the day, the critical question to answer is this: *Do we want to add a new helper function that's easy to use, consistent, and provides good performance for finding the first match, even if the functionality already exists within the module?* Personally, I lean a bit more towards "yes", but I think that "no" would also be a reasonable response. From my perspective, a significant reason why Python is appealing to so many users that aren't professional developers is that it's much easier to pick up the basics. Python allows users to write a quick script with *decent* performance without having to learn too much, compared to most other mainstream programming languages. IMO, the addition of an re.findfirst() helps to reinforce that reason.
Another option to consider might be adding a boolean parameter to re.search() that changes the behavior to directly return a string instead of a match object, similar to re.findall() when there are not multiple subgroups. For example:
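A hypothetical sketch of what such a call might look like. The ``match_obj`` parameter does not exist in ``re.search()``, so it is emulated here with a wrapper function; the wrapper name ``search`` and its exact return rules are assumptions for illustration only:

```python
import re


def search(pattern, string, flags=0, match_obj=True):
    """Hypothetical wrapper: emulate a ``match_obj`` flag on re.search()."""
    m = re.search(pattern, string, flags)
    if match_obj or m is None:
        # Default behaviour is unchanged: a Match object, or None on no match.
        return m
    groups = m.groups()
    if not groups:
        return m.group(0)  # no capture groups: the whole match
    if len(groups) == 1:
        return groups[0]   # one capture group: that group's text
    return groups          # several groups: a tuple, as in findall()


first = search(r"\d+", "order 12, order 34", match_obj=False)
```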
The above would have the same exact return value as ``pattern.findall()[0]``, but it's more efficient since it would only parse the text until the first match is found, and it doesn't need to create a list. For backwards compatibility, this parameter would default to True. Feel free to change the name if you like the idea, "match_obj" was simply the first one that came to my head. The downsides of this solution are that it might overload re.search() excessively, and that it might not be as noticeable or easy to find as a new function. But it could provide the same functionality as the proposed re.findfirst(), without adding an entirely new function for behavior that already exists.
On Fri, Dec 6, 2019 at 2:47 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

06.12.19 23:20, Kyle Stanley wrote:
My concern is that this will add complexity to the module documentation, which is already too complex. re.findfirst() has more complex semantics (if there are no capture groups it returns this, if there is one capture group it returns that, and in other cases it returns something of a different type) than re.search(), which just returns a match object or None. This will increase the chance that users miss the appropriate function and use suboptimal code like findall()[0]. re.finditer() is a more modern and powerful function than re.findall(). The latter may even be deprecated in the future. In the future we may add a few more functions/methods: re.rmatch() (like re.match(), but matches at the end of the string instead of the start), re.rsearch() (searches from the end), re.rfinditer() (iterates in reverse order). Unlike findfirst(), they would implement features that cannot be easily expressed using existing functions.
Oh, no, this is the worst idea!

Serhiy Storchaka wrote:
re.finditer() is a more modern and powerful function than re.findall(). The latter may even be deprecated in the future.
Hmm, perhaps another consideration then would be to think of improvements to make to the existing documentation, particularly with including some code examples or expanding upon the docs for re.finditer() to make its usage more clear. Personally, it took me quite a while to understand its role in the module (as someone who does not use it on a frequent basis). Code examples should of course be used sparingly, but I think re.finditer() could benefit from at least one. Especially considering that far less complex functions in the module have several examples. See https://docs.python.org/3.8/library/re.html#re.finditer. Serhiy Storchaka wrote:
Oh, no, this is the worst idea!
Yeah, after having some time to reflect on that idea a bit more I don't think it would work. That would just end up adding confusion to re.search(), ultimately defeating the purpose of the parameter in the first place. It would be too drastic of a change in behavior for a single parameter to make. Thanks for the honesty though, not all of my ideas are good ones. But, if I can come up with something half-decent every once in a while I think it's worth throwing them out there. (: On Sat, Dec 7, 2019 at 2:56 AM Serhiy Storchaka <storchaka@gmail.com> wrote:

Code examples should of course be used sparingly, but I think re.finditer() could benefit from at least one
Clarification: I see that there's an example of it being used in https://docs.python.org/3.8/library/re.html#finding-all-adverbs-and-their-po... and one more complex example with https://docs.python.org/3.8/library/re.html#writing-a-tokenizer. I was specifically referring to including a basic example directly within https://docs.python.org/3.8/library/re.html#re.finditer, similar to the section for https://docs.python.org/3.8/library/re.html#re.split or https://docs.python.org/3.8/library/re.html#re.sub. Alternatively: creating a new section under https://docs.python.org/3.8/library/re.html#regular-expression-examples, titled "Finding the first match", where it briefly explains the difference in behavior between using re.findall()[0] and re.finditer().group(1) (or re.finditer.group() when there's not a subgroup). Based on the discussions in this thread and code examples, this seems to be rather commonly misunderstood. On Sat, Dec 7, 2019 at 7:29 AM Kyle Stanley <aeros167@gmail.com> wrote:

On Dec 7, 2019, at 04:51, Kyle Stanley <aeros167@gmail.com> wrote: Alternatively: creating a new section under https://docs.python.org/3.8/library/re.html#regular-expression-examples, titled "Finding the first match", where it briefly explains the difference in behavior between using re.findall()[0] and re.finditer().group(1) (or re.finditer.group() when there's not a subgroup).
Hold on, what is finditer().group(1) supposed to mean here? You’d need next(finditer()).group(1) or next(m.group(1) for m in finditer()) or something. But if you just want the first match, why are you using either findall or finditer instead of just search? Isn’t that exactly the confusion this thread was hoping to resolve, rather than forcing even more novices to deal with it by pushing them into it in a section named “Finding the first match”? Also (when there are subgroups), surely the relevant difference is either between findall()[0][0] and next(finditer()).group(1), which both return the first group of the first match, or between findall()[0] and next(finditer()).groups(), which both return a tuple of groups of the first match, not between findall()[0] and next(finditer()).group(1), which return a tuple vs. just the first one?
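The equivalences described above can be checked with a short sketch; the sample text and pattern are invented for illustration, and `re.search()` is included alongside since it gives the same first match.

```python
import re

text = "Smith John, Doe Jane"
pattern = r"(\w+) (\w+)"

# Tuple of all groups of the first match.
groups_findall = re.findall(pattern, text)[0]
groups_finditer = next(re.finditer(pattern, text)).groups()
groups_search = re.search(pattern, text).groups()

# First group of the first match.
first_findall = re.findall(pattern, text)[0][0]
first_finditer = next(re.finditer(pattern, text)).group(1)
first_search = re.search(pattern, text).group(1)
```

Comparing `findall()[0]` with `.group(1)` would mix the two rows: the former is a tuple when there are several groups, the latter a single string.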

That was a mistake, I intended to write re.search().group(1), not re.finditer().group(1). I clarified this in another reply to the thread about an hour ago. Sorry for the confusion, I wrote the reply after being up for a while and got re.finditer() and re.search() mixed up in my head. You're correct.
On Sat, Dec 7, 2019 at 7:57 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Thu, Dec 5, 2019 at 6:16 PM Juancarlo Añez <apalala@gmail.com> wrote:
Um, finditer() returns a Match object, and IIUC findfirst() should return a string, or a tuple of groups if there's more than one group. So the actual implementation would be a bit more involved. Something like this, to match findall() better:
    for match in re.finditer(pattern, text, flags=flags):
        # Only act on first match
        groups = match.groups()
        if not groups:
            return match.group(0)  # Whole match
        if len(groups) == 1:
            return groups[0]  # One match
        return groups
    # No match, use default
    return default
Alternatively, replace the first line with this:
    match = re.search(pattern, text, flags=flags)
    if match is not None:
(There are apparently subtle differences between re.search() and re.findall() -- not sure if they matter in this case.)
And if the point of proposing first is that novices will figure out how to write first(findall(…)) so we don’t need to add findfirst, then I think we need findfirst even more, because novices shouldn’t learn that bad idea.
Yes, my point exactly.
I posted another thread to argue in favor of *first()*, independently of *findfirst().*
Also agreed, I've observed that as a common pattern. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

06.12.19 04:31, Guido van Rossum wrote:
(There are apparently subtle differences between re.search() and re.findall() -- not sure if they matter in this case.)
There are no differences. Also, analyzing examples from GitHub, in most cases the pattern contains no group or a single group, so the code can be written as (if there are no groups)
    result = (re.search(pattern, string) or [default])[0]
or (if there is a single group)
    result = (re.search(pattern, string) or ['', default])[1]
And since most code does not handle the case when the pattern is not found anyway, it can be simplified even more.
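A minimal sketch of the two one-liners above wrapped as functions, to show how the fallback list lines up with the index. The function names `first_match` and `first_group` are hypothetical; the idiom relies on Match objects being truthy and subscriptable (Python 3.6+).

```python
import re


def first_match(pattern, string, default=None):
    """Whole first match, or default (for patterns without capture groups)."""
    # On a match, m[0] is m.group(0); on no match, the list [default] is indexed.
    return (re.search(pattern, string) or [default])[0]


def first_group(pattern, string, default=None):
    """First capture group of the first match, or default."""
    # Index 1 is m.group(1) on a match; '' only pads the fallback list.
    return (re.search(pattern, string) or ['', default])[1]
```

For example, `first_match(r"\d+", "a 12 b 34")` gives the string `"12"`, and `first_group(r"id=(\d+)", "x", default="0")` falls back to `"0"`.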

My glitch. In my mind *finditer()* returned what *findall()* does, but it returns *Match* objects. The implementation based on *search()* seems appropriate. I just looked in *_sre.c*, and *findall()* uses *search()* and is quite optimized. It seems that a good implementation would be to write a *findalliter()* using the current *findall()* code, and implement *findall()* and *findfirst()* by calling that.
On Thu, Dec 5, 2019 at 10:31 PM Guido van Rossum <guido@python.org> wrote:
-- Juancarlo *Añez*
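A rough pure-Python sketch of the *findalliter()* layering suggested above. It is built on `re.finditer()` rather than on the C code in *_sre.c*, so it only illustrates the shape of the idea; the three function names and the `default` parameter are assumptions, not an existing API.

```python
import re


def findalliter(pattern, string, flags=0):
    """Lazily yield the items findall() would return (hypothetical helper)."""
    for m in re.finditer(pattern, string, flags):
        # findall() reports groups that did not participate as '', not None.
        groups = m.groups('')
        if not groups:
            yield m.group(0)   # no capture groups: the whole match
        elif len(groups) == 1:
            yield groups[0]    # one capture group: that group's text
        else:
            yield groups       # several groups: a tuple of strings


def findall(pattern, string, flags=0):
    """List form, built on the iterator."""
    return list(findalliter(pattern, string, flags))


def findfirst(pattern, string, flags=0, default=None):
    """First item only; the iterator stops scanning after the first match."""
    return next(findalliter(pattern, string, flags), default)
```

With this layering, `findfirst()` returns a string or tuple with the same shape as `findall()[0]`, without building the full list.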

On Thu, Dec 5, 2019, at 12:25, Andrew Barnert via Python-ideas wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?

On Dec 6, 2019, at 09:51, Random832 <random832@fastmail.com> wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?
That’s a clever idea, and it might work. For iteration, the only question is what it returns when there’s only one capture group. If you do that with the findall entries you’ll get a tuple of the characters in the string, rather than a single-element tuple. I don’t think that’s behavior anyone would actually want for tuple(match) if we were designing the whole re module API from scratch. But would it be too inconsistent if you didn’t do it that way? For string, str(match) already works, and sometimes provides useful debugging info. At the REPL this is probably no big deal (it’s easier to dump the repr than the str anyway), but what about logs? For example, I’ve got a parse error on a request, and my logs tell me the last successful match was <_sre.SRE_Match object; span=(21137, 21142), match='alpha'>, so I know to look around 21137 characters into the request to find the problem. After upgrading Python, the logs would just say alpha, which wouldn’t help me. I’d have to go change the code to log %r instead of %s (or, maybe, stop being so hacky and explicitly log the span and groups, and also log where the failed search started rather than guessing from the previous one, and make the parser give useful errors in the first place, etc.) before I could debug future requests. You’re not supposed to even rely on repr being consistent across Python implementations and versions, much less on str being developer- rather than user-friendly, but sometimes people do, and sometimes we all have to deal with their code. I don’t think this is a huge objection, but it is worth figuring out how often and how badly people would be affected.
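Both points above can be observed with current behavior; a small sketch with an invented sample string. No repr text is hard-coded, since its exact format is not guaranteed across versions.

```python
import re

text = "mail alpha@example.org"
pattern = r"(\w+)@"

m = re.search(pattern, text)

# With a single capture group, a findall() entry is a plain string...
entry = re.findall(pattern, text)[0]
# ...so tuple() splits it into characters instead of giving a 1-tuple.
chars = tuple(entry)
one_tuple = m.groups()

# Match defines no __str__ of its own, so str() falls back to the repr,
# which carries the span that is useful in logs.
str_is_repr = str(m) == repr(m)
span = m.span()
```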

On 2019-12-06 18:24, Andrew Barnert via Python-ideas wrote:
1. Match objects are also returned by re.match, and you wouldn't expect that to look for more matches.
2. What would tuple(re.search(...)) do? Wouldn't it do the same as tuple(re.findall(...))?
3. a, b = re.search(...) would fail if it didn't return exactly 2 matches, and it would keep looking after the second match for a third match because that's how assigning from an iterator currently works - it's iterated until it's exhausted.
For iteration, the only question is what it returns when there’s only one capture group. If you do that with the findall entries you’ll get a tuple of the characters in the string, rather than a single-element tuple. I don’t think that’s behavior anyone would actually want for tuple(match) if we were designing the whole re module API from scratch. But would it be too inconsistent if you didn’t do it that way?
For string, str(match) already works, and sometimes provides useful debugging info. At the REPL this is probably no big deal (it’s easier to dump the repr than the str anyway), but what about logs? For example, I’ve got a parse error on a request, and my logs tell me the last successful match was <_sre.SRE_Match object; span=(21137, 21142), match='alpha'>, so I know to look around 21137 characters into the request to find the problem. After upgrading Python, the logs would just say alpha, which wouldn’t help me. I’d have to go change the code to log %r instead of %s (or, maybe, stop being so hacky and explicitly log the span and groups, and also log where the failed search started rather than guessing from the previous one, and make the parser give useful errors in the first place, etc.) before I could debug future requests. You’re not supposed to even rely on repr being consistent across Python implementations and versions, much less on str being developer- rather than user-friendly, but sometimes people do, and sometimes we all have to deal with their code. I don’t think this is a huge objection, but it is worth figuring out how often and how badly people would be affected.

On Fri, Dec 6, 2019, at 14:50, MRAB wrote:
I'm not sure what you meant by looking for more matches, though I suspect it's because, as below, I wasn't clear with what I meant by iterable semantics.
2. What would tuple(re.search(...)) do? Wouldn't it do the same as tuple(re.findall(...))?
I intended the tuple [well, the iterable semantics that would allow the tuple call to succeed] to return m.groups(), i.e. the same tuple as re.findall()[0] does when the re contains capturing groups. Sorry for not making that clear enough.
3. a, b = re.search(...) would fail if it didn't return exactly 2 matches
2 capturing groups, not 2 matches. Again, sorry for not making that explicitly clear.
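A sketch of the semantics as clarified here, using a wrapper class because real Match objects have neither behavior. The class name `GroupsMatch` is hypothetical and exists only to make the proposal concrete: iteration yields the capturing groups, and `str()` gives the whole match.

```python
import re


class GroupsMatch:
    """Hypothetical wrapper; not part of the re module."""

    def __init__(self, match):
        self._match = match

    def __iter__(self):
        # Iterating yields the capturing groups, i.e. m.groups().
        return iter(self._match.groups())

    def __str__(self):
        # The string form is the whole match, i.e. m.group(0).
        return self._match.group(0)

    def __getitem__(self, key):
        # Subscripting keeps the existing Match behavior: m[0] is group(0).
        return self._match[key]


m = GroupsMatch(re.search(r"(\w+)=(\d+)", "size=42"))
key, value = m          # unpacks the 2 capturing groups
as_tuple = tuple(m)
as_text = str(m)
```

Note that `m[0]` is the whole match while `tuple(m)[0]` is the first group, so iteration and subscripting disagree under this design.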

The semantics are not the same as those of: re.findall(...)[0] Subscripting a Match object will yield the match for a single group, which is always a string, while the first element in the list returned by *findall()* will be a tuple if several groups matched. As others have pointed out, there is an asymmetry in the library regarding Match-return and string/tuple-return functions, and that leads to *findall(...)[0]*.
-- Juancarlo *Añez*

06.12.19 19:49, Random832 wrote:
If match objects are too hard to use, maybe they should be made more user-friendly? What about adding str and iterable semantics to match objects so it can be used as str(re.search(...)); tuple(re.search(...)); a, b = re.search(...)?
What are the semantics of these operations?

06.12.19 23:21, Random832 wrote:
This is incompatible with subscripting. match[0] returns match.group(0), not match.groups()[0]. This idea was already discussed and was rejected as ambiguous. https://bugs.python.org/issue9529
def __str__(self): return self.group(0)
If you can use group() and groups(), which return exactly what you need, why do you want to use str() and tuple()?

On Sat, Dec 7, 2019, at 01:43, Serhiy Storchaka wrote:
This is incompatible with subscripting. match[0] returns match.group(0), not match.groups()[0].
And dict[0] returns the value whose key is 0, not the first key of the dictionary. set[0] does not work at all. There is no general guarantee of consistency between iteration and subscripting.
Recall that this thread is about proposing a new redundant method for regexes, on the apparent theory that match objects are too hard to use and so people need a method that just returns a string or a tuple instead of a match object. I could just as well ask, if you can use re.search(...).group(0) why do you want re.findfirst(...)?
participants (8)
- Andrew Barnert
- Guido van Rossum
- Juancarlo Añez
- Kyle Stanley
- MRAB
- Random832
- Sebastian Kreft
- Serhiy Storchaka