Serhiy Storchaka wrote:
Thank you Kyle for your investigation!
No problem, this seemed like an interesting feature proposal and I was personally curious about the potential use cases. Thanks for the detailed analysis, I learned a few new things from it. (:

Serhiy Storchaka wrote:
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
This pattern seems to be common across most of the above examples (minus the last two), specifically replacing ``re.findall()[0]`` with ``re.search().group(1)`` when there are subgroups within the regex or ``re.search().group()`` without subgroups.

Serhiy Storchaka wrote:
It seems that in most cases the author just do not know about re.search(). Adding re.findfirst() will not fix this.
That's definitely possible, but it might be just as likely that they saw re.findall() as simpler to use than re.search(). Although it performs substantially worse when parsing decent amounts of text (assuming the first match isn't at the end), ``re.findall()[0]`` *consistently* returns the first string that was matched, as long as no subgroups were used. This lets them avoid match objects entirely, which makes it a bit easier to learn, especially for those who are less familiar with OOP or are already familiar with other popular flavors of regex (such as JS).

I'll admit this is mostly speculation, but I think there's an especially large number of re users (compared to other modules) who aren't necessarily developers, and might just be someone who wants to write a script to quickly parse some documents. These types of users are the ones who would likely benefit the most from the proposed re.findfirst(), particularly if it directly returns a string as Guido is suggesting.

I think at the end of the day, the critical question to answer is this:

*Do we want to add a new helper function that's easy to use, consistent, and provides good performance for finding the first match, even if the functionality already exists within the module?*

Personally, I lean a bit more towards "yes", but I think that "no" would also be a reasonable response. From my perspective, a significant reason why Python is appealing to so many users who aren't professional developers is that it's much easier to pick up the basics. Python allows users to write a quick script with *decent* performance without having to learn too much, compared to most other mainstream programming languages. IMO, the addition of re.findfirst() helps to reinforce that reason.
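To make the comparison above concrete, here is a minimal sketch of how ``re.findall()[0]`` lines up with re.search(). The sample text and both patterns are made up for illustration and are not taken from any of the linked projects.

```python
import re

# Made-up sample text, used only for illustration.
text = "clerk: Jane Doe, clerk: John Roe"

# No capturing groups: findall() yields whole matches, so [0] equals
# search().group().
no_group = re.compile(r"[A-Z]\w+ [A-Z]\w+")
first_via_findall = no_group.findall(text)[0]
first_via_search = no_group.search(text).group()

# One capturing group: findall() yields the group's text, so [0] equals
# search().group(1).
one_group = re.compile(r"clerk: (\w+ \w+)")
group_via_findall = one_group.findall(text)[0]
group_via_search = one_group.search(text).group(1)

# With no match the two spellings fail differently: findall()[0] raises
# IndexError, while search() returns None.
no_match = one_group.search("nothing here")
```

In both cases search() stops at the first match, whereas findall() scans the whole string and builds a list before the ``[0]`` is applied.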
Another option to consider might be adding a boolean parameter to re.search() that changes the behavior to directly return a string instead of a match object, similar to re.findall() when there are not multiple subgroups. For example:
re.search(r" (\w) ", "there is a one letter word in the middle", match_obj=False)
'a'
The above would have the same exact return value as
``pattern.findall()[0]``, but it's more efficient since it would only parse
the text until the first match is found, and it doesn't need to create a
list. For backwards compatibility, this parameter would default to True.
Feel free to change the name if you like the idea, "match_obj" was simply
the first one that came to my head.
The cons of this solution are that it might excessively overload
re.search(), and that it might not be as noticeable or easy to find as the
addition of a new function. But, it could provide the same functionality as
the proposed re.findfirst(), without adding an entirely new function for
behavior that already exists.
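A minimal sketch of the behavior described above, written as a standalone function instead of a new re.search() parameter. The name ``findfirst``, its signature, and the ``default`` argument are assumptions for illustration only; nothing like this exists in the re module today. The return value follows the same convention as re.findall(): the whole match with no groups, the group's text with one group, and a tuple with several.

```python
import re


def findfirst(pattern, string, flags=0, default=None):
    """Hypothetical helper: return what findall()[0] would, or `default`."""
    m = re.search(pattern, string, flags)
    if m is None:
        # No match: return the caller's default instead of raising.
        return default
    group_count = m.re.groups  # number of capturing groups in the pattern
    if group_count == 0:
        return m.group()
    # findall() reports groups that did not participate as '', so do the same.
    groups = m.groups(default="")
    return groups[0] if group_count == 1 else groups
```

For the example text above, ``findfirst(r" (\w) ", "there is a one letter word in the middle")`` returns the same string as ``re.findall(...)[0]``, but stops scanning at the first match and never builds a list.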
On Fri, Dec 6, 2019 at 2:47 AM Serhiy Storchaka wrote:
05.12.19 23:47, Kyle Stanley wrote:
Serhiy Storchaka wrote:
We still do not know a use case for findfirst. If the OP would show his code and several examples in others' code this could be an argument for usefulness of this feature.
I'm not sure about the OP's exact use case, but using GitHub's code search for .py files that match with "first re.findall" shows a decent amount of code that uses the format ``re.findall()[0]``. It would be nice if GitHub's search properly supported symbols and regular expressions, but this presents a decent number of examples. See https://github.com/search?l=Python&q=first+re.findall&type=Code.
I also spent some time looking for a few specific examples, since there were a number of false positives in the above results. Note that I didn't look much into the actual purpose of the code or judge it based on quality, I was just looking for anything that seemed remotely practical and contained something along the lines of ``re.findall()[0]``. Several of the links below contain multiple lines where findfirst would likely be a better alternative, but I only included one permalink per code file.
Thank you Kyle for your investigation!
https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393...
It is easy to rewrite it using re.search().
- input_processor=MapCompose(lambda x: re.findall(r'pointDRI = ([0-9]+)', x)[0], eval),
+ input_processor=MapCompose(lambda x: re.search(r'pointDRI = ([0-9]+)', x).group(1), eval),
I also wonder if it is worth replacing eval with the more efficient and safer int.
https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6...
It is the same code differently formatted.
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- clerk_name = name_re.findall(clerk)[0]
+ clerk_name = name_re.search(clerk).group(1)
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
- official_name = name_re.findall(town)[0].title()
+ official_name = name_re.search(town).group().title()
https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6...
- first_1_results = re.findall(first_1,all_list9)[0]
+ first_1_results = re.search(first_1,all_list9).group(1)
https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c...
It is a complex example which performs multiple searches with different regular expressions. It can all be replaced with a single, more efficient regular expression.
- if re.search('^(\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) & (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
- elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner )[0]
+ m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?', parcel.owner)
+ if m:
+     last, first, middle = m.groups()
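As a rough check of how the combined pattern above behaves, here is a small sketch run against made-up owner strings (the names and the ``split_owner`` helper are hypothetical, not from the linked project). It shows that the optional middle group comes back as None when it does not participate, so the three-way unpacking works for every matching input.

```python
import re

# The combined pattern from the rewrite above, written as a raw string.
OWNER_RE = re.compile(r"(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?")


def split_owner(owner):
    """Return (last, first, middle) or None; `middle` may be None."""
    m = OWNER_RE.fullmatch(owner)
    return m.groups() if m else None


# Hypothetical sample inputs, for illustration only.
two_names = split_owner("DOE JOHN")
three_names = split_owner("DOE JOHN Q")
with_co_owner = split_owner("DOE JOHN Q & JANE ROE")
unparsable = split_owner("DOE")
```

Note that this pattern expects a plain ``&`` before the co-owner, so strings using the ``&:`` separator seen in some of the original branches would not match it.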
https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e...
This is the only example which checks whether findall() returns an empty list. It calls findall() twice! Fortunately, it can be easily optimized using the fact that Match objects support subscripting. I used group() above because it is more explicit and works in older Python versions.
- self.item.first_tutor_name = REGPX_A.findall(value)[0] if REGPX_A.findall(value) else ''
+ self.item.first_tutor_name = (REGPX_A.search(value) or [''])[0]
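The ``or ['']`` idiom above can be sketched in isolation as follows. The pattern assigned to ``REGPX_A`` here is a made-up stand-in, since the real one is not shown in this thread, and ``first_or_empty`` is a hypothetical wrapper name. Subscripting a match object requires Python 3.6 or later.

```python
import re

# Hypothetical stand-in for the project's REGPX_A pattern (no groups).
REGPX_A = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")


def first_or_empty(value):
    # search() returns None on failure; `or ['']` swaps in a one-element
    # list so that [0] yields '' instead of raising TypeError.
    return (REGPX_A.search(value) or [""])[0]


found = first_or_empty("Tutor: Ada Lovelace, Alan Turing")
missing = first_or_empty("no tutor listed")
```

One caveat: ``m[0]`` is always the whole match, while ``findall()[0]`` returns the group's text when the pattern has exactly one capturing group, so the two spellings are only interchangeable for patterns without groups.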
It seems that in most cases the author just do not know about re.search(). Adding re.findfirst() will not fix this.