[Python-ideas] Re: Fwd: Re: Fwd: re.findfirst()

6 Dec 2019

      Serhiy Storchaka wrote:
...
Thank you Kyle for your investigation!
No problem, this seemed like an interesting feature proposal and I was
personally curious about the potential use cases. Thanks for the detailed
analysis, I learned a few new things from it. (:

Serhiy Storchaka wrote:
...
-       clerk_name = name_re.findall(clerk)[0]
+       clerk_name = name_re.search(clerk).group(1)
This pattern seems to be common across most of the above examples (minus
the last two), specifically replacing ``re.findall()[0]`` with
``re.search().group(1)`` when the regex contains a subgroup, or with
``re.search().group()`` when it has no subgroups.
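
To make the correspondence concrete, here is a small sketch. The sample
text and patterns are made up for illustration and are not taken from any
of the linked projects:

```python
import re

# Made-up sample input, purely for illustration.
text = "Clerk: Jane Doe, Deputy: John Roe"

# One capturing group: findall() returns the group's text for each match,
# so findall()[0] lines up with search().group(1).
name_re = re.compile(r"Clerk: (\w+ \w+)")
first_via_findall = name_re.findall(text)[0]
first_via_search = name_re.search(text).group(1)

# No capturing groups: findall() returns each whole match,
# so findall()[0] lines up with search().group().
word_re = re.compile(r"[A-Z]\w+")
whole_via_findall = word_re.findall(text)[0]
whole_via_search = word_re.search(text).group()

print(first_via_findall, first_via_search)
print(whole_via_findall, whole_via_search)
```

Both spellings give the same string here; the difference is that
``search()`` stops at the first match instead of collecting every match
into a list.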

Serhiy Storchaka wrote:
...
It seems that in most cases the authors just do not know about
re.search(). Adding re.findfirst() will not fix this.
That's definitely possible, but it might be just as likely that they saw
re.findall() as simpler to use than re.search(). Although it is
substantially slower when parsing decent amounts of text (assuming the
first match isn't at the end), ``re.findall()[0]`` *consistently* returns
the first string that was matched, as long as no subgroups were used. This
allows them to circumvent the usage of match objects entirely, which makes
it a bit easier to learn, especially for those who are less familiar with
OOP or who are already familiar with other popular flavors of regex (such
as JS).
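
For reference, a short sketch of how the shape of the re.findall() result
depends on the number of subgroups, and how the two approaches behave when
nothing matches. The input string is invented for the example:

```python
import re

# Invented input, purely for illustration.
text = "id=42 name=spam"

# No groups: a list of whole matches.
no_groups = re.findall(r"\w+=\w+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\w+)", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\w+)", text)

# When nothing matches, findall() gives an empty list (so [0] raises
# IndexError), while search() gives None (so .group() raises AttributeError).
no_match_list = re.findall(r"\d{5}", text)
no_match_obj = re.search(r"\d{5}", text)

print(no_groups, one_group, two_groups, no_match_list, no_match_obj)
```

So the "always a plain string" property of ``re.findall()[0]`` only holds
for patterns with at most one subgroup.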

I'll admit this is mostly speculation, but I think there's an especially
large number of re users (compared to other modules) that aren't
necessarily developers, and might just be someone who wants to write a
script to quickly parse some documents. These types of users are the ones
who would likely benefit the most from the proposed re.findfirst(),
particularly if it directly returns a string as Guido is suggesting.

I think at the end of the day, the critical question to answer is this:

*Do we want to add a new helper function that's easy to use, consistent,
and provides good performance for finding the first match, even if the
functionality already exists within the module?*

Personally, I lean a bit more towards "yes", but I think that "no" would
also be a reasonable response.

From my perspective, a significant reason why Python is appealing to so
many users who aren't professional developers is that it's much easier to
pick up the basics. Python allows users to write a quick script with
*decent* performance without having to learn too much, compared to most
other mainstream programming languages. IMO, the addition of an
re.findfirst() helps to reinforce that reason.

Another option to consider might be adding a boolean parameter to
re.search() that changes the behavior to directly return a string instead
of a match object, similar to re.findall() when there are not multiple
subgroups. For example:
...
...
...
re.search(" (\w) ", "there is a one letter word in the middle",
match_obj=False)
'a'
The above would have the same exact return value as
``pattern.findall()[0]``, but it's more efficient since it would only parse
the text until the first match is found, and it doesn't need to create a
list. For backwards compatibility, this parameter would default to True.
Feel free to change the name if you like the idea, "match_obj" was simply
the first one that came to my head.
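
To illustrate the idea without touching the re module, here is a rough
sketch of a wrapper that emulates the flag. The function name
``search_with_flag`` is hypothetical, and returning None when nothing
matches is my own assumption, since the behavior for that case isn't
settled above:

```python
import re

def search_with_flag(pattern, string, flags=0, match_obj=True):
    """Hypothetical wrapper emulating the suggested ``match_obj`` parameter."""
    m = re.search(pattern, string, flags)
    # Default behavior is unchanged; a failed search returns None either way
    # (assumption: the proposal does not specify the no-match result).
    if match_obj or m is None:
        return m
    # Mirror the shape of a findall() item: whole match without groups,
    # the group's text with one group, a tuple of groups otherwise.
    groups = m.groups(default="")
    if not groups:
        return m.group()
    if len(groups) == 1:
        return groups[0]
    return groups

sentence = "there is a one letter word in the middle"
result = search_with_flag(r" (\w) ", sentence, match_obj=False)
print(result)
```

With ``match_obj=False`` the call returns the same string as
``re.findall(r" (\w) ", sentence)[0]``, but the scan stops at the first
match.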

The cons of this solution are that it might excessively overload
re.search(), and that it might not be as noticeable or easy to find as a
new function. But it could provide the same functionality as the proposed
re.findfirst() without adding an entirely new function for behavior that
already exists.

On Fri, Dec 6, 2019 at 2:47 AM Serhiy Storchaka  wrote:
...
05.12.19 23:47, Kyle Stanley wrote:
...
Serhiy Storchaka wrote:
...
We still do not know a use case for findfirst. If the OP would show his
code and several examples in others' code this could be an argument for
usefulness of this feature.
I'm not sure about the OP's exact use case, but using GitHub's code
search for .py files that match with "first re.findall" shows a decent
amount of code that uses the format ``re.findall()[0]``. It would be
nice if GitHub's search properly supported symbols and regular
expressions, but this presents a decent number of examples. See
https://github.com/search?l=Python&q=first+re.findall&type=Code.
I also spent some time looking for a few specific examples, since there
were a number of false positives in the above results. Note that I
didn't look much into the actual purpose of the code or judge it based
on quality, I was just looking for anything that seemed remotely
practical and contained something along the lines of
``re.findall()[0]``. Several of the links below contain multiple lines
where findfirst would likely be a better alternative, but I only
included one permalink per code file.
Thank you Kyle for your investigation!
...
https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d393...
It is easy to rewrite it using re.search().
-         input_processor=MapCompose(lambda x: re.findall(r'pointDRI = ([0-9]+)', x)[0], eval),
+         input_processor=MapCompose(lambda x: re.search(r'pointDRI = ([0-9]+)', x).group(1), eval),
I also wonder if it is worth replacing eval with the more efficient and
safer int.
...
https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc6...
It is the same code differently formatted.
...
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
-       clerk_name = name_re.findall(clerk)[0]
+       clerk_name = name_re.search(clerk).group(1)
...
https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab...
-     official_name = name_re.findall(town)[0].title()
+     official_name = name_re.search(town).group().title()
...
https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6...
-             first_1_results = re.findall(first_1,all_list9)[0]
+             first_1_results = re.search(first_1,all_list9).group(1)
...
https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784c...
It is a complex example which performs multiple searches with different
regular expressions. It can all be replaced with a single, more
efficient regular expression.
-   if re.search('^(\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &: (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &: (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner     )[0]
+   m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?', parcel.owner)
+   if m:
+     last, first, middle = m.groups()
...
https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e...
This is the only example which checks if findall() returns an empty
list. It calls findall() twice! Fortunately it can be easily optimized
using the fact that Match objects support subscripting. I used group()
above because it is more explicit and works in older Python.
-             self.item.first_tutor_name = REGPX_A.findall(value)[0] if REGPX_A.findall(value) else ''
+             self.item.first_tutor_name = (REGPX_A.search(value) or [''])[0]
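
A minimal sketch of why the ``or ['']`` fallback works. The pattern below
is a stand-in, since the real REGPX_A is not shown here, and the helper
name ``first_or_empty`` is made up for the example:

```python
import re

# Stand-in pattern (assumption): the real REGPX_A is not shown in this thread.
REGPX_A = re.compile(r"[A-Z]\w+")

def first_or_empty(value):
    # search() returns either a Match (always truthy) or None (falsy),
    # so `or` substitutes [''] only when nothing matched.
    # Match objects support m[0] (the whole match) since Python 3.6.
    return (REGPX_A.search(value) or [''])[0]

found = first_or_empty("dr Smith and dr Jones")
missing = first_or_empty("nothing capitalized")
print(repr(found), repr(missing))
```

Note that ``m[0]`` is the whole match, so this lines up with
``findall()[0]`` only when the pattern has no subgroups; with one subgroup
the fallback would need a different shape to use ``[1]``.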
It seems that in most cases the authors just do not know about
re.search(). Adding re.findfirst() will not fix this.