New GitHub issue #122288 from picnixz:
# Feature or enhancement
### Proposal:
I implemented `fnmatch.translate` and `fnmatch.filter` in C in #121446, but the performance gains for `fnmatch.filter` are less pronounced than those for `fnmatch.translate`. While I believe the 2x improvement brought by the C implementation of `fnmatch.translate` is worthwhile, it comes at a high maintenance cost. Instead, I translated (no pun intended) the implementation into Python and obtained the following timings (PGO build):
```
+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.21 us               | 5.15 us: 1.20x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.46 us               | 5.32 us: 1.21x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.22 us               | 2.11 us: 1.05x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.02 us               | 1.60 us: 1.26x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.11 us               | 1.50 us: 2.08x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.32x faster          |
+------------------------------------------+-----------------------+-----------------------+
```
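(The table above appears to be `pyperf compare_to` output. For a rough reproduction without pyperf, a `timeit`-based sketch could look like the following; `translate` is a hypothetical stand-in for whichever implementation is being measured.)
```
import timeit

# Patterns taken from the benchmark table above.
PATTERNS = [
    "abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]",
    "!abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]",
    "a**?**cd**?**??k***",
    "a/**/b/**/c",
    "man/man1/bash.1",
]

def bench(translate, number=10_000):
    """Print the mean per-call time of `translate` on each pattern."""
    for pat in PATTERNS:
        t = timeit.timeit(lambda: translate(pat), number=number)
        print(f"{pat!r}: {t / number * 1e6:.2f} us")
```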
The timings above are far from the benchmarks reported in https://github.com/python/cpython/issues/121445, but I think this could still be considered. Note that the improvements come from a new (but equivalent) algorithm in the translation phase. Currently, the translation algorithm (sketched after this list):
- Does not compress consecutive `'*'` into one within the same loop iteration.
- Uses a sentinel object to mark a wildcard instead.
- Appends `re.escape(c)` to a list for every non-special character (i.e., for every character that does not start a `?`, `*` or `[` group).
- In a second phase, merges the groups delimited by the sentinels to produce the final regex.
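Here is a minimal, simplified sketch of that sentinel-based scheme; the real `fnmatch.translate` also parses `[...]` character classes and performs a more involved merge in the second phase:
```
import re

STAR = object()  # sentinel marking a '*' wildcard

def translate_sentinel(pat):
    # First phase: one list entry per character (or per wildcard sentinel).
    res = []
    for c in pat:
        if c == '*':
            # Consecutive '*' collapse because we check the previous entry,
            # not because the loop skips ahead.
            if not res or res[-1] is not STAR:
                res.append(STAR)
        elif c == '?':
            res.append('.')
        else:
            # One re.escape() call and one list entry per literal character.
            res.append(re.escape(c))
    # Second phase: replace sentinels and join the many small chunks.
    return r'(?s:%s)\Z' % ''.join('.*' if p is STAR else p for p in res)
```
For example, `translate_sentinel('a/**/b')` yields `(?s:a/.*/b)\Z`; note how a literal-heavy pattern produces one list entry per character.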
My implementation instead does the following (see the sketch after this list):
- Instead of producing many chunks of one or two characters, it builds as few chunks as possible, each holding an already-formatted string, e.g., `["abc", "*"]` instead of `["a", "b", "c", SENTINEL]`.
- It remembers the positions of `'*'` in the above split instead of using a sentinel object.

That way, parts can easily be merged during the second phase (technically, it would be possible to do it in one pass, but two passes make it clearer).
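A hypothetical sketch of this chunk-based strategy (again treating `[...]` groups as literals to keep it short, whereas the real code parses them): literal runs become a single pre-escaped chunk, and wildcard positions are recorded as plain indices.
```
import re

def translate_chunked(pat):
    parts = []         # already-formatted regex chunks, e.g. ["abc", "*"]
    star_indices = []  # indices in `parts` holding a '*' wildcard
    i, n = 0, len(pat)
    while i < n:
        # Greedily consume the longest run of ordinary characters so that
        # it becomes one chunk and one re.escape() call.
        j = i
        while j < n and pat[j] not in '*?[':
            j += 1
        if j > i:
            parts.append(re.escape(pat[i:j]))
            i = j
            continue
        c = pat[i]
        i += 1
        if c == '*':
            # Consecutive '*' collapse into a single recorded wildcard.
            if not star_indices or star_indices[-1] != len(parts) - 1:
                star_indices.append(len(parts))
                parts.append('*')
        elif c == '?':
            parts.append('.')
        else:
            # '[' starts a character class in real patterns; escaped as a
            # literal here for brevity.
            parts.append(re.escape(c))
    # Second phase: wildcard positions are known, so merging is direct.
    for k in star_indices:
        parts[k] = '.*'
    return r'(?s:%s)\Z' % ''.join(parts)
```
On `'man/man1/bash.1'` this produces a single escaped chunk instead of fifteen one-character entries, which is plausibly where most of the speed-up on literal-heavy patterns comes from.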
### Has this already been discussed elsewhere?
_No response_
### Links to previous discussion of this feature:
_No response_
View on GitHub: https://github.com/python/cpython/issues/122288
Labels: type-feature, performance