Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string.
Yes on lies, damn lies, and benchmarks.
On Fri, Jun 2, 2023, 7:29 PM Chris Angelico <rosuav@gmail.com> wrote:
On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own.
But as idle chat goes, note that in your code:
set(unicodedata.category(ch) for ch in s)
If `s` is a billion characters long, then we make a billion calls to the `.category()` method. Python calls are comparatively expensive, even on well-optimized data structures like strings.
In my version:
bool(set(s) & set(unicode_categories['Sc']))
The billion characters are first reduced to a smallish set of hundreds or thousands of distinct characters without needing method calls. Then that is intersected with a smallish set of characters in the category.
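As a minimal sketch of the intersection approach just described: `unicode_categories` is not something the standard library provides, so the version below assumes it is a dict built once by scanning every code point, and `has_category` is a hypothetical helper name used only for illustration.

```python
import sys
import unicodedata
from collections import defaultdict

# Assumption: unicode_categories maps a category name (e.g. 'Sc') to the
# set of all characters in that category. It is built once, up front.
unicode_categories = defaultdict(set)
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    unicode_categories[unicodedata.category(ch)].add(ch)


def has_category(s, cat):
    """Return True if any character of s belongs to Unicode category cat."""
    # set(s) collapses the string to its distinct characters without any
    # per-character Python-level call; the intersection then works on
    # two comparatively small sets.
    return bool(set(s) & unicode_categories[cat])
```

The one-time scan over all code points is the price of this approach; whether it pays off depends on how many strings are checked afterwards.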
You could optimize your version, however, simply by using:
set(unicodedata.category(set(ch)) for ch in s)
Or perhaps:
set(unicodedata.category(ch) for ch in set(s))
But measure before considering this worthwhile.
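One way to do that measuring, as a rough sketch with `timeit`. The function names and the test string are made up for illustration, and the timings will vary with the machine and with how repetitive the input is.

```python
import timeit
import unicodedata


def categories_all(s):
    # One category() call per character of s.
    return set(unicodedata.category(ch) for ch in s)


def categories_distinct(s):
    # One category() call per distinct character of s.
    return set(unicodedata.category(ch) for ch in set(s))


# Hypothetical test data: a long, highly repetitive string.
s = "The price is $5 or \u20ac4. " * 10_000

# Both versions must agree before their timings are worth comparing.
assert categories_all(s) == categories_distinct(s)

for fn in (categories_all, categories_distinct):
    elapsed = timeit.timeit(lambda: fn(s), number=3)
    print(f"{fn.__name__}: {elapsed:.3f}s for 3 runs")
```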
Yours provides more information, since it lists all the categories. But if you REALLY only care about one category, then you still have to ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`. Which is fine, that's not a hard question to ask.
If you REALLY want to just check whether any category is there, you probably want something like:
any(unicodedata.category(ch) == "Sc" for ch in s)
which is completely different from what you were suggesting, and still doesn't require the string of all codepoints in the category.
Point is, querying the string is almost always going to be more efficient than intersecting with the full gamut of that category.
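A small sketch of the `any()` form wrapped in a function. `contains_category` is a hypothetical name, not an existing API. Because `any()` stops at the first match, a string with an early hit is answered without scanning the rest.

```python
import unicodedata


def contains_category(s, cat):
    """Return True as soon as one character of s is in category cat."""
    # The generator is consumed lazily, so any() short-circuits on the
    # first matching character and never builds an intermediate set.
    return any(unicodedata.category(ch) == cat for ch in s)
```

In the worst case (no match at all) this still makes one `category()` call per character, which is the cost discussed earlier in the thread.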
ChrisA