Add a .whitespace property to module unicodedata

I suggest adding a simple str constant to the unicodedata module that mirrors string.whitespace. It would contain all characters matched by the CPython function [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314), so that:

    # existing
    string.whitespace = ' \t\n\r\x0b\x0c'

    # proposed
    unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
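For illustration, a string like the proposed one can be generated at runtime. This is only a sketch: it assumes that `str.isspace()` reflects the same whitespace table as the C function linked above, and the names `unicode_whitespace` and `UNICODE_WHITESPACE` are hypothetical, not part of any module.

```python
import sys


def unicode_whitespace() -> str:
    """Collect every code point for which str.isspace() is true, in code point order."""
    return "".join(
        chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
    )


# Hypothetical stand-in for the proposed unicodedata.whitespace constant.
UNICODE_WHITESPACE = unicode_whitespace()
```

The exact contents depend on the Unicode version bundled with the running interpreter, which is part of what the rest of the thread debates.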

It feels to me like "split on whitespace" or "remove whitespace" are quite common operations. I've been frustrated a number of times by settling for the ASCII whitespace class when I really wanted the Unicode whitespace class.

On Thu, Jun 1, 2023 at 12:20 PM Paul Moore <p.f.moore@gmail.com> wrote:
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.

OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?

On Thu, Jun 1, 2023 at 1:08 PM Chris Angelico <rosuav@gmail.com> wrote:

On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?
❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd')"
1000000 loops, best of 5: 262 nsec per loop

Paul
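The three operations mentioned so far ("has", "split on", and "remove" whitespace) can all be written without any table of characters. A minimal sketch, with hypothetical helper names; for `str` patterns in Python 3, `\s` already matches Unicode whitespace, so the `re.U` flag is optional:

```python
import re

# For str patterns, \s matches Unicode whitespace by default in Python 3.
_WHITESPACE = re.compile(r"\s")


def has_whitespace(s: str) -> bool:
    """True if the string contains any whitespace, ASCII or not."""
    return _WHITESPACE.search(s) is not None


def split_on_whitespace(s: str) -> list:
    """str.split() with no argument splits on runs of Unicode whitespace."""
    return s.split()


def remove_whitespace(s: str) -> str:
    """Drop all whitespace by splitting and re-joining."""
    return "".join(s.split())
```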

On 01.06.2023 18:18, Paul Moore wrote:
Same here. For those few cases where it might be useful, you can easily put the string into your application code. Putting this into the stdlib would just mean that we'd have to recheck whether new Unicode whitespace chars were added every time the standard is updated. With ASCII, this won't happen in the foreseeable future ;-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jun 01 2023)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

I guess this is pretty general for the described need:

    %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

It's milliseconds, not nanoseconds, but presumably it is something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this?

This definitely feels better than making a static sequence of characters, since the Unicode Consortium may change (and has changed) the definition. In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was removed from the whitespace category to which it previously belonged. I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good.

On Thu, Jun 1, 2023 at 1:29 PM Marc-Andre Lemburg <mal@egenix.com> wrote:

On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote:
I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good.
I think it's because ZERO WIDTH NO-BREAK SPACE (aka the byte order mark, BOM) doesn't act like a "space" character. If used as the BOM, it is intended to be stripped out when read, and the UTF-16/UTF-32 data that follows it is typically just read with its byte order corrected as the mark indicates. If used elsewhere as the ZWNBSP (a use which has been deprecated and replaced with U+2060), then its use is intentionally "no-break", so it is not a space to separate on.

-- Richard Damon
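The classification of these characters can be checked directly with `unicodedata`. A small sketch; `describe` is a hypothetical helper, and the result for U+180E depends on the Unicode version bundled with the interpreter, so it is printed rather than asserted:

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (general category, str.isspace() result) for one character."""
    return unicodedata.category(ch), ch.isspace()


# U+FEFF and U+2060 are format characters (Cf), not space separators (Zs).
for ch in ("\ufeff", "\u2060", "\u180e", "\u2002"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), *describe(ch))
```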

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
I guess this is pretty general for the described need:
%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]
Use sys.maxunicode instead of 0xFFFF
There isn't a more efficient way. You essentially have to scan the entire database for whitespace-like chars.
Which was my point: including the above in a stdlib module wouldn't make sense, since it increases module load time (and possibly startup time), so it's better to generate a string and put this verbatim into the application.

However, this would have to be part of the Unicode database update dance, and whitespace is only one possible category of chars which would be interesting. Digits or numbers are another; letters, line breaks, symbols, etc. are others:

https://www.unicode.org/reports/tr44/#GC_Values_Table

It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI.

-- Marc-Andre Lemburg
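One detail worth noting about the full scan: filtering on the "Zs" category is not the same as collecting everything `str.isspace()` accepts. The sketch below scans up to `sys.maxunicode` as suggested and compares the two sets; the variable names are illustrative only.

```python
import sys
import unicodedata

space_separators = set()  # general category Zs only
isspace_chars = set()     # everything str.isspace() accepts

# One pass over every code point, including the last one (sys.maxunicode).
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if unicodedata.category(ch) == "Zs":
        space_separators.add(ch)
    if ch.isspace():
        isspace_chars.add(ch)

# Characters such as tab, newline, U+2028 and U+2029 are whitespace for
# str.isspace() but belong to categories other than Zs.
only_isspace = sorted(isspace_chars - space_separators)
```

So a "Zs" filter yields the space separators, while the originally proposed string corresponds to the broader `isspace()` set.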

If we're talking PyPI, it would be nice to have:

    unicode_categories = {"Zs": [...], "Ll": [...], ...}

for all the various categories. It would just take one pass through all the characters to generate it, but then every category would be fast to access later.

On the other hand, it's a few lines of code with a lazy import. Probably not enough code to put on PyPI.

On Fri, Jun 2, 2023 at 4:32 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
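A minimal sketch of such a mapping, built lazily in one pass. The function name is hypothetical, and skipping the unassigned (Cn), private-use (Co) and surrogate (Cs) code points is an assumption made here only to keep the table small; it was not part of the suggestion above.

```python
import sys
import unicodedata
from collections import defaultdict
from functools import lru_cache

# Assumption: these bulk categories are rarely useful and are left out.
_SKIPPED = {"Cn", "Co", "Cs"}


@lru_cache(maxsize=None)
def unicode_categories() -> dict:
    """Map each general category to a string of its characters (built on first call)."""
    table = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat not in _SKIPPED:
            table[cat].append(ch)
    return {cat: "".join(chars) for cat, chars in table.items()}
```

The `lru_cache` decorator makes the scan happen only on the first call, so importing the code costs nothing until a category is actually requested.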

    def does_string_have_currency_mark(s):
        return bool(set(s) & set(unicode_categories['Sc']))

    def does_string_have_numeric_digit(s):
        ...

... and so on. Those seem like questions one asks often enough. Not every day, but more than never.

On Fri, Jun 2, 2023 at 4:59 PM Chris Angelico <rosuav@gmail.com> wrote:

On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
These questions are much better answered with the unicodedata.category() function. First figure out what categories your string has:

    cats = set(unicodedata.category(ch) for ch in s)

And then check whether Sc is in that set, or whatever others you care about. This way, the set contains only the categories, not the characters; there's no reason to do set intersection with all of the characters.

ChrisA
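Wrapped in a function, the approach just described might look like the following sketch (`categories_in` is a hypothetical name):

```python
import unicodedata


def categories_in(s: str) -> set:
    """Return the set of general categories that occur in the string."""
    return {unicodedata.category(ch) for ch in s}


# Membership tests then replace the per-category helper functions.
has_currency = "Sc" in categories_in("price: 5€")
```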

On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
Sure. That's fine. With a sufficiently long string my code is faster, but for "typical" strings yours will be.
Really? How? Your code has to build a set of every character in the string; mine builds a set of every category in the string. Set intersection won't be slower for a smaller set.

ChrisA

This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own.

But as idle chat goes, note that in your code:

    set(unicodedata.category(ch) for ch in s)

If `s` is a billion characters long, then we make a billion calls to the `.category()` function. Python calls are comparatively expensive, even on well-optimized data structures like strings.

In my version:

    bool(set(s) & set(unicode_categories['Sc']))

The billion characters are first reduced to a smallish set of hundreds or thousands of distinct characters without needing function calls. Then that is intersected with a smallish set of characters in the category.

You could optimize your version, however, simply by using:

    set(unicodedata.category(set(ch)) for ch in s)

Yours provides more information, since it lists all the categories. But if you REALLY only care about one category, then you still have to ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`. Which is fine, that's not a hard question to ask.

On Fri, Jun 2, 2023 at 5:36 PM Chris Angelico <rosuav@gmail.com> wrote:

On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
Or perhaps:

    set(unicodedata.category(ch) for ch in set(s))

But measure before considering this worthwhile.

If you REALLY want to just check whether any category is there, you probably want something like:

    any(unicodedata.category(ch) == "Sc" for ch in s)

which is completely different from what you were suggesting, and still doesn't require the string of all codepoints in the category. Point is, querying the string is almost always going to be more efficient than intersecting with the full gamut of that category.

ChrisA
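Both variants discussed here fit in a couple of lines each. A sketch with hypothetical function names; which one is faster depends on the input, as the thread goes on to note:

```python
import unicodedata


def has_category(s: str, cat: str) -> bool:
    """Short-circuits on the first character found in the given category."""
    return any(unicodedata.category(ch) == cat for ch in s)


def has_category_dedup(s: str, cat: str) -> bool:
    """Deduplicates the characters first, then queries each distinct one."""
    return any(unicodedata.category(ch) == cat for ch in set(s))
```

The first form can stop early on a match near the start of a long string; the second always pays for building `set(s)` but then makes at most one `category()` call per distinct character.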

On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string.
Yep. I thought of that as I was originally writing, but absent benchmarking data, I prefer the simplest way of writing something. ChrisA

Let's call the styles a tie. Using the SOWPODS scrabble wordlist (no currency symbols, so False answer):
Of course, this is a small character set of 26 lowercase letters (and newline as I did it). A more diverse alphabet might tip the timing slightly, but it's going to be a small matter either way.

On Fri, Jun 2, 2023 at 7:49 PM Chris Angelico <rosuav@gmail.com> wrote:

On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
Remember though, the original request was not for a set, but for a string. Try your timing again when working with a string. The any() form is almost certainly the most effective, although I suppose it could be implemented in C for better performance (avoiding calling back into Python repeatedly). Not sure it's necessary though. ChrisA
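For anyone who wants to repeat the comparison, including the string-based variant, a benchmark could be sketched as below. The sample text is a synthetic stand-in (the SOWPODS list is not assumed to be available), the function names are hypothetical, and no timings are claimed here since they vary by machine and input.

```python
import sys
import timeit
import unicodedata

# Built once by scanning every code point; this is the startup cost discussed above.
SC_CHARS = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)


def via_any(s):
    return any(unicodedata.category(ch) == "Sc" for ch in s)


def via_set(s):
    return "Sc" in {unicodedata.category(ch) for ch in set(s)}


def via_string(s):
    # Membership test against a plain string of currency symbols.
    return any(ch in SC_CHARS for ch in set(s))


def compare(s, number=5, repeat=3):
    """Return the best observed time in seconds for each approach."""
    results = {}
    for func in (via_any, via_set, via_string):
        results[func.__name__] = min(
            timeit.repeat(lambda: func(s), number=number, repeat=repeat)
        )
    return results


# Synthetic stand-in for a word list: lowercase letters and newlines only.
sample = "\n".join(["aardvark", "zebra"] * 10_000)
timings = compare(sample)
```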

participants (8)
- Antonio Carlos Jorge Patricio
- Barry
- Chris Angelico
- David Mertz, Ph.D.
- Ethan Furman
- Marc-Andre Lemburg
- Paul Moore
- Richard Damon